CN117392247A - Image video semantic coding and decoding method and device based on sketch - Google Patents

Image video semantic coding and decoding method and device based on sketch

Info

Publication number
CN117392247A
CN117392247A (application CN202311246377.0A)
Authority
CN
China
Prior art keywords
tensor
sketch
image
semantic segmentation
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311246377.0A
Other languages
Chinese (zh)
Inventor
段一平
陶晓明
施林苏
余家忠
靳志娟
谢志鹏
杜其原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
China Tower Co Ltd
Original Assignee
Tsinghua University
China Tower Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, China Tower Co Ltd filed Critical Tsinghua University
Priority to CN202311246377.0A
Publication of CN117392247A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions

Abstract

The disclosure provides a sketch-based image and video semantic coding and decoding method and device, relating to the technical field of data processing. The method comprises the following steps: acquiring an original image; acquiring a sketch of the original image; inputting the original image into a semantic segmentation network to obtain a semantic segmentation result; converting the sketch into a two-dimensional tensor to obtain a sketch tensor; converting the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor; modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor; generating an encoded image according to the modified sketch tensor; and inputting the encoded image, as an additional condition, together with random noise into the generator of a generative adversarial network (GAN) to obtain a decoded image.

Description

Image video semantic coding and decoding method and device based on sketch
Technical Field
The disclosure relates to the technical field of data processing, in particular to an image video semantic coding and decoding method and device based on a sketch.
Background
Classical image and video codecs compress the image or video into a bitstream, package the bitstream together with the parameters needed for decoding into data packets, and transmit the packets to the receiving end over a channel; the receiving end parses the received bitstream, determines the decoding parameters, and then decodes. During image compression, an image is usually divided into a number of blocks and each block is encoded separately, which can cause blurring distortion and block artifacts that greatly degrade the user experience.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method and apparatus for image video semantic coding and decoding based on sketches, so as to overcome or at least partially solve the above problems.
In a first aspect of an embodiment of the present disclosure, there is provided an image video semantic coding and decoding method based on a sketch, the method including:
acquiring an original image, wherein the original image is a video frame or other images;
acquiring a sketch map of the original image, wherein each sketch point of the sketch map is used for representing an object structure;
inputting the original image into a semantic segmentation network to obtain a semantic segmentation result, wherein the semantic segmentation result is used for determining the category of each pixel point of the original image;
converting the sketch into a two-dimensional tensor to obtain a sketch tensor, wherein tensor elements in the sketch tensor represent whether each pixel point of the original image is a sketch point;
converting the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor, wherein tensor elements in the segmentation result tensor represent the category to which each pixel point of the original image belongs;
modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor, wherein the modified sketch tensor represents the category to which each sketch point belongs;
generating an encoded image according to the modified sketch tensor;
and taking the encoded image as an additional condition, inputting the additional condition and random noise into a generator in a generative adversarial network to obtain a decoded image.
Optionally, modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor, including:
acquiring categories to which pixel points corresponding to each tensor element representing sketch points in the sketch map tensor belong according to the segmentation result tensor;
and modifying each tensor element representing the sketch point in the sketch map tensor according to the category to which the pixel point corresponding to each tensor element representing the sketch point in the sketch map tensor belongs, so as to obtain the modified sketch map tensor.
Optionally, the training step of the semantic segmentation network at least includes:
acquiring a plurality of image samples, wherein the plurality of image samples carry semantic segmentation labels;
inputting the plurality of image samples into an initial semantic segmentation network, and performing downsampling and convolution on the plurality of image samples by the initial semantic segmentation network to obtain high-dimensional features of the plurality of image samples;
carrying out dilated (atrous) convolution on the high-dimensional features of the plurality of image samples to obtain an initial semantic segmentation result of each image sample;
determining a first loss function value according to the initial semantic segmentation result and the semantic segmentation labels of each image sample;
training the initial semantic segmentation network based on the first loss function value to obtain the trained semantic segmentation network.
Optionally, the training step of the generative adversarial network includes at least:
acquiring coded image samples corresponding to the plurality of image samples;
taking the encoded image samples as additional condition samples, inputting the additional condition samples and random noise into an initial generator in an initial generative adversarial network to obtain a plurality of decoded image samples;
inputting the plurality of decoded image samples into an initial discriminator in the initial generative adversarial network to obtain a first discrimination result, wherein the first discrimination result indicates whether the decoded image samples are real images;
inputting a plurality of coded image samples into the initial discriminator to obtain a second discrimination result, wherein the second discrimination result indicates whether the coded image samples are real images;
determining a second loss function value according to the encoded image sample, the decoded image sample, the first discrimination result and the second discrimination result;
training the initial generative adversarial network based on the second loss function value to obtain a trained generative adversarial network, wherein the trained generative adversarial network comprises the trained generator.
Optionally, the acquiring the sketch of the original image includes:
performing edge detection on the original image to obtain a structural feature map of the original image;
and performing non-maximum suppression on the structural feature map of the original image to obtain a sketch of the original image, wherein the sketch preserves the object contours and texture features of the original image.
In a second aspect of the embodiments of the present disclosure, there is provided an image video semantic codec apparatus based on a sketch, the apparatus including:
the image acquisition module is used for acquiring an original image, wherein the original image is a video frame or other images;
the sketch acquisition module is used for acquiring a sketch of the original image, and each sketch point of the sketch is used for representing an object structure;
the segmentation result acquisition module is used for inputting the original image into a semantic segmentation network to obtain a semantic segmentation result, and the semantic segmentation result is used for determining the category of each pixel point of the original image;
the first conversion module is used for converting the sketch into a two-dimensional tensor to obtain a sketch tensor, wherein tensor elements in the sketch tensor represent whether each pixel point of the original image is a sketch point;
the second conversion module is used for converting the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor, wherein tensor elements in the segmentation result tensor represent the category to which each pixel point of the original image belongs;
the modification module is used for modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor, wherein the modified sketch tensor represents the category to which each sketch point belongs;
the generation module is used for generating an encoded image according to the modified sketch tensor;
and the decoding module is used for taking the encoded image as an additional condition and inputting the additional condition and random noise into a generator in a generative adversarial network to obtain a decoded image.
Optionally, the modification module is specifically configured to:
acquiring categories to which pixel points corresponding to each tensor element representing sketch points in the sketch map tensor belong according to the segmentation result tensor;
and modifying each tensor element representing the sketch point in the sketch map tensor according to the category to which the pixel point corresponding to each tensor element representing the sketch point in the sketch map tensor belongs, so as to obtain the modified sketch map tensor.
Optionally, the training step of the semantic segmentation network at least includes:
acquiring a plurality of image samples, wherein the plurality of image samples carry semantic segmentation labels;
inputting the plurality of image samples into an initial semantic segmentation network, and performing downsampling and convolution on the plurality of image samples by the initial semantic segmentation network to obtain high-dimensional features of the plurality of image samples;
carrying out dilated (atrous) convolution on the high-dimensional features of the plurality of image samples to obtain an initial semantic segmentation result of each image sample;
determining a first loss function value according to the initial semantic segmentation result and the semantic segmentation labels of each image sample;
training the initial semantic segmentation network based on the first loss function value to obtain the trained semantic segmentation network.
Optionally, the training step of the generative adversarial network includes at least:
acquiring coded image samples corresponding to the plurality of image samples;
taking the encoded image samples as additional condition samples, inputting the additional condition samples and random noise into an initial generator in an initial generative adversarial network to obtain a plurality of decoded image samples;
inputting the plurality of decoded image samples into an initial discriminator in the initial generative adversarial network to obtain a first discrimination result, wherein the first discrimination result indicates whether the decoded image samples are real images;
inputting a plurality of coded image samples into the initial discriminator to obtain a second discrimination result, wherein the second discrimination result indicates whether the coded image samples are real images;
determining a second loss function value according to the encoded image sample, the decoded image sample, the first discrimination result and the second discrimination result;
training the initial generative adversarial network based on the second loss function value to obtain a trained generative adversarial network, wherein the trained generative adversarial network comprises the trained generator.
Optionally, the image acquisition module is specifically configured to:
performing edge detection on the original image to obtain a structural feature map of the original image;
and performing non-maximum suppression on the structural feature map of the original image to obtain a sketch of the original image, wherein the sketch preserves the object contours and texture features of the original image.
In a third aspect of the disclosed embodiments, there is provided an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute instructions to implement the sketch-based image video semantic codec method of the first aspect.
A fourth aspect of embodiments of the present disclosure provides a computer-readable storage medium whose instructions, when executed by a processor of an electronic device, cause the electronic device to perform the sketch-based image video semantic codec method of the first aspect.
Embodiments of the present disclosure include the following advantages:
in the embodiment of the disclosure, a sketch of an original image and a semantic segmentation result are obtained, wherein each sketch point of the sketch represents object structure and the semantic segmentation result determines the category to which each pixel point of the original image belongs. In this way, the encoded image generated from the sketch and the semantic segmentation result exploits both the structural information and the semantic information of the image. Further, when the generator produces the decoded image with the encoded image as an additional condition, the decoded image is of high quality and retains the structural and semantic information of the image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure; other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart of steps of a sketch-based image video semantic codec method in an embodiment of the present disclosure;
FIG. 2 is a block diagram of a sketch-based image video semantic codec method in an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an image video semantic codec device based on a sketch in an embodiment of the disclosure.
Detailed Description
In order that the above-recited objects, features and advantages of the present disclosure will become more readily apparent, a more particular description of the disclosure will be rendered by reference to the appended drawings and appended detailed description.
Images, as one of the basic carriers of multimedia information, have been widely studied in terms of generation, storage, transmission, processing, and understanding. With the continuous development of multimedia and Internet technologies, the enormous growth of multimedia traffic has created new demands for high-resolution, low-latency image transmission, while the rapid development of 5G communication and deep learning has opened new research directions for image processing and for more efficient, higher-quality image transmission. At present, owing to bandwidth limitations and the limited stability of mobile terminal devices, when a large number of pictures are transmitted, only low-resolution, low-bit-count preview pictures are transmitted first in order to save bandwidth and reduce latency. This approach, however, greatly degrades the user experience, because the resolution of the preview picture cannot be chosen according to the user's needs. Conventional image transmission methods therefore may not fully meet current multimedia transmission requirements, and transmission methods with higher compression ratio, higher speed, and higher quality are needed.
Conventional image codecs generally encode and decode an image by block compression and transform-domain coding, without exploiting the value of the image's structural and semantic information. Widely used codec frameworks such as JPEG do not tap the potential of structural and semantic information to improve coding; they merely compress the image into binary data using classical algorithms such as the discrete cosine transform and Huffman coding.
In order to solve the technical problems, an embodiment of the disclosure provides an image video semantic coding and decoding method based on a sketch.
Referring to fig. 1, a step flowchart of an image video semantic coding and decoding method based on a sketch in an embodiment of the disclosure is shown, and as shown in fig. 1, the image video semantic coding and decoding method based on a sketch specifically may include steps S11 to S18.
Step S11: an original image is acquired.
The original image is a video frame or other image.
The image video semantic coding and decoding method based on the sketch provided by the embodiment of the disclosure can be used for coding and decoding images and also can be used for coding and decoding videos. In encoding and decoding video, the video may be converted into a plurality of video frames, and then each video frame is encoded and decoded.
The original image may be any image of a video frame, a photograph, a drawing, etc.
Step S12: and obtaining the sketch of the original image.
Each sketch point of the sketch is used for representing the object structure.
The sketch of the original image is an image of the same size as the original image, in which only the sketch points describing the structure of objects in the original image are retained. The sketch points can be determined using a feature extraction module with an hourglass structure, which contains equal numbers of upsampling and downsampling sub-modules and several convolutional layers of different scales; the last convolutional layer has a single convolution kernel and is followed by a sigmoid activation function, yielding the probability that each pixel point of the original image is a sketch point. A number of sketch points are then selected from the pixel points of the original image according to these probabilities, and the sketch of the original image is generated from them.
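As the paragraph above describes, the last single-kernel convolutional layer followed by a sigmoid yields a per-pixel probability of being a sketch point, from which the sketch points are selected. A minimal numpy sketch of just this final step (the logit values and the 0.5 threshold are illustrative assumptions, not taken from the disclosure):

```python
import numpy as np

def sigmoid(z):
    """Map real-valued logits to (0, 1) probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

# logits from the (hypothetical) last single-kernel conv layer, one per pixel
logits = np.array([[ 3.0, -2.0],
                   [-1.0,  4.0]])
prob = sigmoid(logits)                         # probability each pixel is a sketch point
sketch_points = (prob > 0.5).astype(np.uint8)  # 1 = sketch point, 0 = not
# → [[1, 0],
#    [0, 1]]
```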
Alternatively, as an embodiment, acquiring the sketch of the original image may include: performing edge detection on the original image to obtain a structural feature map of the original image; and performing non-maximum suppression on the structural feature map to obtain a sketch of the original image, wherein the sketch preserves the object contours and texture features of the original image.
The structural feature map of each object in the original image can be extracted with an edge detection algorithm. Each pixel point in this map is a pixel of the original image with an obvious change in color or brightness. When extracting the structural features with an edge detection algorithm, the pixel points describing the outline of each object can first be roughly detected, then connected through a number of connection rules; finally, boundary points not recognized before are detected and connected, and falsely detected pixel points and boundary points are removed to form a complete edge. The structural features may be extracted with edge detection operators such as Sobel (a commonly used edge detection operator), Prewitt (a first-order differential operator), Roberts (a local differential operator for finding edges), or Canny (a multi-stage edge detection algorithm).
After the structural feature map of the original image is extracted, single-pixel-wide edges can be obtained by non-maximum suppression. Non-maximum suppression examines the edge strength of each pixel along the gradient direction to determine whether the pixel is a local maximum; non-maximum points are suppressed, yielding a single-pixel edge strength map. The sketch points of the original image's sketch can then be obtained from the probability that each pixel belongs to a sketch point. The sketch preserves the object contours and texture features of the original image.
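The edge-detection-plus-non-maximum-suppression pipeline described above can be sketched in a few dozen lines of numpy. The following toy implementation (Sobel gradients, direction-quantized non-maximum suppression, and a simple threshold; all parameter choices are illustrative assumptions, not the disclosure's method) extracts a binary sketch from a synthetic image with one vertical edge:

```python
import numpy as np

def sobel_gradients(img):
    """Gradient magnitude and direction from 3x3 Sobel kernels (naive loops for clarity)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy), np.arctan2(gy, gx)

def non_max_suppress(mag, ang):
    """Keep a pixel only if it is a local maximum along its (quantized) gradient direction."""
    h, w = mag.shape
    out = np.zeros_like(mag)
    d = np.round(ang / (np.pi / 4)).astype(int) % 4  # 4 directions mod pi
    offs = {0: (0, 1), 1: (1, 1), 2: (1, 0), 3: (1, -1)}
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            di, dj = offs[d[i, j]]
            if mag[i, j] >= mag[i + di, j + dj] and mag[i, j] >= mag[i - di, j - dj]:
                out[i, j] = mag[i, j]
    return out

# toy image: bright region on the right, so there is one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 255.0
mag, ang = sobel_gradients(img)
edges = non_max_suppress(mag, ang)
sketch = (edges > 0.5 * edges.max()).astype(np.uint8)  # binary sketch points
```

The thinned edge survives only along the brightness transition, which is the single-pixel behavior non-maximum suppression is meant to produce.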
Step S13: inputting the original image into a semantic segmentation network to obtain a semantic segmentation result.
The semantic segmentation result is used for determining the category to which each pixel point of the original image belongs.
The semantic segmentation result of the original image may be represented as an image of the same size as the original image. Semantic segmentation of the original image determines the category to which each pixel point belongs, so that a computer can understand the objects in the original image more accurately and the semantic information of the image can be better utilized. The categories may be set according to requirements; for example, when the original image is a street-view photo, the categories may include: background, vehicles, pedestrians, plants, roads, and the like.
The original image can be semantically segmented using a structured random forest, an SVM (Support Vector Machine), or similar methods, or by deep learning. A semantic segmentation network can also be trained, and the original image input into it to obtain the semantic segmentation result, which characterizes the category to which each pixel point of the original image belongs. The training method of the semantic segmentation network is described in detail later.
Step S14: converting the sketch into a two-dimensional tensor to obtain a sketch tensor.
Tensor elements in the sketch tensor represent whether each pixel point of the original image is a sketch point.
A two-dimensional tensor is equivalent to a matrix, and the sketch may be converted into a two-dimensional tensor according to a predetermined rule. For example, if the sketch is 100×100, the sketch tensor may also be 100×100, with each tensor element corresponding to one pixel point of the sketch: an element whose corresponding pixel is a sketch point may be set to 1, and an element whose corresponding pixel is not a sketch point may be set to 0.
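The conversion in step S14 can be sketched as follows; assuming the sketch points are available as (row, column) coordinates (an illustrative input format, not specified in the disclosure), the binary 0/1 encoding matches the 100×100 example above:

```python
import numpy as np

def sketch_to_tensor(sketch_points, height, width):
    """Rasterize (row, col) sketch points into a binary 2-D tensor:
    1 where the pixel is a sketch point, 0 elsewhere."""
    t = np.zeros((height, width), dtype=np.int64)
    for r, c in sketch_points:
        t[r, c] = 1
    return t

tensor = sketch_to_tensor([(0, 1), (2, 3)], 4, 4)
```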
Step S15: converting the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor.
Tensor elements in the segmentation result tensor represent the category to which each pixel point of the original image belongs.
The semantic segmentation result may be converted into a two-dimensional tensor according to a predetermined rule. For example, if the semantic segmentation result contains 6 categories, they may be represented by the numbers 0 to 5, with the background category represented as 0; the semantic segmentation result can then be expressed as a two-dimensional segmentation result tensor of the same size, in which each tensor element takes a value from 0 to 5, namely the number of the category to which the corresponding pixel point belongs.
Step S16: modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor.
The modified sketch tensor represents the category to which each sketch point belongs.
The size of the segmentation result tensor and the size of the sketch tensor are the same as the size of the original image; the tensor elements in the segmentation result tensor are in one-to-one correspondence with the pixels in the original image, and the tensor elements in the sketch tensor are in one-to-one correspondence with the pixels in the original image. Therefore, tensor elements in the segmentation result tensor are in one-to-one correspondence with tensor elements in the sketch tensor.
The modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor may include: acquiring categories to which pixel points corresponding to each tensor element representing sketch points in the sketch map tensor belong according to the segmentation result tensor; and modifying each tensor element representing the sketch point in the sketch map tensor according to the category to which the pixel point corresponding to each tensor element representing the sketch point in the sketch map tensor belongs, so as to obtain the modified sketch map tensor.
The modification of the sketch tensor according to the segmentation result tensor may be to modify the values of tensor elements representing sketch points in the sketch tensor into the values of corresponding tensor elements in the segmentation result tensor.
For example, a tensor element of the sketch tensor is 0 or 1, where 1 indicates that the corresponding pixel point is a sketch point and 0 indicates that it is not. If the pixel point corresponding to a sketch point belongs to the category represented by the number 5, the corresponding tensor element in the sketch tensor is changed from 1 to 5. With the background category mapped to 0, a non-zero element of the modified sketch tensor indicates that the corresponding pixel point is a sketch point, and its value gives the category to which that sketch point belongs.
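The modification in step S16 reduces to a single element-wise operation: wherever the sketch tensor marks a sketch point, take the class number from the segmentation result tensor; elsewhere keep 0, the background number. A minimal numpy illustration using the 0-to-5 class numbering from the example above:

```python
import numpy as np

# binary sketch tensor: 1 = sketch point, 0 = not a sketch point
sketch = np.array([[0, 1],
                   [1, 0]])
# segmentation result tensor: class number (0..5) per pixel, 0 = background
seg = np.array([[2, 5],
                [3, 1]])

# modified sketch tensor: sketch points carry their class number, others stay 0
modified = np.where(sketch == 1, seg, 0)
# → [[0, 5],
#    [3, 0]]
```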
Step S17: generating an encoded image according to the modified sketch tensor.
The modified sketch tensor is restored to an image, yielding the encoded image.
Step S18: taking the encoded image as an additional condition, inputting the additional condition and random noise into the generator of a generative adversarial network to obtain a decoded image.
The encoded image is input, as an additional condition, to the generator of the generative adversarial network together with random noise, and the decoded image output by the generator is obtained. The generative adversarial network comprises a generator and a discriminator, which are trained adversarially against each other. The training method of the generative adversarial network is described in detail below.
By adopting the technical solution of the embodiments of the present disclosure, the sketch map and the semantic segmentation result of the original image are obtained, where each sketch point of the sketch map represents object structure, and the semantic segmentation result determines the category to which each pixel of the original image belongs. In this way, the encoded image generated from the sketch map and the semantic segmentation result carries both image structure information and image semantic information. Further, when the decoded image is generated by the generator with the encoded image as an additional condition, the image quality of the generated decoded image is high, and the image structure information and image semantic information are retained.
Alternatively, as an embodiment, the training step of the semantic segmentation network may at least include: acquiring a plurality of image samples, where the plurality of image samples carry semantic segmentation labels; inputting the plurality of image samples into an initial semantic segmentation network, which downsamples and convolves them to obtain their high-dimensional features; performing dilated (atrous) convolution on the high-dimensional features of the plurality of image samples to obtain an initial semantic segmentation result of each image sample; determining a first loss function value according to the initial semantic segmentation result and the semantic segmentation label of each image sample; and training the initial semantic segmentation network based on the first loss function value to obtain the trained semantic segmentation network.
An image sample may be an image from an open-source semantic segmentation dataset on the Internet, or an image manually annotated with a semantic segmentation label. The initial semantic segmentation network refers to the semantic segmentation network before training, and the initial semantic segmentation result refers to the result of semantic segmentation performed on an image sample by the not-yet-trained network. Inputting an image sample into the initial semantic segmentation network yields the initial semantic segmentation result output by the initial semantic segmentation network.
The initial semantic segmentation network may be a deep learning network based on the DeepLabv3 algorithm. It may comprise a downsampling module and convolution layers; the image samples are input into the downsampling module and the convolution layers, which downsample and convolve them to obtain their high-dimensional features. The initial semantic segmentation network then performs dilated (atrous) convolution on these high-dimensional features, which enlarges the receptive field without reducing the resolution of the features, captures context information over a larger range, and improves the fineness of the semantic segmentation result.
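The receptive-field effect of dilated convolution can be illustrated with a small calculation (a sketch using the standard effective-kernel-size relation, which is not stated explicitly in the patent):

```python
def effective_kernel(k, d):
    """Side length covered by a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# Dilation enlarges the area each output sees without adding parameters or
# downsampling: a 3x3 kernel with rate 2 covers a 5x5 area, rate 4 covers 9x9.
sizes = [effective_kernel(3, d) for d in (1, 2, 4)]
```

Stacking such layers with growing rates is how DeepLabv3-style networks gather wide context while keeping feature resolution.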
The network parameters are trained by stochastic gradient descent. Specifically, after the initial semantic segmentation result of each image sample is obtained, a first loss function value is determined from the initial semantic segmentation result and the semantic segmentation label of each image sample, and the initial semantic segmentation network is trained based on the first loss function value to obtain the trained semantic segmentation network.
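The stochastic-gradient-descent update can be illustrated on a toy one-parameter model (purely illustrative; the actual network, loss and learning rate are not specified at this level of detail in the patent):

```python
def sgd_step(w, x, t, lr=0.1):
    # Gradient of the squared-error loss L(w) = (w * x - t) ** 2 with respect to w
    grad = 2.0 * (w * x - t) * x
    return w - lr * grad

# Repeated updates drive w toward the value that fits the sample (here t / x = 1.0)
w = 0.0
for _ in range(50):
    w = sgd_step(w, x=1.0, t=1.0)
```

In the real network the same idea applies per mini-batch: compute the loss on sampled image/label pairs, backpropagate, and move every parameter a small step against its gradient.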
Optionally, as an embodiment, the training step of the generative adversarial network includes at least: acquiring encoded image samples corresponding to the plurality of image samples; taking the encoded image samples as additional condition samples, inputting the additional condition samples and random noise into an initial generator in an initial generative adversarial network to obtain a plurality of decoded image samples; inputting the plurality of decoded image samples into an initial discriminator in the initial generative adversarial network to obtain a first discrimination result, where the first discrimination result represents whether a decoded image sample is a real image; inputting the plurality of encoded image samples into the initial discriminator to obtain a second discrimination result, where the second discrimination result represents whether an encoded image sample is a real image; determining a second loss function value according to the encoded image samples, the decoded image samples, the first discrimination result and the second discrimination result; and training the initial generative adversarial network based on the second loss function value to obtain a trained generative adversarial network, which includes the trained generator.
The image samples used to train the generative adversarial network may be any images, and may be the same as or different from the image samples used to train the semantic segmentation network.
The method of acquiring the encoded image samples corresponding to the plurality of image samples may follow the method of acquiring the encoded image corresponding to the original image: obtain the sketch map and semantic segmentation result of each image sample, then its sketch tensor and segmentation result tensor, then the modified sketch tensor, and finally restore the modified sketch tensor of the image sample into an image to obtain the encoded image sample corresponding to that image sample.
The generative adversarial network may be implemented with the PyTorch (open-source machine learning library) framework. The generative adversarial network comprises a generator G and a discriminator D. The main function of the generator is to learn the target probability distribution through the training process: with x denoting the random noise at the generator input and y the output target signal, the generation process can be expressed as y = G(x). The target signal in the embodiments of the present disclosure is the decoded image. The generative adversarial network of the embodiments of the present disclosure is a conditional generative adversarial network: by inputting an additional condition, the generator can grasp the probability distribution of the target signal more accurately, a process that can be expressed as y = G(x|z), where x denotes the random noise at the input end, z the additional condition, and y the output of the generator under that condition. Using G(·) to denote a decoded image sample output by the initial generator, where "·" denotes the input information, and D(·) to denote a discrimination result output by the initial discriminator, inputting a decoded image sample into the initial discriminator yields the first discrimination result, and inputting an encoded image sample into the initial discriminator yields the second discrimination result. From the encoded image sample, the decoded image sample, the first discrimination result and the second discrimination result, the second loss function value can be obtained as:

L_cGAN(G, D) = E[log D(z, y)] + E[log(1 - D(z, G(x|z)))]

where E[·] denotes the expectation; the meaning of the remaining symbols is as described above.
Training the initial generative adversarial network based on the second loss function value yields a trained generative adversarial network, which includes a trained generator.
The optimization objective of the generative adversarial network is for the output of the generator to pass the test of the discriminator; an additional term is further added to the loss function so that the network output is closer to the true target value. Therefore, the L1 loss can also be selected as a traditional loss function to measure the difference between the output value of the generator and the true value. The L1 loss function can be expressed as:

L_L1(G) = E[||y - G(x|z)||_1]
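Assuming the standard conditional-GAN formulation above, the two loss terms can be estimated over a batch as follows (an illustrative sketch; how the two terms are weighted against each other is not given in the patent):

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    # Batch estimate of E[log D(z, y)] + E[log(1 - D(z, G(x|z)))],
    # where d_real / d_fake are discriminator outputs in (0, 1).
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def l1_loss(y_true, y_fake):
    # Batch estimate of E[||y - G(x|z)||_1]
    return np.mean(np.abs(y_true - y_fake))
```

The discriminator is trained to maximize the adversarial term while the generator minimizes it, with the L1 term added to the generator's objective to pull its output toward the ground-truth image.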
Random noise can be introduced by randomly deactivating neurons (dropout); introducing noise in this way makes the output of the generator more stable and reduces noise at the output end. To construct a deep neural network, residual learning theory can also be used: in the generative adversarial network, let x be the input of a sub-network of the generator, y its output, and f(·) the mapping from x to y to be learned through the training process, so that y = f(x). According to deep residual learning theory, instead of having the sub-network learn the mapping f(·) directly, the sub-network learns the residual mapping:
r(x)=f(x)-x
Wherein r (x) characterizes the residual.
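This residual formulation can be made concrete with a minimal sketch (the array values are hypothetical, not from the patent):

```python
import numpy as np

def residual_block(x, r):
    # y = r(x) + x : the sub-network r only has to learn the residual mapping
    return r(x) + x

# When the target mapping is the identity, r merely has to output zeros,
# which is far easier to learn than reproducing the identity through
# cascaded nonlinear layers.
x = np.array([1.0, 2.0, 3.0])
y = residual_block(x, lambda v: np.zeros_like(v))
```

The skip connection also lets front-end features (tone, main contours) flow unchanged past several layers, which is the cascading behavior described below.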
The residual mapping is easier to obtain through training than the target mapping. The target mapping can then be recovered from the residual mapping as f(x) = r(x) + x. If several nonlinear layers are cascaded in a complete deep neural network model, an identity mapping is difficult for those cascaded nonlinear layers to learn, which causes output accuracy to degrade as the network deepens; changing the mapping to be learned into a residual mapping avoids this problem. In addition, through skip connections between neural network layers, later layers can directly receive feature information output at the front end of the network; for example, low-level information such as image tone and main contours can span several layers of the network, improving the performance of the layers that receive this front-end feature information. The training process of the generative adversarial network is carried out according to the following steps:
(a) Constructing a deep neural network and various required data preprocessing functions by using a Pytorch framework;
(b) Initializing the parameters of each layer of the generative adversarial network; obtaining encoded image samples from the dataset through the data preprocessing module and the encoding module; inputting the encoded image samples into the decoding module; and optimizing the network parameters according to the Adam algorithm and the loss function;
(c) Evaluating the quality of the decoded images with the evaluation module, and saving the model obtained through training.
Fig. 2 is a block diagram of the sketch-based image and video semantic encoding and decoding method in an embodiment of the present disclosure. The original image may be a video frame or any other image. The preprocessing module comprises a sketch extraction module and a semantic segmentation module: the sketch extraction module performs sketch extraction on the original image to obtain its sketch map, and the semantic segmentation module performs semantic segmentation on the original image to obtain its semantic segmentation result. The encoding module converts the sketch map and the semantic segmentation result into two-dimensional tensors respectively, then changes the value of each tensor element representing an object-structure sketch point into the semantic segmentation result at the corresponding position, obtaining the modified sketch tensor, from which the encoded image is obtained. The encoded image is input into the generative adversarial network; the decoding module is arranged in the generative adversarial network, and the generator in the decoding module generates a reconstructed image, which is the decoded image. Optionally, the reconstructed image may also be assessed by an evaluation module, which evaluates the image quality of the reconstructed image from the original image and the reconstructed image. The reconstructed image can be evaluated with three objective indexes, MSE (mean square error), PSNR (peak signal-to-noise ratio) and SSIM (structural similarity), which can be computed by calling the skimage interface in Python.
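Two of these three metrics can be computed directly with NumPy, as sketched below (in practice the document suggests calling the skimage interface, e.g. skimage.metrics; SSIM is considerably more involved and is omitted here):

```python
import numpy as np

def mse(a, b):
    # Mean square error between two images of identical shape
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    return np.mean((a - b) ** 2)

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio in dB; infinite for identical images
    m = mse(a, b)
    return float('inf') if m == 0 else 10.0 * np.log10(peak ** 2 / m)
```

Higher PSNR (and SSIM) and lower MSE indicate that the reconstructed image is closer to the original.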
By adopting the technical solution of the embodiments of the present disclosure, the facts that the sketch map preserves object structure information and the semantic segmentation result preserves image semantic information are fully utilized, and the two are combined into a novel image encoding and decoding method, improving the image compression rate. During semantic segmentation, dilated convolution is used to enlarge the feature receptive field and improve the fineness of the segmentation, and the network parameters are optimized by stochastic gradient descent in a supervised-learning manner, improving the stability and accuracy of the semantic segmentation network. The decoding module uses a generative adversarial network with an additional condition; when the network is built, inter-layer skip connections ensure feature transfer between different network layers, so that gradient vanishing or gradient explosion does not occur even when the network is very deep, and the loss function includes not only the classification loss of the discriminator but also the difference between the network output and the original image after quantization. The evaluation module evaluates the reconstructed image with the three objective indexes MSE, PSNR and SSIM, avoiding the deviation caused by subjective evaluation.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the disclosed embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the disclosed embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the disclosed embodiments.
Fig. 3 is a schematic structural diagram of an image video semantic codec device based on a sketch in an embodiment of the present disclosure, as shown in fig. 3, where the device includes an image acquisition module, a sketch acquisition module, a segmentation result acquisition module, a first conversion module, a second conversion module, a modification module, a generation module, and a decoding module, where:
the image acquisition module is used for acquiring an original image, wherein the original image is a video frame or other images;
the sketch acquisition module is used for acquiring a sketch of the original image, and each sketch point of the sketch is used for representing an object structure;
the segmentation result acquisition module is used for inputting the original image into a semantic segmentation network to obtain a semantic segmentation result, and the semantic segmentation result is used for determining the category of each pixel point of the original image;
the first conversion module is used for converting the sketch into a two-dimensional tensor to obtain a sketch tensor, and tensor elements in the sketch tensor are characterized by: whether each pixel point of the original image is the sketch point or not;
the second conversion module is used for converting the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor, and tensor element characterization in the segmentation result tensor is as follows: the categories to which each pixel point of the original image belongs;
The modification module is used for modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor, and the modified sketch tensor represents: each sketch point belongs to the category;
the generation module is used for generating a coded image according to the modified sketch tensor;
and the decoding module is used for inputting the additional condition and random noise into a generator in a generating countermeasure network by taking the encoded image as the additional condition to obtain a decoded image.
Optionally, the modification module is specifically configured to:
acquiring categories to which pixel points corresponding to each tensor element representing sketch points in the sketch map tensor belong according to the segmentation result tensor;
and modifying each tensor element representing the sketch point in the sketch map tensor according to the category to which the pixel point corresponding to each tensor element representing the sketch point in the sketch map tensor belongs, so as to obtain the modified sketch map tensor.
Optionally, the training step of the semantic segmentation network at least includes:
acquiring a plurality of image samples, wherein the plurality of image samples carry semantic segmentation tags;
inputting the plurality of image samples into an initial semantic segmentation network, and performing downsampling and convolution on the plurality of image samples by the initial semantic segmentation network to obtain high-dimensional features of the plurality of image samples;
Carrying out cavity convolution on the high-dimensional features of the plurality of image samples to obtain an initial semantic segmentation result of each image sample;
determining a first loss function value according to the initial semantic segmentation result and the semantic segmentation labels of each image sample;
training the initial semantic segmentation network based on the first loss function value to obtain the trained semantic segmentation network.
Optionally, the training step of generating the countermeasure network includes at least:
acquiring coded image samples corresponding to the plurality of image samples;
inputting the additional condition samples and random noise into an initial generator in an initial generation countermeasure network by taking the encoded image samples as additional condition samples to obtain a plurality of decoded image samples;
inputting a plurality of decoded image samples into an initial discriminator in the initial generation countermeasure network to obtain a first discrimination result, wherein the first discrimination result represents whether the decoded image samples are real images or not;
inputting a plurality of coded image samples into the initial discriminator to obtain a second discrimination result, wherein the second discrimination result represents whether the coded image samples are real images or not;
Determining a second loss function value according to the encoded image sample, the decoded image sample, the first discrimination result and the second discrimination result;
training the initial generated countermeasure network based on the second loss function value to obtain a trained generated countermeasure network, wherein the trained generated countermeasure network comprises the trained generator.
Optionally, the image acquisition module is specifically configured to:
performing edge detection on the original image to obtain a structural feature diagram of the original image;
and carrying out non-maximum value inhibition on the structural feature map of the original image to obtain a sketch map of the original image, wherein the sketch map of the original image keeps the object outline and texture features of the original image.
It should be noted that, the device embodiment is similar to the method embodiment, so the description is simpler, and the relevant places refer to the method embodiment.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the disclosed embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present disclosure may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present disclosure are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices, and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the disclosed embodiments have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the disclosed embodiments.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device comprising the element.
The image video semantic coding and decoding method and device based on sketch provided by the present disclosure are described in detail, and specific examples are applied to illustrate the principles and embodiments of the present disclosure, and the description of the above examples is only used for helping to understand the method and core ideas of the present disclosure; meanwhile, as one of ordinary skill in the art will have modifications in the specific embodiments and application scope in accordance with the ideas of the present disclosure, the contents of the present specification should not be construed as limiting the present disclosure in summary.

Claims (10)

1. An image video semantic coding and decoding method based on a sketch is characterized by comprising the following steps:
acquiring an original image, wherein the original image is a video frame or other images;
acquiring a sketch map of the original image, wherein each sketch point of the sketch map is used for representing an object structure;
inputting the original image into a semantic segmentation network to obtain a semantic segmentation result, wherein the semantic segmentation result is used for determining the category of each pixel point of the original image;
converting the sketch into a two-dimensional tensor to obtain a sketch tensor, wherein tensor elements in the sketch tensor are characterized by: whether each pixel point of the original image is the sketch point or not;
Converting the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor, wherein tensor elements in the segmentation result tensor represent: the categories to which each pixel point of the original image belongs;
modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor, wherein the modified sketch tensor represents: each sketch point belongs to the category;
generating a coded image according to the modified sketch tensor;
and taking the encoded image as an additional condition, inputting the additional condition and random noise into a generator in a generating countermeasure network, and obtaining a decoded image.
2. The method according to claim 1, wherein modifying the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor comprises:
acquiring categories to which pixel points corresponding to each tensor element representing sketch points in the sketch map tensor belong according to the segmentation result tensor;
and modifying each tensor element representing the sketch point in the sketch map tensor according to the category to which the pixel point corresponding to each tensor element representing the sketch point in the sketch map tensor belongs, so as to obtain the modified sketch map tensor.
3. The method according to claim 1, wherein the training step of the semantic segmentation network comprises at least:
acquiring a plurality of image samples, wherein the plurality of image samples carry semantic segmentation tags;
inputting the plurality of image samples into an initial semantic segmentation network, and performing downsampling and convolution on the plurality of image samples by the initial semantic segmentation network to obtain high-dimensional features of the plurality of image samples;
carrying out cavity convolution on the high-dimensional features of the plurality of image samples to obtain an initial semantic segmentation result of each image sample;
determining a first loss function value according to the initial semantic segmentation result and the semantic segmentation labels of each image sample;
training the initial semantic segmentation network based on the first loss function value to obtain the trained semantic segmentation network.
4. The method of claim 1, wherein the training step of generating the countermeasure network comprises at least:
acquiring coded image samples corresponding to the plurality of image samples;
inputting the additional condition samples and random noise into an initial generator in an initial generation countermeasure network by taking the encoded image samples as additional condition samples to obtain a plurality of decoded image samples;
Inputting a plurality of decoded image samples into an initial discriminator in the initial generation countermeasure network to obtain a first discrimination result, wherein the first discrimination result represents whether the decoded image samples are real images or not;
inputting a plurality of coded image samples into the initial discriminator to obtain a second discrimination result, wherein the second discrimination result represents whether the coded image samples are real images or not;
determining a second loss function value according to the encoded image sample, the decoded image sample, the first discrimination result and the second discrimination result;
training the initial generated countermeasure network based on the second loss function value to obtain a trained generated countermeasure network, wherein the trained generated countermeasure network comprises the trained generator.
5. The method of claim 1, wherein the acquiring a sketch of the original image comprises:
performing edge detection on the original image to obtain a structural feature diagram of the original image;
and carrying out non-maximum value inhibition on the structural feature map of the original image to obtain a sketch map of the original image, wherein the sketch map of the original image keeps the object outline and texture features of the original image.
6. A sketch-based image and video semantic encoding and decoding apparatus, comprising:
an image acquisition module, configured to acquire an original image, wherein the original image is a video frame or another image;
a sketch acquisition module, configured to acquire a sketch of the original image, wherein each sketch point of the sketch represents part of an object structure;
a segmentation result acquisition module, configured to input the original image into a semantic segmentation network to obtain a semantic segmentation result, wherein the semantic segmentation result determines the category to which each pixel of the original image belongs;
a first conversion module, configured to convert the sketch into a two-dimensional tensor to obtain a sketch tensor, wherein each tensor element in the sketch tensor indicates whether the corresponding pixel of the original image is a sketch point;
a second conversion module, configured to convert the semantic segmentation result into a two-dimensional tensor to obtain a segmentation result tensor, wherein each tensor element in the segmentation result tensor indicates the category to which the corresponding pixel of the original image belongs;
a modification module, configured to modify the sketch tensor according to the segmentation result tensor to obtain a modified sketch tensor, wherein the modified sketch tensor indicates the category to which each sketch point belongs;
a generation module, configured to generate an encoded image according to the modified sketch tensor;
and a decoding module, configured to input the encoded image, as an additional condition, together with random noise into a generator of a generative adversarial network to obtain a decoded image.
7. The apparatus of claim 6, wherein the modification module is specifically configured to:
acquire, according to the segmentation result tensor, the category to which the pixel corresponding to each tensor element representing a sketch point in the sketch tensor belongs;
and modify each tensor element representing a sketch point in the sketch tensor according to the category to which its corresponding pixel belongs, to obtain the modified sketch tensor.
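The modification module of claims 6 and 7 reduces to writing the segmentation class id into every tensor element that marks a sketch point. A minimal NumPy illustration follows; the specific class ids are made up, and the convention that 0 means "not a sketch point" is an assumption that requires real class ids to start at 1:

```python
import numpy as np

# Sketch map tensor: 1 where the pixel is a sketch point, 0 otherwise.
sketch = np.array([[0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])

# Segmentation result tensor: class id of every pixel (ids are illustrative).
seg = np.array([[1, 1, 2],
                [1, 2, 2],
                [3, 3, 3]])

# Keep 0 for non-sketch pixels; write the class id at each sketch point,
# so the modified tensor carries both structure and category per point.
modified = np.where(sketch == 1, seg, 0)
# modified:
# [[0 1 0]
#  [1 2 0]
#  [0 0 3]]
```

The modified tensor is what the generation module renders into the encoded image: a sparse map whose nonzero entries jointly locate each sketch point and name its semantic class.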
8. The apparatus of claim 6, wherein the training of the semantic segmentation network comprises at least:
acquiring a plurality of image samples, wherein the plurality of image samples carry semantic segmentation labels;
inputting the plurality of image samples into an initial semantic segmentation network, which downsamples and convolves the plurality of image samples to obtain high-dimensional features of the plurality of image samples;
performing dilated (atrous) convolution on the high-dimensional features of the plurality of image samples to obtain an initial semantic segmentation result for each image sample;
determining a first loss function value according to the initial semantic segmentation result and the semantic segmentation label of each image sample;
and training the initial semantic segmentation network based on the first loss function value to obtain the trained semantic segmentation network.
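The dilated convolution step ("cavity convolution" in the machine translation of 空洞卷积) spreads the taps of a small kernel apart, so the receptive field grows to (k−1)·dilation+1 without downsampling or extra parameters. A toy single-channel NumPy sketch, assuming no padding:

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=2):
    """Valid-mode dilated (atrous) 2-D convolution: kernel taps are sampled
    `dilation` pixels apart, enlarging the effective receptive field."""
    kh, kw = kernel.shape
    eff_h = (kh - 1) * dilation + 1   # effective kernel height
    eff_w = (kw - 1) * dilation + 1   # effective kernel width
    h, w = x.shape
    out = np.zeros((h - eff_h + 1, w - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slice picks every `dilation`-th pixel in the window.
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = (patch * kernel).sum()
    return out
```

With a 3x3 kernel and dilation 2, each output taps a 5x5 neighborhood while still multiplying only nine weights, which is why segmentation heads favor it for capturing context at full resolution.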
9. The apparatus of claim 6, wherein the training of the generative adversarial network comprises at least:
acquiring encoded image samples corresponding to the plurality of image samples;
inputting the encoded image samples, as additional condition samples, together with random noise into an initial generator of an initial generative adversarial network to obtain a plurality of decoded image samples;
inputting the plurality of decoded image samples into an initial discriminator of the initial generative adversarial network to obtain a first discrimination result, wherein the first discrimination result indicates whether a decoded image sample is a real image;
inputting the plurality of encoded image samples into the initial discriminator to obtain a second discrimination result, wherein the second discrimination result indicates whether an encoded image sample is a real image;
determining a second loss function value according to the encoded image samples, the decoded image samples, the first discrimination result, and the second discrimination result;
and training the initial generative adversarial network based on the second loss function value to obtain a trained generative adversarial network, wherein the trained generative adversarial network comprises the trained generator.
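One plausible reading of the second loss function in claim 9 — combining the two discrimination results with a reconstruction term between encoded and decoded samples — is a pix2pix-style conditional GAN objective. The claim does not specify the form, so the binary cross-entropy terms, the L1 reconstruction term, and the weight `lam` below are all assumptions:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on discriminator probabilities in (0, 1)."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

def second_loss(encoded, decoded, d_fake, d_real, lam=10.0):
    """Assumed second loss: adversarial terms from the two discrimination
    results plus an L1 reconstruction term between the sample pairs.
    d_fake: discriminator output on decoded samples (first result);
    d_real: discriminator output on encoded samples (second result)."""
    adv = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
    recon = np.abs(encoded - decoded).mean()   # pixel-wise L1 reconstruction
    return adv + lam * recon
```

Under this reading, the reconstruction term ties the generator to the conditioning sketch content while the adversarial terms push the decoded samples toward the real-image distribution.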
10. The apparatus of claim 6, wherein the sketch acquisition module is specifically configured to:
perform edge detection on the original image to obtain a structural feature map of the original image;
and perform non-maximum suppression on the structural feature map to obtain a sketch of the original image, wherein the sketch of the original image preserves the object contours and texture features of the original image.
CN202311246377.0A 2023-09-25 2023-09-25 Image video semantic coding and decoding method and device based on sketch Pending CN117392247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311246377.0A CN117392247A (en) 2023-09-25 2023-09-25 Image video semantic coding and decoding method and device based on sketch


Publications (1)

Publication Number Publication Date
CN117392247A 2024-01-12

Family

ID=89465733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311246377.0A Pending CN117392247A (en) 2023-09-25 2023-09-25 Image video semantic coding and decoding method and device based on sketch

Country Status (1)

Country Link
CN (1) CN117392247A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537292A (en) * 2018-04-10 2018-09-14 上海白泽网络科技有限公司 Semantic segmentation network training method, image, semantic dividing method and device
CN109523493A (en) * 2017-09-18 2019-03-26 杭州海康威视数字技术股份有限公司 A kind of image generating method, device and electronic equipment
CN111008973A (en) * 2018-10-05 2020-04-14 罗伯特·博世有限公司 Method, artificial neural network and device for semantic segmentation of image data
CN115375596A (en) * 2022-07-26 2022-11-22 西安电子科技大学 Face photo-sketch portrait synthesis method based on two-way condition normalization
CN116229061A (en) * 2023-01-08 2023-06-06 复旦大学 Semantic segmentation method and system based on image generation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIPING DUAN et al.: "Multimedia Semantic Communications: Representation, Encoding and Transmission", IEEE Network, vol. 37, no. 1, 27 April 2023, pages 44-50, XP011939592, DOI: 10.1109/MNET.001.2200468 *
YIPING DUAN et al.: "Panoramic Image Generation: From 2-D Sketch to Spherical Image", IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 1, 31 January 2020, XP011770026, DOI: 10.1109/JSTSP.2020.2968772 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination