WO2022222842A1

WO2022222842A1 - Dynamic image encoding and decoding methods, apparatus and device and storage medium

Info

Publication number: WO2022222842A1
Application number: PCT/CN2022/086880
Authority: WO
Inventors: 闫宁; 陈焕浜; 李照洋; 马飞龙; 宋星光; 周建同; 杨海涛; 李江
Original assignee: 华为技术有限公司
Priority date: 2021-04-19
Filing date: 2022-04-14
Publication date: 2022-10-27
Also published as: CN115225901A

Abstract

Embodiments of the present application disclose dynamic image encoding and decoding methods, an apparatus and device and a storage medium, which belong to the technical field of encoding and decoding. In the encoding method, semantic segmentation is performed on any image frame in a dynamic image to obtain an image segmentation mask, the dynamic image comprising multiple objects, and the image segmentation mask comprising multiple image regions in one-to-one correspondence to multiple objects; a moving image sequence is determined on the basis of the dynamic image, each image frame in the moving image sequence comprising an image region in which one or more moving objects among the multiple objects are located; position indication information is determined on the basis of the image segmentation mask, the position indication information being used to indicate the position of the image region in which the one or more moving objects are located; and the moving image sequence and the position indication information are encoded into a code stream. The embodiments of the present application improve encoding efficiency, and effectively reduce decoding complexity and power consumption.

Description

Dynamic image encoding and decoding methods, apparatus, device and storage medium

This application claims priority to Chinese patent application No. 202110421196.1, filed on April 19, 2021 and entitled "Encoding and decoding method, apparatus, device and storage medium for dynamic images", the entire contents of which are incorporated by reference in this application.

Technical field

The embodiments of the present application relate to the technical field of encoding and decoding, and in particular, to a method, apparatus, device, and storage medium for encoding and decoding a dynamic image.

Background

A dynamic image is a media format between a static image and a video. It is an image that switches a group of static images at a specified frequency to produce a dynamic effect. Compared with static images, dynamic images have multiple frames of images, and there are temporal correlations among the multiple frames of images. Compared with video, dynamic images have less inter-frame correlation and no fixed frame rate.

Compared with video, the current dynamic image codec has the characteristics of light weight and low power consumption, and does not use streaming transmission. At present, the most widely used dynamic image encoding and decoding method is the Graphics Interchange Format (GIF). Application requirements for dynamic images.

SUMMARY OF THE INVENTION

Embodiments of the present application provide a dynamic image encoding and decoding method, apparatus, device, and storage medium, which can improve encoding efficiency and reduce decoding complexity and power consumption. The technical solution is as follows:

A first aspect provides a method for encoding a dynamic image. In the method, semantic segmentation is performed on any frame of image in the dynamic image to obtain an image segmentation mask, where the dynamic image includes a plurality of objects and the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects. Based on the dynamic image, a moving image sequence is determined, and each frame of image in the moving image sequence includes an image area where one or more moving objects among the plurality of objects are located. Based on the image segmentation mask, position indication information is determined, and the position indication information is used to indicate the position of the image area in which the one or more moving objects are located. The moving image sequence and the position indication information are encoded into the code stream.

In a dynamic image, only the image area where a moving object is located changes, and the image area where a stationary object is located does not change. Each frame of image in the moving image sequence includes the image area where one or more moving objects among the multiple objects are located, and the position indication information indicates the position of that image area. Therefore, for a dynamic image, there is no need to encode the image area where a stationary object is located into the code stream, which improves the encoding efficiency.

Since the location area of each object in the dynamic image is basically unchanged, and only the object itself changes, the embodiment of the present application can perform semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask. Usually, the first frame image in the dynamic image can be semantically segmented to obtain an image segmentation mask.

In addition, since the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects, in order to distinguish the objects, the image regions corresponding to different objects are usually represented by different pixel values, and the image region corresponding to the same object is represented by the same pixel value.

It should be noted that each object in the dynamic image may be a single individual in the dynamic image. For example, in the case where the dynamic image includes the user, grass, hillside, river and sky, the multiple objects in the dynamic image include the user, grass, hillside, river and sky.

In addition, the plurality of objects included in the dynamic image are generally divided into moving objects and stationary objects. A moving object refers to an object that changes itself, and can also be called an object in a state of motion. For example, the water in the river in the dynamic image changes, and the user's facial features or limbs change, so the river and the user can be called moving objects. A stationary object refers to an object that does not change itself, and can also be called an object in a stationary state. For example, the grass, hillside and sky in a dynamic image do not change, so the grass, hillside and sky can be called stationary objects.
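The text does not say how an object is classified as moving or stationary. As a minimal sketch under that caveat, one possible heuristic is to compare the pixels inside each object's mask region across frames and treat the object as moving if any of them differ from the first frame. The names `is_moving`, `Frame` and `Mask`, the single-channel frames and the exact-equality test are all assumptions made for illustration.

```python
from typing import List

Frame = List[List[int]]  # frame[y][x]: one pixel value (single channel for brevity)
Mask = List[List[int]]   # mask[y][x]: label (pixel value) of the object covering that pixel


def is_moving(frames: List[Frame], mask: Mask, object_value: int) -> bool:
    """Return True if any pixel inside the object's mask region differs from the first frame.

    Assumed heuristic for illustration; the source does not specify the criterion.
    """
    first = frames[0]
    for frame in frames[1:]:
        for y, row in enumerate(mask):
            for x, label in enumerate(row):
                if label == object_value and frame[y][x] != first[y][x]:
                    return True
    return False


# Hypothetical 2x3 scene: label 0 = sky (unchanged), label 1 = river (changes).
mask = [[0, 0, 0],
        [1, 1, 1]]
frames = [
    [[9, 9, 9], [3, 4, 5]],
    [[9, 9, 9], [4, 5, 3]],
]
states = {label: is_moving(frames, mask, label) for label in (0, 1)}
```

A practical encoder would probably use a tolerance rather than exact equality, since sensor noise alone can change pixel values.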

It should be noted that the moving image sequence may include one or more sub-image sequences corresponding to the one or more moving objects, or may be the dynamic image itself. The position indication information may be the image segmentation mask, or may be the coordinates in the dynamic image of a specified position within the image region where each moving object of the one or more moving objects is located. The different cases are therefore described separately below.

In the first case, the moving image sequence includes one or more sub-image sequences, and the position indication information is an image segmentation mask.

In the first case, the implementation process of determining the moving image sequence based on the dynamic image includes: based on the image segmentation mask and the dynamic image, extracting one or more sub-image sequences, where the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects.

The sub-image sequence corresponding to each moving object is extracted in the same way. Therefore, a moving object can be selected from the one or more moving objects, and the sub-image sequence corresponding to the selected moving object can be determined according to the following operations, until the sub-image sequence corresponding to each moving object has been determined: based on the image segmentation mask, determine the location area where the selected moving object is located; based on that location area, extract the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame of image, to obtain the sub-image sequence corresponding to the selected moving object.

The image segmentation mask includes multiple image regions corresponding to the multiple objects; that is, the image segmentation mask has already delimited the image region where each of the multiple objects is located. As described above, the image area where the same object is located in the image segmentation mask is represented by the same pixel value, and the image areas where different objects are located are represented by different pixel values. Therefore, the implementation process of determining, based on the image segmentation mask, the location area where the selected moving object is located includes: scanning each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, where the pixel coordinate set includes the coordinates of multiple pixels. The location area formed by the pixel coordinate set corresponding to the selected moving object is determined as the location area where the selected moving object is located.

That is, by scanning each pixel in the image segmentation mask, the pixels whose pixel value equals the pixel value corresponding to the selected moving object are determined, and the coordinates of these pixels are determined as the pixel coordinate set corresponding to the selected moving object. The location area where the selected moving object is located can then be determined from this set. The location area refers to the actual location of the selected moving object, and the boundary of the location area is the outline of the selected moving object.
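The mask scan described above can be sketched as follows. This is only an illustration: the mask is assumed to be a list of rows of integer labels, and the function name `pixel_coordinate_set` and the example labels are hypothetical.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # mask[y][x] holds the pixel value (label) of one object


def pixel_coordinate_set(mask: Mask, object_value: int) -> Set[Tuple[int, int]]:
    """Scan every mask pixel and collect the (x, y) coordinates whose value matches the object."""
    coords = set()
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value == object_value:
                coords.add((x, y))
    return coords


# Hypothetical 4x4 mask: 0 = sky, 1 = river, 2 = user.
mask = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [1, 1, 2, 0],
    [1, 1, 1, 1],
]
user_pixels = pixel_coordinate_set(mask, 2)
river_pixels = pixel_coordinate_set(mask, 1)
```

The returned set is the irregular location area itself; its boundary corresponds to the outline of the object.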

Usually, the area formed by the outline of a moving object is an irregular area; that is, the location area where the moving object is located is not a regular area. Therefore, the image area within the location area where the moving object is located can be extracted directly from each frame of image in the dynamic image except the first frame of image. Of course, in other embodiments, the location area where the moving object is located can also be processed into a regular area, and then the image area within the regular area is extracted from each frame of image in the dynamic image except the first frame of image.

That is, the implementation process of extracting, based on the location area of the selected moving object, the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame of image includes: extracting, from each frame of image in the dynamic image except the first frame of image, the image area located within the location area where the selected moving object is located. Alternatively, the location area where the selected moving object is located is expanded so that the expanded location area is a square area, and the image area within the expanded location area is extracted from each frame of image in the dynamic image except the first frame of image.

It should be noted that there are various ways to expand the location area where the moving object is located. For example, the minimum abscissa, the minimum ordinate, the maximum abscissa and the maximum ordinate are determined from the pixel coordinate set corresponding to the selected moving object; then, the square area whose abscissa lies between the minimum abscissa and the maximum abscissa and whose ordinate lies between the minimum ordinate and the maximum ordinate is determined as the expanded location area. Alternatively, a square area circumscribing the location area is drawn directly based on the location area where the moving object is located, and the circumscribing square area is determined as the expanded location area.
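The min/max expansion and the per-frame extraction can be sketched as below. This is a sketch under assumptions: frames are single-channel lists of rows, the expanded area is treated as an axis-aligned bounding rectangle with inclusive corners, and the names `bounding_box`, `crop` and `extract_sub_image_sequence` are hypothetical.

```python
from typing import Iterable, List, Tuple

Frame = List[List[int]]          # frame[y][x]: one pixel value (single channel for brevity)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def bounding_box(coords: Iterable[Tuple[int, int]]) -> Box:
    """Expanded location area: min/max abscissa and ordinate of the pixel coordinate set."""
    pts = list(coords)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the image area inside the box out of one frame."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max + 1] for row in frame[y_min:y_max + 1]]


def extract_sub_image_sequence(frames: List[Frame], box: Box) -> List[Frame]:
    """Crop the expanded area from every frame except the first one."""
    return [crop(frame, box) for frame in frames[1:]]


# Three hypothetical 4x4 frames whose pixel value encodes (frame, y, x).
frames = [[[f * 100 + y * 10 + x for x in range(4)] for y in range(4)] for f in range(3)]
box = bounding_box({(1, 1), (2, 1), (2, 2)})
sub_sequence = extract_sub_image_sequence(frames, box)
```

Here the first frame is skipped because, as stated later in the text, it is encoded into the code stream in full.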

In the second case, the moving image sequence includes one or more sub-image sequences, and the location indication information includes coordinates of one or more designated locations.

In the second case, the implementation process of determining the moving image sequence based on the dynamic image includes: based on the image segmentation mask and the dynamic image, extracting one or more sub-image sequences, where the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects. In this case, the implementation process of determining the position indication information based on the image segmentation mask includes: based on the image segmentation mask, determining the coordinates in the dynamic image of the specified position within the image area where each moving object of the one or more moving objects is located.

For the content in the second case, reference may be made to the relevant description in the above-mentioned first case, which is not repeated in this embodiment of the present application.

It should be noted that the designated position in the image area where the moving object is located may be the position with the smallest coordinates, the position with the largest coordinates, or the position of the geometric center point. Certainly, other positions may also be used, which are not limited in this embodiment of the present application.
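The three candidate designated positions can be computed from the pixel coordinate set as in the sketch below. The interpretation is an assumption: "smallest" and "largest" are taken as the corners of the bounding rectangle, and the geometric center is taken as the integer midpoint of that rectangle. The function name `designated_position` and its `kind` parameter are hypothetical.

```python
from typing import Iterable, Tuple


def designated_position(coords: Iterable[Tuple[int, int]], kind: str = "min") -> Tuple[int, int]:
    """Return one reference point of the image area (assumed interpretation, see lead-in)."""
    pts = list(coords)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    if kind == "min":      # position with the smallest coordinates
        return min(xs), min(ys)
    if kind == "max":      # position with the largest coordinates
        return max(xs), max(ys)
    if kind == "center":   # geometric center of the bounding rectangle, rounded down
        return (min(xs) + max(xs)) // 2, (min(ys) + max(ys)) // 2
    raise ValueError(f"unknown kind: {kind}")


region = {(2, 3), (3, 3), (4, 3), (4, 7)}
top_left = designated_position(region, "min")
bottom_right = designated_position(region, "max")
center = designated_position(region, "center")
```

Whichever position is chosen, the encoding end and the decoding end have to agree on it, since the decoder uses it to place each sub-image.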

Optionally, in the second case, the number of the one or more moving objects may also be encoded into the code stream. In this way, for the decoding end, based on the number of the one or more moving objects, it can be determined whether there is a sub-image sequence that fails to transmit in the one or more sub-image sequences, thereby ensuring the reliability of dynamic image decoding.

In the third case, the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask.

In the fourth case, the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask. In this case, the method further includes: determining, based on the image segmentation mask, a plurality of segmentation regions in one-to-one correspondence with the plurality of objects; and dividing the image into regions based on the plurality of segmentation regions to obtain multiple image regions. An object state corresponding to each of the plurality of segmentation regions is determined, and the object state is either a static state or a motion state. Accordingly, the implementation process of encoding the moving image sequence into the code stream includes: encoding the plurality of image regions into the code stream. The method further includes: encoding the object state corresponding to each of the plurality of segmentation regions into the code stream.

The implementation process of determining, based on the image segmentation mask, a plurality of segmentation regions in one-to-one correspondence with the plurality of objects includes: determining, based on the image segmentation mask, the location area where each object of the plurality of objects is located. If the location area where any one of the multiple objects is located does not contain an integer number of coding tree units (CTUs), the boundary of the location area where that object is located is extended so that the location area contains an integer number of CTUs. The location areas where the multiple objects are located after the expansion processing are determined as the multiple segmentation regions.

That is, after the expansion processing, the location area where each object is located includes an integer number of CTUs, and the expanded location area can be determined as a segmentation region. In other words, each of the plurality of segmentation regions includes an integer number of CTUs.
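One way to perform this boundary extension is to snap each edge of the location area outward to the nearest CTU boundary. The sketch below assumes a rectangular location area, a CTU grid anchored at the picture origin, and picture dimensions that are multiples of the CTU size; the function name `align_to_ctu` and the CTU size of 64 used in the example are assumptions.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel coordinates


def align_to_ctu(box: Box, ctu_size: int, width: int, height: int) -> Box:
    """Extend a box outward so that every edge falls on a CTU boundary."""
    if width % ctu_size or height % ctu_size:
        raise ValueError("this sketch assumes picture dimensions are multiples of the CTU size")
    x_min, y_min, x_max, y_max = box
    if not (0 <= x_min <= x_max < width and 0 <= y_min <= y_max < height):
        raise ValueError("box must lie inside the picture")
    return (
        (x_min // ctu_size) * ctu_size,          # move left edge down to a CTU boundary
        (y_min // ctu_size) * ctu_size,          # move top edge up to a CTU boundary
        (x_max // ctu_size + 1) * ctu_size - 1,  # move right edge to the end of its CTU
        (y_max // ctu_size + 1) * ctu_size - 1,  # move bottom edge to the end of its CTU
    )


aligned = align_to_ctu((70, 10, 130, 60), ctu_size=64, width=256, height=128)
# aligned covers x in [64, 191] and y in [0, 63]: two CTUs wide and one CTU high
```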

In this case, the implementation process of encoding the multiple image areas into the code stream includes: encoding each of the multiple image areas into the code stream as a separate coding block. Alternatively, the area formed by each row of CTUs in each of the multiple image areas is encoded into the code stream as a coding block. In both cases, the location area of a reference coding block lies within the location area of the coding block that references it.

Since each image area includes an integer number of CTUs, the entire image area (tile) can be encoded into the code stream separately as a coding block, or the area (slice) formed by each row of CTUs in each image area can be encoded into the code stream separately as a coding block, so that it can be decoded separately during subsequent decoding.
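The two partitioning options can be illustrated with a short sketch that lists the coding blocks for one CTU-aligned image area. It assumes the area is given as an inclusive rectangle whose height is a multiple of the CTU size; the function name `coding_blocks` and the `per_ctu_row` flag are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def coding_blocks(area: Box, ctu_size: int, per_ctu_row: bool = False) -> List[Box]:
    """Split one CTU-aligned image area into independently coded blocks.

    per_ctu_row=False: the whole area is one block (tile-like).
    per_ctu_row=True:  each row of CTUs inside the area is one block (slice-like).
    """
    x_min, y_min, x_max, y_max = area
    if (y_max - y_min + 1) % ctu_size:
        raise ValueError("area height must be a multiple of the CTU size")
    if not per_ctu_row:
        return [area]
    return [(x_min, y, x_max, y + ctu_size - 1) for y in range(y_min, y_max + 1, ctu_size)]


area = (64, 0, 191, 127)  # two CTUs wide, two CTUs high for ctu_size=64
as_tile = coding_blocks(area, 64)
as_slices = coding_blocks(area, 64, per_ctu_row=True)
```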

In addition, decoding a coding block may require referring to a coding block in a frame of image before the current frame; that is, the decoding of a coding block in the current frame depends on the reference frame. Therefore, to decode successfully, the location area of the coding block in the reference frame must be restricted to lie within the location area of the coding block in the current frame, so that the current coding block can be decoded on the basis of the reference coding block.
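The restriction above amounts to a containment test between two location areas. The following sketch checks it for rectangular blocks with inclusive corners; the function name `reference_within_current` is hypothetical, and how an encoder enforces the restriction during motion search is not described in the text.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def reference_within_current(ref_area: Box, cur_area: Box) -> bool:
    """True if the reference block's location area lies inside the current block's area."""
    rx0, ry0, rx1, ry1 = ref_area
    cx0, cy0, cx1, cy1 = cur_area
    return cx0 <= rx0 and cy0 <= ry0 and rx1 <= cx1 and ry1 <= cy1


current = (64, 0, 191, 63)
allowed = reference_within_current((64, 0, 127, 63), current)     # stays inside the same area
rejected = reference_within_current((0, 0, 127, 63), current)     # reaches into a neighboring area
```

With this restriction, a region can be decoded without decoding any neighboring region of the reference frame.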

For the above four cases, the first frame image of the dynamic image can also be encoded into the code stream.

In a second aspect, a method for decoding a dynamic image is provided. In the method, a first frame image is parsed from a code stream, and a moving image sequence and position indication information are parsed from the code stream. Each frame of image in the moving image sequence includes an image area where one or more moving objects are located, and the position indication information is used to indicate the location of the image area where the one or more moving objects are located. Based on the moving image sequence and the position indication information, the image area where the one or more moving objects are located is rendered and displayed in the first frame of image to obtain the dynamic image.

That is, when decoding a dynamic image, after the first frame of image is decoded, only the image area where the moving object is located needs to be decoded for subsequent images, and there is no need to decode the image area where the still object is located, which effectively reduces decoding complexity and power consumption. Moreover, in the process of displaying the dynamic image, it is only necessary to render and refresh the image area where the moving object is located on the basis of the first frame of image, thereby effectively reducing the power consumption of the display.

In the first case, the moving image sequence includes one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with one or more moving objects. The position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions corresponding to a plurality of objects one-to-one, and the plurality of objects include the one or more moving objects.

In the first case, the implementation process of rendering and displaying, in the first frame of image, the image area where the one or more moving objects are located based on the moving image sequence and the position indication information includes: selecting a moving object from the one or more moving objects, and rendering and displaying the image area where the selected moving object is located according to the following operations, until the image area where each moving object is located has been rendered and displayed: based on the image segmentation mask, determine the position of the image area where the selected moving object is located; according to that position, render and display, in the first frame of image, the image area included in the sub-image sequence corresponding to the selected moving object.

The image segmentation mask includes multiple image regions corresponding to the multiple objects; that is, the image segmentation mask has already delimited the image region where each of the multiple objects is located. As described above, the image area where the same object is located in the image segmentation mask is represented by the same pixel value, and the image areas where different objects are located are represented by different pixel values. Therefore, the implementation process of determining, based on the image segmentation mask, the position of the image area where the selected moving object is located includes: scanning each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, where the pixel coordinate set includes the coordinates of a plurality of pixels. The location area formed by the pixel coordinate set is determined as the position of the image area where the selected moving object is located; alternatively, the location area formed by the pixel coordinate set is expanded so that the expanded location area is a square area, and the expanded location area is determined as the position of the image area where the selected moving object is located.

That is, by scanning each pixel in the image segmentation mask, the pixels whose pixel value equals the pixel value corresponding to the selected moving object are determined, and the coordinates of these pixels are determined as the pixel coordinate set corresponding to the selected moving object. The position in the dynamic image of the image area where the selected moving object is located can then be determined from this set.

Usually, the area formed by the outline of a moving object is an irregular area; that is, the location area formed by the pixel coordinate set is not a regular area. Therefore, in some embodiments, the location area formed by the pixel coordinate set corresponding to the moving object can be directly determined as the position in the dynamic image of the image area where the moving object is located. Of course, in other embodiments, the location area formed by the pixel coordinate set may also be processed into a regular area, and then the position of the regular area is determined as the position in the dynamic image of the image area where the moving object is located.
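For the irregular-area variant, decoder-side rendering can be sketched as copying only the pixels that the mask assigns to the selected object into a copy of the first frame. The sketch assumes that each decoded sub-image covers the bounding rectangle of the object's mask region and that frames are single-channel lists of rows; the function name `render_with_mask` is hypothetical.

```python
from typing import List

Frame = List[List[int]]  # frame[y][x]: one pixel value
Mask = List[List[int]]   # mask[y][x]: label (pixel value) of the object covering that pixel


def render_with_mask(first_frame: Frame, sub_image: Frame, mask: Mask, object_value: int) -> Frame:
    """Overlay one decoded sub-image onto a copy of the first frame, guided by the mask."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, label in enumerate(row) if label == object_value]
    out = [row[:] for row in first_frame]  # leave the decoded first frame untouched
    if not coords:
        return out
    x0 = min(x for x, _ in coords)  # top-left corner of the object's bounding rectangle
    y0 = min(y for _, y in coords)
    for x, y in coords:
        out[y][x] = sub_image[y - y0][x - x0]
    return out


mask = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [1, 1, 2, 0],
    [1, 1, 1, 1],
]
first_frame = [[0] * 4 for _ in range(4)]
sub_image = [[7, 8],
             [9, 5]]  # covers x in [1, 2], y in [1, 2]
displayed = render_with_mask(first_frame, sub_image, mask, object_value=2)
```

The pixel at (1, 2) belongs to another object in the mask, so it keeps its first-frame value even though the sub-image rectangle covers it.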

In the second case, the moving image sequence includes one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects. The position indication information includes the coordinates in the dynamic image of a specified position within the image area where each of the one or more moving objects is located.

In the second case, the implementation process of rendering and displaying, in the first frame of image, the image area where the one or more moving objects are located based on the moving image sequence and the position indication information includes: selecting a moving object from the one or more moving objects, and rendering and displaying the image area where the selected moving object is located according to the following operation, until the image area where each moving object is located has been rendered and displayed: based on the coordinates in the dynamic image of the specified position within the image area where the selected moving object is located, render and display, in the first frame of image, the image area included in the sub-image sequence corresponding to the selected moving object.

Since the encoding end directly encodes into the code stream the coordinates in the dynamic image of the specified position within the image area where each moving object is located, in the embodiment of the present application, after the coordinates of the specified position corresponding to the selected moving object are parsed from the code stream, the image area included in the sub-image sequence corresponding to the selected moving object can be directly rendered and displayed in the first frame of image, which improves the image reconstruction speed.
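When the position is signalled as coordinates, reconstruction reduces to pasting each decoded sub-image at those coordinates. The sketch below assumes rectangular sub-images and that the specified position is the top-left corner (the position with the smallest coordinates); the function names `paste_at` and `reconstruct` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # frame[y][x]: one pixel value


def paste_at(frame: Frame, sub_image: Frame, top_left: Tuple[int, int]) -> Frame:
    """Return a copy of the frame with sub_image written at the given top-left coordinates."""
    x0, y0 = top_left
    sub_h, sub_w = len(sub_image), len(sub_image[0])
    if x0 < 0 or y0 < 0 or y0 + sub_h > len(frame) or x0 + sub_w > len(frame[0]):
        raise ValueError("sub-image does not fit inside the frame at this position")
    out = [row[:] for row in frame]
    for dy, sub_row in enumerate(sub_image):
        out[y0 + dy][x0:x0 + sub_w] = sub_row
    return out


def reconstruct(first_frame: Frame, sub_sequence: List[Frame], top_left: Tuple[int, int]) -> List[Frame]:
    """Rebuild the displayed frames: the first frame, then one refreshed frame per sub-image."""
    return [first_frame] + [paste_at(first_frame, sub, top_left) for sub in sub_sequence]


first_frame = [[0] * 4 for _ in range(3)]
sub_sequence = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
frames = reconstruct(first_frame, sub_sequence, top_left=(1, 1))
```

No mask scan is needed at the decoding end in this variant, which is consistent with the faster reconstruction noted above.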

When the encoding end encodes the number of the one or more moving objects into the code stream, the embodiment of the present application may also parse the number of the one or more moving objects from the code stream. In this way, by comparing the number of the one or more moving objects with the number of the one or more sub-image sequences, it can be determined whether any of the one or more sub-image sequences failed to be transmitted, thereby improving the reliability of dynamic image decoding.
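The comparison described above can be sketched as a simple count check. The function name `lost_sub_sequence_count` is hypothetical, and what the decoder does when a sequence is missing (for example, requesting retransmission or freezing that object) is not specified in the text.

```python
from typing import List, Sequence


def lost_sub_sequence_count(signalled_object_count: int, parsed_sub_sequences: Sequence) -> int:
    """Number of sub-image sequences that were signalled but not received."""
    missing = signalled_object_count - len(parsed_sub_sequences)
    if missing < 0:
        raise ValueError("more sub-image sequences than signalled moving objects")
    return missing


received: List[str] = ["river_sequence", "user_sequence"]  # placeholders for decoded sequences
complete = lost_sub_sequence_count(2, received) == 0
lost = lost_sub_sequence_count(3, received)
```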

In the third case, the moving image sequence is the dynamic image, the position indication information is an image segmentation mask, the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, and the plurality of objects include the one or more moving objects.

In the third case, the implementation process of rendering and displaying, in the first frame of image, the image area where the one or more moving objects are located based on the moving image sequence and the position indication information includes: selecting a moving object from the one or more moving objects, and rendering and displaying the image area where the selected moving object is located according to the following operations, until the image area where each moving object is located has been rendered and displayed: based on the image segmentation mask, determine the position of the image area where the selected moving object is located; based on that position, extract the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame of image; and according to that position, render and display, in the first frame of image, the image area where the selected moving object is located in each of those frames.

In the fourth case, the position indication information is an image segmentation mask, the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, and the plurality of objects include the one or more moving objects. In this case, the implementation process of parsing the moving image sequence from the code stream includes: based on the image segmentation mask, determining a plurality of segmentation regions in one-to-one correspondence with the multiple objects, and parsing from the code stream the object state corresponding to each of the multiple segmentation regions, where the object state is either a static state or a motion state. Based on the object state corresponding to each of the plurality of segmentation regions, the image areas delimited by the segmentation regions whose object state is the motion state are parsed from the code stream to obtain the moving image sequence.

The implementation process of determining, based on the image segmentation mask, a plurality of segmentation regions in one-to-one correspondence with the plurality of objects includes: determining, based on the image segmentation mask, the location area where each object of the plurality of objects is located. If the location area where any one of the plurality of objects is located does not include an integer number of CTUs, the boundary of the location area where that object is located is extended so that the location area includes an integer number of CTUs. The location areas where the multiple objects are located after the expansion processing are determined as the multiple segmentation regions.
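In the fourth case, the parsed object states decide which regions the decoder actually needs to decode. A minimal sketch of that selection follows; the `ObjectState` enumeration, its numeric values and the region identifiers are hypothetical, since the text does not define how states are represented in the code stream.

```python
from enum import Enum
from typing import Dict, List


class ObjectState(Enum):
    STATIC = 0  # assumed encoding of the static state
    MOVING = 1  # assumed encoding of the motion state


def regions_to_parse(region_states: Dict[int, ObjectState]) -> List[int]:
    """Return the ids of the segmentation regions whose image areas must be decoded."""
    return sorted(rid for rid, state in region_states.items() if state is ObjectState.MOVING)


# Hypothetical states parsed from the code stream, keyed by segmentation-region id.
parsed_states = {
    0: ObjectState.STATIC,  # sky
    1: ObjectState.MOVING,  # river
    2: ObjectState.MOVING,  # user
    3: ObjectState.STATIC,  # grass
}
moving_regions = regions_to_parse(parsed_states)
```

Regions in the static state are skipped entirely, which is where the reduction in decoding complexity comes from.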

For the above four cases, the one or more moving objects mentioned above may be all of the moving objects among the multiple objects included in the dynamic image, or may be only some of the moving objects among the multiple objects. That is, for the moving objects in a dynamic image, the decoding end can either treat all of them as being in a moving state or select only some of them to be in a moving state.

That is, an object selection instruction for selecting one or more objects from a plurality of objects included in the dynamic image is received. One or more objects selected by the object selection instruction are determined as one or more moving objects in the above steps.

The object selection instruction may be triggered by the user based on the first frame of image. For example, all moving objects in the dynamic image are marked in the first frame of image, and the user can select some or all of these moving objects in the first frame of image; the selected objects are the one or more moving objects in the above steps.

In addition, the encoder type used for encoding at the encoding end may be pre-agreed by the encoding end and the decoding end, or may be selected by the user of the encoding end. In the case of user selection, the encoding end also needs to encode the encoder type used for encoding into the code stream. The decoding end then also needs to parse the encoder type used for encoding from the code stream and determine the corresponding decoder type according to the parsed encoder type, so that the above image or image sequence is parsed from the code stream according to the determined decoder type.
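The decoder-type lookup can be sketched as a small table keyed by the parsed encoder type, with a fallback to the pre-agreed type when nothing is signalled. The type labels (`"hevc_encoder"`, `"vvc_encoder"` and their decoder counterparts), the table `DECODER_FOR_ENCODER` and the function `select_decoder` are purely illustrative; the text does not name specific encoder types or how they are signalled.

```python
from typing import Optional

# Hypothetical mapping from signalled encoder type to the matching decoder type.
DECODER_FOR_ENCODER = {
    "hevc_encoder": "hevc_decoder",
    "vvc_encoder": "vvc_decoder",
}


def select_decoder(parsed_encoder_type: Optional[str], agreed_encoder_type: str = "hevc_encoder") -> str:
    """Pick the decoder type: use the signalled encoder type if present, else the pre-agreed one."""
    encoder_type = parsed_encoder_type if parsed_encoder_type is not None else agreed_encoder_type
    try:
        return DECODER_FOR_ENCODER[encoder_type]
    except KeyError:
        raise ValueError(f"no decoder known for encoder type {encoder_type!r}") from None


signalled = select_decoder("vvc_encoder")  # encoder type parsed from the code stream
pre_agreed = select_decoder(None)          # nothing signalled: fall back to the agreed type
```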

A third aspect provides an apparatus for encoding a dynamic image, the encoding apparatus having a function of implementing the behavior of the dynamic image encoding method in the first aspect. The encoding apparatus includes at least one module, and the at least one module is configured to implement the dynamic image encoding method provided in the first aspect above.

In a fourth aspect, a dynamic image decoding apparatus is provided, and the decoding apparatus has a function of implementing the behavior of the dynamic image decoding method in the second aspect. The decoding apparatus includes at least one module, and the at least one module is configured to implement the dynamic image decoding method provided in the second aspect above.

In a fifth aspect, an encoding end device is provided, the encoding end device includes a processor and a memory, and the memory is used for storing a program for executing the dynamic image encoding method provided in the first aspect. The processor is configured to execute the program stored in the memory, so as to implement the dynamic image encoding method provided in the first aspect.

Optionally, the encoding end device may further include a communication bus, and the communication bus is used to establish a connection between the processor and the memory.

In a sixth aspect, a decoding end device is provided, and the decoding end device includes a processor and a memory, and the memory is used for storing a program for executing the dynamic image decoding method provided in the second aspect. The processor is configured to execute the program stored in the memory, so as to implement the dynamic image decoding method provided in the second aspect.

Optionally, the decoding end device may further include a communication bus, and the communication bus is used to establish a connection between the processor and the memory.

In a seventh aspect, a computer-readable storage medium is provided, and instructions are stored in the storage medium. When the instructions are executed on a computer, the computer is made to execute the steps of the dynamic image encoding method described in the first aspect above, or to execute the steps of the dynamic image decoding method described in the second aspect above.

In an eighth aspect, there is provided a computer program product comprising instructions, which, when the instructions are run on a computer, cause the computer to execute the steps of the method for encoding a dynamic image described in the first aspect above, or execute the steps of the method for encoding a dynamic image in the second aspect above. The steps of the decoding method of the moving picture described above.

The technical effects obtained by the third aspect, the fourth aspect, the fifth aspect, the sixth aspect, the seventh aspect and the eighth aspect are similar to the technical effects obtained by the corresponding technical means in the first aspect or the second aspect, here No longer.

The technical solutions provided in the embodiments of the present application can at least bring the following beneficial effects:

Description of drawings

FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an exemplary implementation environment provided by an embodiment of the present application;

FIG. 3 is a schematic structural block diagram of an encoder provided by an embodiment of the present application;

FIG. 4 is a schematic structural block diagram of a decoder provided by an embodiment of the present application;

FIG. 5 is a flowchart of a first dynamic image encoding method provided by an embodiment of the present application;

FIG. 6 is a flowchart of a first dynamic image decoding method provided by an embodiment of the present application;

FIG. 7 is a block diagram of a first exemplary encoding and decoding method provided by an embodiment of the present application;

FIG. 8 is a block diagram of a second exemplary encoding and decoding method provided by an embodiment of the present application;

FIG. 9 is a flowchart of a second dynamic image encoding method provided by an embodiment of the present application;

FIG. 10 is a flowchart of a second dynamic image decoding method provided by an embodiment of the present application;

FIG. 11 is a block diagram of a third exemplary encoding and decoding method provided by an embodiment of the present application;

FIG. 12 is a block diagram of a fourth exemplary encoding and decoding method provided by an embodiment of the present application;

FIG. 13 is a flowchart of a third dynamic image encoding method provided by an embodiment of the present application;

FIG. 14 is a flowchart of a third dynamic image decoding method provided by an embodiment of the present application;

FIG. 15 is a flowchart of a fourth dynamic image encoding method provided by an embodiment of the present application;

FIG. 16 is a flowchart of a fourth dynamic image decoding method provided by an embodiment of the present application;

FIG. 17 is a block diagram of a fifth exemplary encoding and decoding method provided by an embodiment of the present application;

FIG. 18 is a schematic structural diagram of an apparatus for encoding a dynamic image provided by an embodiment of the present application;

FIG. 19 is a schematic structural diagram of an apparatus for decoding a dynamic image provided by an embodiment of the present application;

FIG. 20 is a schematic block diagram of an encoding and decoding apparatus provided by an embodiment of the present application.

Detailed description

In order to make the objectives, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail below with reference to the accompanying drawings.

Before explaining the dynamic image encoding and decoding method provided by the embodiments of the present application in detail, the terms and implementation environments involved in the embodiments of the present application are first introduced.

For ease of understanding, the terms involved in the embodiments of the present application are explained first.

Encoding: refers to the process of compressing the image to be encoded into a code stream. Among them, coding is mainly divided into image coding and video coding. Image encoding is a process of compressing a still image to be encoded into a code stream, and video encoding is a process of compressing a sequence of images included in a video to be encoded into a code stream.

A dynamic image is an image in which a group of static images is switched according to a specified frequency to generate a dynamic effect. In the embodiment of the present application, the encoding of the dynamic image is divided into the encoding of the static image and the encoding of the video.

It should be noted that, after a static image is compressed into a code stream, it may also be called an encoded still image, and after a video is compressed into a code stream, it may also be called an encoded video. Similarly, for the encoding of dynamic images, the dynamic images can also be called encoded dynamic images after being compressed into a code stream.

Decoding: refers to the process of restoring the encoded code stream into a reconstructed image according to specific grammar rules and processing methods. Among them, the decoding is mainly divided into the decoding of the image code stream and the decoding of the video code stream. The decoding of the image code stream refers to the process of restoring the image code stream into a reconstructed image, and the decoding of the video code stream refers to the process of restoring the video code stream into a reconstructed video.

Sub-image sequence: refers to a sequence of image regions extracted from each frame of image included in the image sequence.

Coding block: refers to the coding area obtained by dividing the image to be coded. A frame of image can be divided into multiple coding blocks, and the multiple coding blocks together form the frame image. Among them, each coding block can be independently coded.

The coding block may be composed of tiles or slices. One tile includes at least one coding tree unit (coding tree unit, CTU), and one slice includes multiple CTUs.
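As a rough illustration of how a frame breaks down into CTUs and tiles, the following Python sketch counts the CTUs needed to cover a frame and splits the CTU grid into a uniform tile layout. The 64x64 CTU size, the uniform split and the function names `ctu_grid` and `split_into_tiles` are assumptions made for illustration only; the embodiments do not fix any of them.

```python
import math


def ctu_grid(width, height, ctu_size=64):
    """Return (columns, rows) of CTUs needed to cover a width x height frame.

    ctu_size=64 is an assumed value, not one stated in the application.
    """
    return math.ceil(width / ctu_size), math.ceil(height / ctu_size)


def split_into_tiles(cols, rows, tile_cols, tile_rows):
    """Split a cols x rows CTU grid into tile_cols x tile_rows tiles.

    Each tile is returned as (first_col, first_row, n_cols, n_rows),
    all measured in CTUs. Every tile contains at least one CTU.
    """
    if not (1 <= tile_cols <= cols and 1 <= tile_rows <= rows):
        raise ValueError("every tile must contain at least one CTU")
    tiles = []
    for r in range(tile_rows):
        y0, y1 = r * rows // tile_rows, (r + 1) * rows // tile_rows
        for c in range(tile_cols):
            x0, x1 = c * cols // tile_cols, (c + 1) * cols // tile_cols
            tiles.append((x0, y0, x1 - x0, y1 - y0))
    return tiles
```

Under these assumptions, a 1920x1080 frame is covered by a 30x17 CTU grid, and the tiles of a 2x2 layout together contain all 510 CTUs, so the tiles jointly form the whole frame.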

Next, the implementation environment involved in the embodiments of the present application will be introduced.

Please refer to FIG. 1, which is a schematic diagram of an implementation environment provided by an embodiment of the present application. The implementation environment includes a source device 10, a destination device 20, a link 30 and a storage device 40. The source device 10 may generate an encoded dynamic image, and may therefore also be referred to as a dynamic image encoding device. The destination device 20 may decode the encoded dynamic image generated by the source device 10, and may therefore also be referred to as a dynamic image decoding device. The link 30 may receive the encoded dynamic image generated by the source device 10 and transmit it to the destination device 20. The storage device 40 may receive and store the encoded dynamic image generated by the source device 10, in which case the destination device 20 can obtain the encoded dynamic image directly from the storage device 40. Alternatively, the storage device 40 may correspond to a file server or another intermediate storage device that holds the encoded dynamic images generated by the source device 10, in which case the destination device 20 may obtain the encoded dynamic images stored on the storage device 40 via streaming or download.

Source device 10 and destination device 20 may each include one or more processors and a memory coupled to the one or more processors. The memory may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures accessible by a computer. For example, source device 10 and destination device 20 may each be a desktop computer, a mobile computing device, a notebook (e.g., laptop) computer, a tablet computer, a set-top box, a telephone handset such as a so-called "smart" phone, a television, a camera, a display device, a digital media player, a video game console, an in-vehicle computer, or the like.

Link 30 may include one or more media or devices capable of transmitting encoded dynamic images from source device 10 to destination device 20. In one possible implementation, link 30 may include one or more communication media that enable source device 10 to transmit encoded dynamic images directly to destination device 20 in real time. In the embodiment of the present application, the source device 10 may modulate the encoded dynamic image according to a communication standard, such as a wireless communication protocol, and may transmit the modulated dynamic image to the destination device 20. The one or more communication media may include wireless and/or wired communication media; for example, they may include a radio frequency (RF) spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, which may be a local area network, a wide area network, or a global network (e.g., the Internet), among others. The one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from the source device 10 to the destination device 20, which is not specifically limited in this embodiment of the present application.

In a possible implementation manner, the storage device 40 may store the encoded dynamic images received from the source device 10, and the destination device 20 may acquire the encoded dynamic images directly from the storage device 40. Under such conditions, storage device 40 may include any of a variety of distributed or locally accessed data storage media, for example, a hard disk drive, a Blu-ray disc, a digital versatile disc (DVD), a compact disc read-only memory (CD-ROM), flash memory, volatile or non-volatile memory, or any other suitable digital storage medium for storing encoded dynamic images.

In one possible implementation, storage device 40 may correspond to a file server or another intermediate storage device that holds the encoded dynamic images generated by source device 10, and destination device 20 may obtain the dynamic images stored on storage device 40 via streaming or download. The file server may be any type of server capable of storing encoded dynamic images and transmitting them to the destination device 20. In a possible implementation manner, the file server may include a network server, a file transfer protocol (FTP) server, a network attached storage (NAS) device, a local disk drive, or the like. The destination device 20 may acquire the encoded dynamic images over any standard data connection, including an Internet connection. Any standard data connection may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., a digital subscriber line (DSL), a cable modem, etc.), or a combination of both that is suitable for obtaining encoded dynamic images stored on a file server. Transmission of the encoded dynamic images from storage device 40 may be streaming, download transmission, or a combination of the two.

The implementation environment shown in FIG. 1 is only one possible implementation manner. The techniques of the embodiments of the present application are applicable not only to the source device 10 that encodes dynamic images and the destination device 20 that decodes encoded dynamic images shown in FIG. 1, but also to other devices that can encode dynamic images and decode encoded dynamic images, which is not specifically limited in this embodiment of the present application.

In the implementation environment shown in FIG. 1, the source device 10 includes a data source 120, an encoder 100 and an output interface 140. In some embodiments, output interface 140 may include a modulator/demodulator (modem) and/or a transmitter. Data source 120 may include an image capture device (e.g., a camera), an archive containing previously captured dynamic images, a feed interface for receiving dynamic images from dynamic image content providers, and/or a computer graphics system for generating dynamic images, or a combination of these sources of dynamic images.

The data source 120 may send a dynamic image to the encoder 100, and the encoder 100 may encode the dynamic image received from the data source 120 to obtain an encoded dynamic image. The encoder can send the encoded moving image to the output interface. In some embodiments, source device 10 sends the encoded dynamic image directly to destination device 20 via output interface 140 . In other embodiments, the encoded dynamic images may also be stored on storage device 40 for later retrieval by destination device 20 and for decoding and/or display.

In the implementation environment shown in FIG. 1, the destination device 20 includes an input interface 240, a decoder 200 and a display device 220. In some embodiments, input interface 240 includes a receiver and/or a modem. The input interface 240 may receive the encoded dynamic image via the link 30 and/or from the storage device 40 and send it to the decoder 200, and the decoder 200 may decode the received encoded dynamic image to obtain the decoded dynamic image. The decoder may transmit the decoded dynamic image to the display device 220. Display device 220 may be integrated with destination device 20 or may be external to destination device 20. Generally, the display device 220 displays the decoded dynamic image. The display device 220 may be any of various types of display devices, for example, a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display or another type of display device.

Although not shown in FIG. 1, in some aspects encoder 100 and decoder 200 may each be integrated with an audio encoder and decoder, and may include an appropriate multiplexer-demultiplexer (MUX-DEMUX) unit or other hardware and software to handle the encoding of both audio and video in a common data stream or separate data streams. In some embodiments, the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol, or other protocols such as the user datagram protocol (UDP), if applicable.

The encoder 100 and the decoder 200 may each be any of the following circuits: one or more microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), discrete logic, hardware, or any combination thereof. If the techniques of the embodiments of the present application are implemented partly in software, a device may store the instructions for the software in a suitable non-volatile computer-readable storage medium, and may execute the instructions in hardware using one or more processors to implement the techniques of the embodiments of the present application. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered one or more processors. Each of the encoder 100 and the decoder 200 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec) in a corresponding device.

Embodiments of the present application may generally refer to encoder 100 as "signaling" or "sending" certain information to another device, such as decoder 200. The terms "signaling" or "sending" may generally refer to the transmission of syntax elements and/or other data used to decode compressed dynamic images. This transmission can occur in real time or near real time. Alternatively, this transmission may occur over a period of time, such as when the syntax elements are stored in the encoded bitstream on a computer-readable storage medium at the time of encoding; the decoding device may then retrieve the syntax elements at any time after they have been stored on this medium.

Please refer to FIG. 2 , which is a schematic diagram of an exemplary implementation environment provided by an embodiment of the present application. The implementation environment includes a cloud server 101 and a terminal device 201 , and the cloud server 101 is in communication connection with the terminal device 201 . The communication connection may be a wireless connection or a wired connection, which is not limited in this embodiment of the present application.

The cloud server 101 may be the source device 10 in the implementation environment shown in FIG. 1 above. The cloud server 101 is used to encode the dynamic image and transmit the encoded dynamic image to the terminal device 201.

The terminal device 201 may be the destination device 20 in the implementation environment shown in FIG. 1 above. The terminal device 201 is used for decoding the encoded dynamic image transmitted by the cloud server 101, and displaying the dynamic image obtained after decoding.

Optionally, the terminal device 201 is further configured to collect images and transmit the collected images to the cloud server 101 , and the cloud server 101 generates dynamic images based on the images collected by the terminal device 201 to provide the cloud server 101 with a data source.

The terminal device 201 can be any electronic product that can interact with the user through one or more of a keyboard, a touchpad, a touchscreen, a remote control, voice interaction, or a handwriting device, for example, a personal computer (PC), a mobile phone, a smart phone, a personal digital assistant (PDA), a wearable device, a pocket PC (PPC), a tablet PC, a smart car, a smart TV, a smart speaker, etc.

The cloud server 101 may be a server, a server cluster composed of multiple servers, or a cloud computing service center.

Those skilled in the art should understand that the above-mentioned terminal device 201 and the cloud server 101 are only examples, and other existing or future terminals or servers, if applicable to the embodiments of the present application, should also be included in the protection scope of the embodiments of the present application and is hereby incorporated by reference.

Please refer to FIG. 3, which is a schematic structural block diagram of an encoder 100 provided by an embodiment of the present application. The encoder 100 includes an encoding mode determination module 110, a semantic segmentation module 111, an image sequence extraction module 112, a position indication information encoding module 113, an image encoding module 114, a first video encoding module 115, a first code stream encapsulation module 116, a second video encoding module 117 and a second code stream encapsulation module 118.

The encoding mode determination module 110 is configured to determine the encoding mode of the dynamic image, that is, to determine whether the dynamic image is encoded in the region division encoding mode or in the video encoding mode. The region division coding mode refers to the coding mode provided by the embodiments of the present application, and the video coding mode refers to the traditional coding mode. That is to say, the dynamic image may be coded according to the coding mode provided by the embodiment of the present application, or may be coded according to the traditional video coding mode.

When the dynamic image is encoded in the region segmentation encoding mode, the encoder 100 includes a semantic segmentation module 111, an image sequence extraction module 112, a position indication information encoding module 113, an image encoding module 114, a first video encoding module 115 and a first code stream encapsulation module 116. When the dynamic image is encoded in the video encoding mode, the encoder 100 includes a second video encoding module 117 and a second code stream encapsulation module 118.
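The mode-dependent module sets described above can be summarized in a small lookup, sketched here in Python. The module names and reference numerals are taken from FIG. 3 as described in the text; the `EncodingMode` enum and the `active_modules` helper are hypothetical names introduced only for illustration, and how the encoding mode itself is chosen is outside this sketch.

```python
from enum import Enum


class EncodingMode(Enum):
    REGION_SEGMENTATION = "region segmentation encoding mode"
    VIDEO = "video encoding mode"


# Module names and reference numerals follow the description of FIG. 3.
MODULES_BY_MODE = {
    EncodingMode.REGION_SEGMENTATION: [
        "semantic segmentation module 111",
        "image sequence extraction module 112",
        "position indication information encoding module 113",
        "image encoding module 114",
        "first video encoding module 115",
        "first code stream encapsulation module 116",
    ],
    EncodingMode.VIDEO: [
        "second video encoding module 117",
        "second code stream encapsulation module 118",
    ],
}


def active_modules(mode):
    """Return the encoder modules used for the given encoding mode."""
    return list(MODULES_BY_MODE[mode])
```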

The semantic segmentation module 111 is used to perform semantic segmentation on any frame image in the dynamic image to obtain an image segmentation mask. The image sequence extraction module 112 is used to extract a moving image sequence from the dynamic image. The moving image sequence may be a sub-image sequence corresponding to a moving object in the dynamic image, or may be the dynamic image itself; the two cases are described separately in the following embodiments and are not elaborated here. The position indication information encoding module 113 is used to encode the position indication information to obtain a code stream including the encoded position indication information. The position indication information may be an image segmentation mask, or may be the coordinates, in the dynamic image, of a specified position in the image area where the moving object is located. These coordinates may be determined based on the image segmentation mask.
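To make the relation between the image segmentation mask and coordinate-style position indication information concrete, here is a minimal Python sketch. It assumes the mask is a 2D list of integer object labels and that the "specified position" is the top-left corner of the object's bounding rectangle; both choices, and the names `bounding_box` and `specified_position`, are illustrative assumptions rather than details given in the text.

```python
def bounding_box(mask, label):
    """Return (x, y, width, height) of the smallest rectangle that contains
    every pixel of `mask` equal to `label`. `mask` is a list of rows."""
    xs = [x for row in mask for x, value in enumerate(row) if value == label]
    ys = [y for y, row in enumerate(mask) if label in row]
    if not xs:
        raise ValueError(f"label {label} does not occur in the mask")
    x0, y0 = min(xs), min(ys)
    return x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1


def specified_position(mask, label):
    """Assumed convention: the specified position is the top-left corner
    of the object's bounding rectangle, in frame coordinates."""
    x, y, _, _ = bounding_box(mask, label)
    return x, y
```

For example, with a mask in which label 1 covers pixels (1, 1), (2, 1), (2, 2) and (3, 2), `bounding_box` yields `(1, 1, 3, 2)` and `specified_position` yields `(1, 1)`.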

The image encoding module 114 is configured to encode the first frame of image in the dynamic image to obtain the encoded code stream of the first frame of image. It should be noted that the moving image sequence may be the sub-image sequence corresponding to the moving object in the dynamic image, or the dynamic image itself. When the moving image sequence is the sub-image sequence corresponding to the moving object in the dynamic image, the image encoding module 114 encodes the first frame of image. When the moving image sequence is the dynamic image itself, the first frame of image need not be encoded separately, in which case the encoder 100 may not include the image encoding module 114.

The first video encoding module 115 is configured to encode the moving image sequence determined by the image sequence extraction module 112 to obtain a code stream of the encoded moving image sequence. The first code stream encapsulation module 116 is used to encapsulate the code streams produced by the position indication information encoding module 113, the image encoding module 114 and the first video encoding module 115 to obtain a combined code stream, and then to send the combined code stream to the output interface 140. The output interface 140 can send the combined code stream to the decoder 200.
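The encapsulation step can be pictured with a short Python sketch that packs the three sub-streams into one combined code stream and unpacks them again. The container layout used here (a 1-byte tag followed by a 4-byte big-endian length and the payload) is purely hypothetical, as are the tag values and the names `encapsulate` and `decapsulate`; the text does not specify the actual encapsulation format.

```python
import struct

# Hypothetical tags identifying each sub-stream in the combined code stream.
POSITION_INFO, FIRST_FRAME, MOVING_SEQUENCE = 0, 1, 2

HEADER = ">BI"  # 1-byte tag, 4-byte big-endian payload length
HEADER_SIZE = struct.calcsize(HEADER)


def encapsulate(parts):
    """Concatenate (tag, payload) pairs into one combined code stream."""
    out = bytearray()
    for tag, payload in parts:
        out += struct.pack(HEADER, tag, len(payload))
        out += payload
    return bytes(out)


def decapsulate(stream):
    """Split a combined code stream back into (tag, payload) pairs."""
    parts, offset = [], 0
    while offset < len(stream):
        tag, length = struct.unpack_from(HEADER, stream, offset)
        offset += HEADER_SIZE
        if offset + length > len(stream):
            raise ValueError("truncated payload")
        parts.append((tag, stream[offset:offset + length]))
        offset += length
    return parts
```

A decoder-side counterpart would call `decapsulate` and route each payload to the matching decoding module according to its tag.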

It should be noted that, for the region segmentation encoding mode, the embodiments of the present application provide various implementations. In different implementations, the encoder 100 may include all of the position indication information encoding module 113, the image encoding module 114 and the first video encoding module 115, or only some of these modules.

The second video encoding module 117 is configured to encode the dynamic image in a video encoding manner to obtain a code stream including the encoded dynamic image. The second code stream encapsulation module 118 is configured to encapsulate the code stream produced by the second video encoding module 117, and send the encapsulated code stream to the output interface 140. The output interface 140 can send the encapsulated code stream to the decoder 200.

It should be understood that the encoder 100 shown in FIG. 3 is only one implementation manner provided by the embodiments of the present application, and in other implementation manners, the encoder 100 may include more or fewer modules than those shown in FIG. 3. This embodiment of the present application does not limit this.

Please refer to FIG. 4 , which is a schematic structural block diagram of a decoder 200 provided by an embodiment of the present application. The decoder 200 includes a decoding mode determination module 210 , a position indication information decoding module 211 , an image decoding module 212 , a first video decoding module 213 , an image synthesis module 214 and a second video decoding module 215 .

The decoding mode determination module 210 is used for determining the decoding mode of the dynamic image, that is, for determining whether the dynamic image is decoded in the region division decoding mode or in the video decoding mode. The region division decoding mode refers to the decoding mode provided by the embodiments of the present application, and the video decoding mode refers to the traditional decoding mode. That is to say, when the dynamic image has been encoded in the encoding mode provided by the embodiment of the present application, it can be decoded in the decoding mode provided by the embodiment of the present application; when the dynamic image has been encoded in the traditional encoding mode, it can be decoded in the traditional video decoding mode.

When the moving image is decoded in the region division decoding mode, the decoder 200 includes a position indication information decoding module 211 , an image decoding module 212 , a first video decoding module 213 and an image synthesis module 214 . When the moving image is decoded in the video decoding mode, the decoder 200 includes a second video decoding module 215 .

The location indication information decoding module 211 is configured to decode the code stream including the encoded location indication information to obtain the location indication information. The position indication information may be an image segmentation mask, or may be the coordinates in the dynamic image of a specified position in the image area where the moving object is located.

The image decoding module 212 is used for parsing the first frame of image from the code stream. It should be noted that the moving image sequence may be the sub-image sequence corresponding to the moving object in the dynamic image, or may be the dynamic image itself. When the moving image sequence is the sub-image sequence corresponding to the moving object in the dynamic image, the code stream transmitted by the encoding end includes the encoded code stream of the first frame of image, and the image decoding module 212 decodes that code stream to obtain the first frame of image. When the moving image sequence is the dynamic image itself, the image decoding module 212 parses the first frame of image from the code stream including the encoded dynamic image.

The first video decoding module 213 is configured to decode the code stream including the encoded moving image sequence to obtain the moving image sequence. The moving image sequence may be a sub-image sequence corresponding to a moving object in the dynamic image, or may be the dynamic image itself; the two cases are described separately in the following embodiments and are not detailed here. The image synthesis module 214 is configured to synthesize the outputs of the position indication information decoding module 211, the image decoding module 212 and the first video decoding module 213 to obtain a dynamic image, and transmit the dynamic image to the display device 220. The display device 220 can display the dynamic image.

It should be noted that, for the region segmentation decoding mode, the embodiments of the present application provide various implementations. In different implementations, the decoder 200 may include all of the position indication information decoding module 211, the image decoding module 212 and the first video decoding module 213, or only some of these modules.

The second video decoding module 215 is configured to decode the code stream including the encoded dynamic image to obtain the dynamic image. After that, the dynamic image can be transmitted to the display device 220. The display device 220 can display the dynamic image.

It should be understood that the decoder 200 shown in FIG. 4 is only one implementation manner provided by the embodiments of the present application, and in other implementation manners, the decoder 200 may include more or fewer modules than those shown in FIG. 4. This embodiment of the present application does not limit this.

Next, the method for encoding and decoding a dynamic image provided by the embodiment of the present application will be described. It should be noted that, with reference to the implementation environment shown in FIG. 1 , any of the following dynamic image encoding methods may be executed by the encoder 100 in the source device 10 . Taking FIG. 2 as an example, any of the following dynamic image encoding methods may be executed by the cloud server 101 in FIG. 2 . Any of the following methods of decoding a moving image may be performed by the decoder 200 in the destination device 20 . Taking FIG. 2 as an example, any of the following dynamic image decoding methods may be performed by the terminal device 201 in FIG. 2 .

In the dynamic image encoding method provided in the embodiment of the present application, any frame of image in the dynamic image can be semantically segmented to obtain an image segmentation mask, where the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects. Based on the dynamic image, a moving image sequence is determined, and each frame of image in the moving image sequence includes an image area where one or more moving objects of the multiple objects are located. Based on the image segmentation mask, position indication information is determined, the position indication information being used to indicate the position of the image area in which the one or more moving objects are located. The moving image sequence and the position indication information are encoded into the code stream.

In the dynamic image decoding method provided by the embodiment of the present application, the first frame of image can be parsed from the code stream, and the moving image sequence and the position indication information can be parsed from the code stream. Each frame of image in the moving image sequence includes the image area where one or more moving objects are located, and the position indication information is used to indicate the position of the image area where the one or more moving objects are located. Based on the moving image sequence and the position indication information, the image area where the one or more moving objects are located is rendered and displayed in the first frame of image to obtain a dynamic image.
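The rendering step on the decoding side can be sketched in Python as follows. The sketch assumes that frames are 2D lists of pixel values, that each decoded sub-image is a rectangle whose top-left corner in the frame is given by `origin`, and that the reconstructed dynamic image consists of the first frame followed by one rendered frame per sub-image. When a segmentation mask is supplied, only pixels whose mask value equals `label` are overwritten. The function name `render_dynamic_image` and all of these conventions are assumptions for illustration.

```python
import copy


def render_dynamic_image(first_frame, sub_images, origin, mask=None, label=None):
    """Rebuild the frames of a dynamic image from its first frame and the
    decoded sub-images of one moving object.

    origin is the (x, y) frame coordinate of each sub-image's top-left pixel.
    If mask is given, a pixel is overwritten only where mask == label.
    """
    x0, y0 = origin
    frames = [copy.deepcopy(first_frame)]
    for sub in sub_images:
        frame = copy.deepcopy(first_frame)  # static background from frame 1
        for j, row in enumerate(sub):
            for i, pixel in enumerate(row):
                if mask is None or mask[y0 + j][x0 + i] == label:
                    frame[y0 + j][x0 + i] = pixel
        frames.append(frame)
    return frames
```

Because every output frame starts from a copy of the first frame, only the moving object's region changes from frame to frame, which matches the idea that the static background is carried once by the first frame.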

It should be noted that the moving image sequence may include one or more sub-image sequences corresponding to the one or more moving objects, or may be the moving image itself. The position indication information may be an image segmentation mask, or may be the coordinates in the dynamic image of the specified position of the image region where each moving object of the one or more moving objects is located. Therefore, the following will be divided into several embodiments, and the method for encoding and decoding a dynamic image provided by the embodiments of the present application will be explained in detail.

Please refer to FIG. 5 . FIG. 5 is a flowchart of a first dynamic image encoding method provided by an embodiment of the present application. In this method, the moving image sequence includes one or more sub-image sequences, and the position indication information is an image segmentation mask. The encoding method includes the following steps.

Step 501: Semantic segmentation is performed on any frame image in the dynamic image to obtain an image segmentation mask, where the dynamic image includes multiple objects, the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects, and the image segmentation mask is used to indicate the position of the image area in which one or more moving objects among the multiple objects are located.

Step 502: Extract one or more sub-image sequences based on the image segmentation mask and the dynamic image, where the one or more sub-image sequences correspond to one or more moving objects among the multiple objects one-to-one.

The extraction method of the sub-image sequence corresponding to each moving object is the same. Therefore, in some embodiments, a moving object may be selected from the one or more moving objects, and the sub-image sequence corresponding to the selected moving object is determined according to the following operations, until the sub-image sequence corresponding to each moving object has been determined: based on the image segmentation mask, determine the location area where the selected moving object is located; then, based on that location area, extract the image area where the selected moving object is located from each frame of the dynamic image except the first frame image, to obtain the sub-image sequence corresponding to the selected moving object.

In general, the area formed by the outline of a moving object is an irregular area, that is, the location area where the moving object is located is not a regular area. Therefore, in some embodiments, the image area within the location area where the moving object is located can be extracted directly from each frame of the dynamic image except the first frame image. Of course, in other embodiments, the location area where the moving object is located can first be expanded into a regular area, and the image area within the regular area is then extracted from each frame of the dynamic image except the first frame image.
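The direct extraction of an irregular location area might be sketched as follows. This is a minimal Python illustration, not the implementation of the present application: frames and the mask are assumed to be row-major nested lists indexed as frame[y][x], and the function name extract_irregular_region is hypothetical.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # image[y][x]; a single-channel image is assumed


def extract_irregular_region(frames: List[Image], mask: Image,
                             m_k: int) -> List[Dict[Tuple[int, int], int]]:
    """For every frame except the first, keep only the pixels labelled m_k.

    Each returned element maps (x, y) to the pixel value at that position,
    so the irregular outline of the object is preserved.
    """
    # Coordinates of the location area of the object, read from the mask.
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row)
              if value == m_k]
    # frames[0] is the first frame image, which is encoded separately.
    return [{(x, y): frame[y][x] for (x, y) in coords} for frame in frames[1:]]
```

A regular (expanded) area is usually easier to hand to an existing video encoder, which is presumably why the expansion described next is offered as an alternative.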

It should be noted that there are various implementations for expanding the location area where the moving object is located. For example, the minimum abscissa, the minimum ordinate, the maximum abscissa and the maximum ordinate are determined from the set of pixel coordinates corresponding to the selected moving object; then a square area whose abscissa lies between the minimum abscissa and the maximum abscissa and whose ordinate lies between the minimum ordinate and the maximum ordinate is determined, and this square area is taken as the expanded location area. Alternatively, a square area circumscribing the location area is drawn directly based on the location area where the moving object is located, and the circumscribing square area is taken as the expanded location area. This embodiment of the present application does not limit the manner of expansion, as long as the expanded location area includes the location area where the moving object is located.

For example, for a moving object K, the pixel value of the moving object K in the image segmentation mask is Mk. Each pixel in the image segmentation mask is scanned to determine the coordinates of the pixels whose pixel value is Mk, so as to obtain the pixel coordinate set {(x_k1, y_k1), (x_k2, y_k2), ..., (x_kN, y_kN)}, where N is the number of pixels whose pixel value is Mk. The minimum abscissa min_Xk, the minimum ordinate min_Yk, the maximum abscissa max_Xk and the maximum ordinate max_Yk are then determined, that is, min_Xk = min{x_k1, x_k2, ..., x_kN}, min_Yk = min{y_k1, y_k2, ..., y_kN}, max_Xk = max{x_k1, x_k2, ..., x_kN}, and max_Yk = max{y_k1, y_k2, ..., y_kN}. The square area formed by the coordinates in the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} may then be determined as the expanded location area corresponding to the moving object K. Then, the image area located in the expanded location area is extracted from each frame of the dynamic image except the first frame image.
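The expansion and extraction described for the moving object K could look like the following Python sketch. It assumes that the mask and the frames are nested lists indexed as image[y][x] and that all frames share the size of the mask; the function names bounding_area, crop and extract_sub_image_sequence are illustrative only and do not come from the application.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # image[y][x]
Box = Tuple[int, int, int, int]  # (min_Xk, min_Yk, max_Xk, max_Yk), inclusive


def bounding_area(mask: Image, m_k: int) -> Optional[Box]:
    """Scan the mask line by line and return the bounds of the pixels equal to m_k."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value == m_k:
                xs.append(x)
                ys.append(y)
    if not xs:  # the object does not appear in the mask
        return None
    return min(xs), min(ys), max(xs), max(ys)


def crop(frame: Image, box: Box) -> Image:
    """Cut out min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk (both ends included)."""
    min_x, min_y, max_x, max_y = box
    return [row[min_x:max_x + 1] for row in frame[min_y:max_y + 1]]


def extract_sub_image_sequence(frames: List[Image], mask: Image, m_k: int) -> List[Image]:
    """Crop the expanded location area from every frame except the first."""
    box = bounding_area(mask, m_k)
    if box is None:
        return []
    return [crop(frame, box) for frame in frames[1:]]
```

Because the mask is computed once, the same area is cropped from every frame; this matches the premise that the moving object stays within the region indicated by the mask.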

Step 503: Encode the first frame image in the dynamic image, the one or more sub-image sequences and the image segmentation mask into the code stream.

The first frame image in the dynamic image and the image segmentation mask can each be encoded into the code stream by an image encoder. Each sub-image sequence in the one or more sub-image sequences can be encoded into the code stream by a video encoder.

For the convenience of description, the image encoder used for the first frame image in the dynamic image is called the first image encoder, the image encoder used for the image segmentation mask is called the second image encoder, and the video encoder used for the one or more sub-image sequences is called the first video encoder. The first image encoder and the second image encoder may be the same or different.

Usually, in the first frame image of the dynamic image, the pixel values of the pixels in the image area where the same object is located differ from one another, whereas in the image segmentation mask, the pixel values of the pixels in the image area where the same object is located are the same. Therefore, an image encoder with higher encoding efficiency can be used to encode the first frame image in the dynamic image, and a general image encoder can be used to encode the image segmentation mask.

It should be noted that the encoding end and the decoding end may agree on the first image encoder, the second image encoder, and the first video encoder in advance. Of course, the first image encoder, the second image encoder and the first video encoder can also be selected by the user. When the user selects them, the type of the first image encoder, the type of the second image encoder and the type of the first video encoder also need to be encoded into the code stream. Moreover, these image encoders and video encoders may be encoders that the encoding end already includes.
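As a rough illustration of signalling user-selected encoder types, the Python sketch below writes and reads a small header. The byte layout (a one-byte flag followed by three one-byte type fields) is purely an assumption made for this example and is not defined by the application; the field names follow the syntax elements image_codec_type, mask_codec_type and video_codec_type used later in this description.

```python
import struct
from typing import Dict, Optional, Tuple


def write_codec_header(user_selected: bool, image_codec_type: int = 0,
                       mask_codec_type: int = 0, video_codec_type: int = 0) -> bytes:
    """Serialize the codec types only when they were chosen by the user."""
    if not user_selected:
        # Pre-agreed codecs: nothing but the flag needs to be transmitted.
        return struct.pack("B", 0)
    return struct.pack("BBBB", 1, image_codec_type, mask_codec_type, video_codec_type)


def read_codec_header(data: bytes) -> Tuple[Optional[Dict[str, int]], int]:
    """Return (codec types or None if pre-agreed, number of bytes consumed)."""
    if data[0] == 0:
        return None, 1
    _, image_type, mask_type, video_type = struct.unpack_from("BBBB", data, 0)
    return ({"image_codec_type": image_type,
             "mask_codec_type": mask_type,
             "video_codec_type": video_type}, 4)
```

A real system layer would carry these fields inside its container format rather than in a bare byte string.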

For each code stream obtained by the above encoding, it is also necessary to encapsulate each code stream to obtain a combined code stream, and then transmit the combined code stream to the decoding end.

The embodiment of the present application may adopt the ISO base media file format (ISOBMFF) (ISO/IEC 14496-12, MPEG-4 Part 12) standard to encapsulate the above-mentioned code streams, which is not limited in this embodiment of the present application. Certainly, the embodiment of the present application may also extend the HEIF (ISO/IEC 23008-12 standard) format to encapsulate the above-mentioned code streams.

For example, it is assumed that the embodiment of the present application adds a derived image sequence to the high efficiency image file format (HEIF) (ISO/IEC 23008-12 standard) format, with the type sovl, indicating that the derived image sequence is obtained by superimposing one or more sub-image sequences on the first frame image. The one or more sub-image sequences and the first frame image are specified by a sequence reference box (SequenceReferenceBox). The one or more sub-image sequences are encapsulated in tracks specified by the HEIF standard, and the first frame image is encapsulated in an item specified by the HEIF standard.

The syntax for this derived image sequence is as follows:

where output_width and output_height are the width and height of the output derived image sequence.

The reference_count is determined by SequenceReferenceBox, and represents the number of the one or more sub-image sequences.

horizontal_offset and vertical_offset represent the offset of the sub-image sequence relative to the upper left corner of the first frame image.

Wherein, from_track_id represents the identifier of the derived image sequence, to_item_id represents the identifier of the first frame image, reference_count represents the number of the one or more sub-image sequences, and to_track_id represents the identifier of the sub-image sequence.

In the embodiment of the present application, only the image area where a moving object is located in the dynamic image changes, while the image area where a stationary object is located does not change, and the image segmentation mask is used to indicate the position of the image area where the one or more moving objects are located. Therefore, after the image area where each moving object is located is extracted from each frame of the dynamic image except the first frame image, the first frame image, the image segmentation mask, and the image area where each moving object is located in each of those frames are encoded into the code stream, so that the dynamic image can be decoded subsequently. That is, the image area where the moving object is located is separated from the image area where the stationary object is located, and for the frames after the first frame only the former is encoded into the code stream; the image area where the stationary object is located does not need to be encoded again, which improves the encoding efficiency. In addition, since the embodiment of the present application can directly reuse the encoders that the encoding end already includes, it is only necessary to encapsulate the code streams obtained by encoding, and there is no need to design a dedicated encoder.

Please refer to FIG. 6 . FIG. 6 is a flowchart of a first dynamic image decoding method provided by an embodiment of the present application, and the decoding method corresponds to the encoding method shown in FIG. 5 . The decoding method includes the following steps.

Step 601: Parse the first frame of image from the code stream.

Based on the above description, the image encoder used for the first frame of image is called the first image encoder, and for convenience of description, the image decoder used for the first frame of image may also be called the first image decoder.

The first image encoder may be pre-agreed by the encoding end and the decoding end, or may be selected by the user during the encoding process. If the first image encoder is pre-agreed by the encoding end and the decoding end, the first image decoder is also pre-agreed, and the first frame of image is parsed from the code stream directly by the first image decoder. If the first image encoder is selected by the user, the type of the first image encoder needs to be parsed from the code stream first, the first image decoder is then determined based on that type, and the first frame of image is parsed from the code stream by the determined first image decoder.

Step 602: Parse out one or more sub-image sequences and an image segmentation mask from the code stream. The image segmentation mask includes multiple image regions in one-to-one correspondence with multiple objects, the one or more sub-image sequences are in one-to-one correspondence with one or more moving objects among the multiple objects, and the image segmentation mask is used to indicate the position of the image area where the one or more moving objects are located.

Based on the above description, the image encoder used by the image segmentation mask is called the second image encoder, and for convenience of description, the image decoder used by the image segmentation mask may also be called the second image decoder. Similarly, the video decoder adopted by the one or more sub-image sequences is referred to as the first video decoder.

The second image encoder may be pre-agreed by the encoding end and the decoding end, or may be selected by the user during the encoding process. If the second image encoder is pre-agreed by the encoding end and the decoding end, the second image decoder is also pre-agreed, and the image segmentation mask is parsed from the code stream directly by the second image decoder. If the second image encoder is selected by the user, the type of the second image encoder needs to be parsed from the code stream first, the second image decoder is then determined based on that type, and the image segmentation mask is parsed from the code stream by the determined second image decoder.

Similarly, the first video encoder may be pre-agreed by the encoding end and the decoding end, or may be selected by the user during the encoding process. If the first video encoder is pre-agreed by the encoding end and the decoding end, the first video decoder is also pre-agreed, and each sub-image sequence in the one or more sub-image sequences is parsed from the code stream directly by the first video decoder. If the first video encoder is selected by the user, the type of the first video encoder needs to be parsed from the code stream first, the corresponding first video decoder is then determined based on that type, and each sub-image sequence in the one or more sub-image sequences is parsed from the code stream by the determined first video decoder.
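The choice between a pre-agreed decoder and a decoder determined from a type parsed out of the code stream might be sketched as follows in Python. The registry, the stub decoders and the function select_decoder are hypothetical placeholders for this example; the mapping of 0 to JPEG and 1 to PNG follows the example values given for image_codec_type later in this description.

```python
from typing import Callable, Dict, Optional

Decoder = Callable[[bytes], str]


def decode_jpeg_stub(stream: bytes) -> str:
    # Placeholder standing in for a real JPEG decoder.
    return "decoded with JPEG"


def decode_png_stub(stream: bytes) -> str:
    # Placeholder standing in for a real PNG decoder.
    return "decoded with PNG"


IMAGE_DECODERS: Dict[int, Decoder] = {0: decode_jpeg_stub, 1: decode_png_stub}


def select_decoder(registry: Dict[int, Decoder], agreed_type: int,
                   signalled_type: Optional[int] = None) -> Decoder:
    """Use the type parsed from the code stream if present, else the pre-agreed one."""
    codec_type = agreed_type if signalled_type is None else signalled_type
    if codec_type not in registry:
        raise ValueError(f"no decoder available for codec type {codec_type}")
    return registry[codec_type]
```

The same selection logic would apply unchanged to the second image decoder and the first video decoder, each with its own registry.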

Step 603: Based on the one or more sub-image sequences and the image segmentation mask, render and display the image area where the one or more moving objects are located in the first frame of image to obtain a dynamic image.

The process of rendering and displaying the image area where each moving object is located in the first frame of image is the same. Therefore, in some embodiments, a moving object may be selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operations, until the image area where each moving object is located has been rendered and displayed: determine the position of the image area where the selected moving object is located based on the image segmentation mask; then, according to that position, render and display the image area included in the sub-image sequence corresponding to the selected moving object in the first frame of image.
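A possible shape of this rendering loop for one selected moving object is sketched below in Python. It assumes the sub-images are rectangular patches cropped from the expanded location area, that images are nested lists indexed as image[y][x], and that every patch fits inside the first frame; the names object_position and render_object are illustrative and not taken from the application.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # image[y][x]


def object_position(mask: Image, m_k: int) -> Optional[Tuple[int, int]]:
    """Return the upper left corner (min_Xk, min_Yk) of the pixels labelled m_k."""
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row)
              if value == m_k]
    if not coords:
        return None
    return min(x for x, _ in coords), min(y for _, y in coords)


def render_object(first_frame: Image, mask: Image, m_k: int,
                  sub_images: List[Image]) -> List[Image]:
    """Return one displayed frame per decoded sub-image of the selected object."""
    position = object_position(mask, m_k)
    if position is None:
        return []
    left, top = position
    canvas = [row[:] for row in first_frame]  # the stationary background is kept
    displayed = []
    for patch in sub_images:
        for dy, row in enumerate(patch):
            # Only the area of the moving object is refreshed.
            canvas[top + dy][left:left + len(row)] = row
        displayed.append([row[:] for row in canvas])
    return displayed
```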

It should be noted that, for the implementation process of performing the expansion processing on the location area formed by the pixel coordinate set, reference may be made to the relevant description in the foregoing step 502, which is not repeated in this embodiment of the present application.

In addition, the rendering sequence of the image area where the one or more moving objects are located is consistent with the sequence of the code stream of the image area in the entire code stream.

In the case where the encoding end encapsulates each code stream by extending the HEIF (ISO/IEC 23008-12 standard) format, the embodiment of the present application can obtain the code stream of the first frame image through to_item_id and decode it to obtain the first frame of image, and can obtain the code stream of each sub-image sequence according to to_track_id and decode it to obtain the sub-image sequence. Then, according to horizontal_offset and vertical_offset, and in the order in which the to_track_id values are parsed, the one or more sub-image sequences are superimposed on the first frame of image to obtain the reconstructed image of the derived image sequence, that is, the reconstructed dynamic image.
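The superposition by offset and parsing order could be sketched as below. This Python example only models already decoded data; the SubImageTrack class and the compose_frame function are assumptions for illustration, while the field names to_track_id, horizontal_offset and vertical_offset are those described above. Patches are assumed to fit inside the first frame.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[int]]  # image[y][x]


@dataclass
class SubImageTrack:
    to_track_id: int
    horizontal_offset: int  # offset from the upper left corner of the first frame
    vertical_offset: int
    frames: List[Image]     # decoded sub-images of this track


def compose_frame(first_frame: Image, tracks: List[SubImageTrack], t: int) -> Image:
    """Superimpose sub-image number t of every track on the first frame image.

    `tracks` is expected in the order in which the to_track_id values were
    parsed, so a later track is drawn over an earlier one where they overlap.
    """
    canvas = [row[:] for row in first_frame]
    for track in tracks:
        for dy, row in enumerate(track.frames[t]):
            y = track.vertical_offset + dy
            x = track.horizontal_offset
            canvas[y][x:x + len(row)] = row
    return canvas
```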

The one or more moving objects mentioned in the above steps 601-603 may be all of the moving objects among the multiple objects included in the dynamic image. Of course, the one or more moving objects may also be only some of the moving objects among the multiple objects. That is, for the moving objects in a dynamic image, the decoding end can decide whether all of these moving objects are to be shown in a moving state, or whether only a selected part of them is to be shown in a moving state.

In this embodiment of the present application, since the image segmentation mask is used to indicate the position of the image region where the one or more moving objects are located in the dynamic image, after the first frame of image is parsed from the code stream, the position of the image area where each moving object is located in the dynamic image can be determined based on the image segmentation mask, and the image area where the one or more moving objects are located is rendered and displayed in the first frame of image. That is, when decoding a dynamic image, after the first frame of image is decoded, only the image area where the moving object is located needs to be decoded for subsequent images, and there is no need to decode the image area where the stationary object is located, which effectively reduces decoding complexity and power consumption. Moreover, in the process of displaying the dynamic image, it is only necessary to render and refresh the image area where the moving object is located on the basis of the first frame of image, thereby effectively reducing the power consumption of the display.

Next, with reference to FIG. 7 , the method for encoding and decoding a dynamic image provided by the embodiments shown in FIG. 5 and FIG. 6 will be exemplarily described.

Encoding side steps:

1) The user selects the encoder to use, and encodes the following syntax elements at the system layer to indicate:

image_codec_type: The type of the image encoder that encodes the first frame of image. For example, image_codec_type can be 0 or 1, where 0 means joint photographic experts group (JPEG) and 1 means portable network graphics (PNG); other types of encoders can also be indicated, such as better portable graphics (BPG), which is not limited here.

mask_codec_type: The image encoder type that encodes the image segmentation mask, such as JPEG or PNG; it can also indicate other types of encoders, such as BPG, which is not limited here.

video_codec_type: The type of the video encoder that encodes the sub-image sequences, such as H.265. Other types of encoders can also be indicated, such as H.264, which is not limited here.

2) According to image_codec_type, call the corresponding encoder to encode the first frame image; an efficient image encoder can be used for the first frame image;

3) Call the corresponding encoder to encode the image segmentation mask according to mask_codec_type, and the encoding of the image segmentation mask can use a general image encoder;

4) Using the image segmentation mask, the image area where each moving object is located is extracted from the images in the dynamic image except the first frame image to form several sub-image sequences. The pixel value of object K in the image segmentation mask is denoted as Mk. The extraction method of the image area where the object K is located is as follows:

Loop over each moving object. Assuming that the image area where the object K is located is currently being extracted, scan the image segmentation mask line by line and record the coordinates of the pixels whose value is Mk in the image segmentation mask to form a set: {(xk,1,yk,1),(xk,2,yk,2),…,(xk,N,yk,N)}, where N is the number of coordinate points;

Find the minimum and maximum values in the above coordinates:

min_Xk=min{xk,1,xk,2,...,xk,N}

min_Yk=min{yk,1,yk,2,…,yk,N}

max_Xk=max{xk,1,xk,2,...,xk,N}

max_Yk=max{yk,1,yk,2,…,yk,N}

The position of the coordinates in the set {(x,y)|min_Xk<=x<=max_Xk,min_Yk<=y<=max_Yk} is the position of the object K.

Extract the square area where the object K is located in the image except the first frame image in the dynamic image as a sub-image sequence.

5) According to video_codec_type, call the corresponding video encoder to encode each sub-image sequence;

6) According to ISOBMFF (ISO/IEC 14496-12-MPEG-4 Part 12) standard, splicing, encapsulating (and transmitting) the code stream obtained in the above steps.
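To make the splicing and encapsulation step more concrete, the Python sketch below wraps each code stream in a container that mimics the basic ISOBMFF box header (a 32-bit big-endian size that includes the 8-byte header, followed by a four-character type). It is only an illustration of the idea: the box types img0, mask and sub0 are made up for this example, the 64-bit size and size-zero cases of the standard are not handled, and the result is not a valid ISOBMFF or HEIF file.

```python
import struct
from typing import List, Tuple


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Prefix a payload with a 4-byte size and a 4-byte type."""
    if len(box_type) != 4:
        raise ValueError("box type must be exactly four bytes")
    return struct.pack(">I", 8 + len(payload)) + box_type + payload


def parse_boxes(data: bytes) -> List[Tuple[bytes, bytes]]:
    """Split a concatenation of boxes back into (type, payload) pairs."""
    boxes = []
    pos = 0
    while pos < len(data):
        (size,) = struct.unpack_from(">I", data, pos)
        if size < 8 or pos + size > len(data):
            raise ValueError("malformed box")
        boxes.append((data[pos + 4:pos + 8], data[pos + 8:pos + size]))
        pos += size
    return boxes


def encapsulate(first_frame: bytes, mask: bytes, sub_sequences: List[bytes]) -> bytes:
    """Splice the separately encoded code streams into one combined code stream."""
    combined = make_box(b"img0", first_frame) + make_box(b"mask", mask)
    for stream in sub_sequences:
        combined += make_box(b"sub0", stream)
    return combined
```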

Decoding side steps:

1) Decode the following information at the system layer:

image_codec_type

mask_codec_type

video_codec_type

2) According to image_codec_type, call the corresponding image decoder to decode and display the first frame image;

3) According to mask_codec_type, call the corresponding decoder to decode the image segmentation mask;

4) Use the image segmentation mask to determine the position of each moving object in the image. The pixel value of object K in the image segmentation mask is denoted as Mk. The position of the moving object is determined as follows:

Loop for each moving object, assuming that the position of the object K is currently determined, scan the image segmentation mask line by line, and record the coordinates of the pixel value Mk. These coordinates form a set: {(xk,1,yk,1),(xk,2,yk,2),…,(xk,N,yk,N)};

Find the minimum and maximum values in the above coordinates:

min_Xk=min{xk,1,xk,2,...,xk,N}

min_Yk=min{yk,1,yk,2,…,yk,N}

max_Xk=max{xk,1,xk,2,...,xk,N}

max_Yk=max{yk,1,yk,2,…,yk,N}

The position of the coordinates in the set {(x,y)|min_Xk<=x<=max_Xk,min_Yk<=y<=max_Yk} is the position of the object K.

5) According to video_codec_type, call the corresponding decoder to decode the sub-image sequence;

6) Render each moving object at its corresponding position and refresh the display. The order of object rendering is consistent with the order of the object's code stream in the entire code stream.

On the basis of FIG. 7 , user interactivity can also be increased, that is, the user at the decoding end can choose to make a specific object move, while other areas remain stationary. Next, with reference to FIG. 8 , the method for encoding and decoding a dynamic image provided by the embodiments shown in FIG. 5 and FIG. 6 will be exemplarily described.

Encoding side steps:

1) The user selects the encoder type used (such as H.265 encoder), and encodes the following syntax elements at the system layer to indicate:

image_codec_type: The image encoder type that encodes the first frame of image, for example, image_codec_type can be 0 or 1, 0 means JPEG, 1 means PNG; it can also indicate other types of encoders, such as BPG, which is not limited here.

video_codec_type: The type of video encoder that encodes the sub-picture sequence or the moving picture itself, such as H.265. Other types of encoders can also be indicated, such as H.264, which is not limited here.

Find the minimum and maximum values in the above coordinates:

min_Xk=min{xk,1,xk,2,...,xk,N}

min_Yk=min{yk,1,yk,2,…,yk,N}

max_Xk=max{xk,1,xk,2,...,xk,N}

max_Yk=max{yk,1,yk,2,…,yk,N}

Decoding side steps:

1) The system layer decodes the following information:

image_codec_type;

mask_codec_type;

video_codec_type.

2) According to image_codec_type, select the corresponding image decoder to decode and display the first frame image;

3) The user clicks on the corresponding position in the first frame of image, and chooses to move the object to which the position belongs; or chooses to move all the objects;

4) According to mask_codec_type, call the corresponding decoder to decode the image segmentation mask;

5) Use the scheme in step 6) to determine the coordinate range of each moving object; then, according to the position (x, y) clicked by the user, determine the position range and code stream index of the moving object at the clicked position. Decode the sub-stream of the selected moving object to obtain the reconstruction of the selected moving object;

6) Use the image segmentation mask to determine the position of the moving object in the image. The pixel value of object K in the image segmentation mask is denoted as Mk. The specific object position determination method is as follows:

Assuming that the position of the object K is currently being determined, scan the image segmentation mask line by line, and record the coordinates of the pixels whose value is Mk in the image segmentation mask. These coordinates form a set: {(xk,1,yk,1),(xk,2,yk,2),…,(xk,N,yk,N)};

Find the minimum and maximum values in the above coordinates:

min_Xk=min{xk,1,xk,2,...,xk,N}

min_Yk=min{yk,1,yk,2,…,yk,N}

max_Xk=max{xk,1,xk,2,...,xk,N}

max_Yk=max{yk,1,yk,2,…,yk,N}

7) Render each moving object at its corresponding position and refresh the display. The order of object rendering is consistent with the order of the object's code stream in the entire code stream.
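The interactive selection on the decoding side might be sketched as follows in Python. The sketch assumes that the clicked object can be identified by reading the mask value at the clicked position, and that a mapping from mask value to code stream index is available; both that mapping (stream_index_by_label) and the function name select_object are hypothetical.

```python
from typing import Dict, List, Optional

Image = List[List[int]]  # mask[y][x]


def select_object(mask: Image, click_x: int, click_y: int,
                  stream_index_by_label: Dict[int, int]) -> Optional[dict]:
    """Find the moving object under the clicked position, if any."""
    m_k = mask[click_y][click_x]
    if m_k not in stream_index_by_label:
        return None  # the click landed on a stationary area
    xs = [x for row in mask for x, value in enumerate(row) if value == m_k]
    ys = [y for y, row in enumerate(mask) if m_k in row]
    return {
        "label": m_k,
        # (min_Xk, min_Yk, max_Xk, max_Yk) of the selected object
        "box": (min(xs), min(ys), max(xs), max(ys)),
        # which sub-stream has to be decoded for this object
        "stream_index": stream_index_by_label[m_k],
    }
```

Only the sub-stream returned for the selected object needs to be decoded, which is what lets the other regions remain stationary.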

Please refer to FIG. 9, which is a flowchart of a second dynamic image encoding method provided by an embodiment of the present application. In this method, the moving image sequence includes one or more sub-image sequences, and the position indication information includes the coordinates of one or more specified positions. The encoding method includes the following steps.

Step 901: Semantically segment any frame of image in the dynamic image to obtain an image segmentation mask, the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions corresponding to the multiple objects one-to-one.

For the content of step 901, reference may be made to the relevant description in step 501, which is not repeated in this embodiment of the present application.

Step 902 : Extract one or more sub-image sequences based on the image segmentation mask and the dynamic image, where the one or more sub-image sequences correspond one-to-one with one or more moving objects in the plurality of objects.

For the content of step 902, reference may be made to the relevant description in step 502, which is not repeated in this embodiment of the present application.

Step 903 : Determine the coordinates in the dynamic image of a specified position in the image area where each of the one or more moving objects is located, and obtain the coordinates of one or more specified positions.

Since the position of the image area where each moving object is located has already been determined in the process of extracting the sub-image sequences in step 902, namely either the location area where each moving object is located or the square area obtained by expanding that location area, the coordinates of the specified position within the image area where each moving object is located can be directly determined.
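As one possible reading of this step, the Python sketch below takes the upper left corner of the expanded area as the specified position, which is the choice used in the FIG. 11 example later in this description. The names num_sub_sequences, position_top_left_x_list and position_top_left_y_list follow that example; the function collect_positions and the nested-list mask layout (mask[y][x]) are assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]  # mask[y][x]


def collect_positions(mask: Image,
                      moving_labels: List[int]) -> Tuple[int, List[int], List[int]]:
    """Return (num_sub_sequences, position_top_left_x_list, position_top_left_y_list)."""
    num_sub_sequences = 0
    position_top_left_x_list: List[int] = []
    position_top_left_y_list: List[int] = []
    for m_k in moving_labels:  # loop over each moving object
        coords = [(x, y)
                  for y, row in enumerate(mask)
                  for x, value in enumerate(row)
                  if value == m_k]
        if not coords:  # an empty coordinate set contributes no sub-image sequence
            continue
        num_sub_sequences += 1
        position_top_left_x_list.append(min(x for x, _ in coords))  # min_Xk
        position_top_left_y_list.append(min(y for _, y in coords))  # min_Yk
    return num_sub_sequences, position_top_left_x_list, position_top_left_y_list
```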

Step 904: Encode the first frame image in the dynamic image, the one or more sub-image sequences, and the coordinates of the one or more designated positions into the code stream.

Optionally, in this embodiment of the present application, the number of the one or more moving objects may also be encoded into the code stream. In this way, for the decoding end, based on the number of the one or more moving objects, it can be determined whether there is a sub-image sequence that fails to transmit in the one or more sub-image sequences, thereby ensuring the reliability of dynamic image decoding.
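A decoding end could use the signalled number roughly as in the following Python sketch, where the function name missing_sub_sequences and the representation of received sub-image sequences as a list of byte strings are assumptions made for illustration.

```python
from typing import List


def missing_sub_sequences(num_moving_objects: int, received: List[bytes]) -> int:
    """Return how many sub-image sequences announced in the code stream did not arrive."""
    return max(0, num_moving_objects - len(received))
```

A result greater than zero would indicate that at least one sub-image sequence failed to transmit, so the decoding end could, for instance, request retransmission or fall back to displaying the first frame only; how such a failure is handled is not specified here.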

For other contents in step 904, reference may be made to the relevant description in step 503, which is not repeated in this embodiment of the present application.

In this embodiment of the present application, only the image area where a moving object is located in the dynamic image changes, while the image area where a stationary object is located does not change. Therefore, after the image area where each moving object is located is extracted from each frame of the dynamic image except the first frame image, and the coordinates in the dynamic image of the specified position in the image area where each moving object is located are determined, the first frame image, the image area where each moving object is located in each of those frames, and the coordinates of the specified positions are encoded into the code stream, so that the dynamic image can be decoded subsequently. That is, the image area where the moving object is located is separated from the image area where the stationary object is located, and for the frames after the first frame only the former is encoded into the code stream; the image area where the stationary object is located does not need to be encoded again, which improves the encoding efficiency. In addition, since the embodiment of the present application can directly reuse the encoders that the encoding end already includes, it is only necessary to encapsulate the code streams obtained by encoding, and there is no need to design a dedicated encoder.

Please refer to FIG. 10 . FIG. 10 is a flowchart of a second dynamic image decoding method provided by an embodiment of the present application, and the decoding method corresponds to the encoding method shown in FIG. 9 . The decoding method includes the following steps.

Step 1001: Parse the first frame of image from the code stream.

For the content in step 1001, reference may be made to the relevant description in step 601, which is not repeated in this embodiment of the present application.

Step 1002: Parse out one or more sub-image sequences and the coordinates of one or more specified positions from the code stream, where the one or more sub-image sequences are in one-to-one correspondence with one or more moving objects, the coordinates of the one or more specified positions are in one-to-one correspondence with the one or more moving objects, and a specified position refers to a specified position in the image area where the corresponding moving object is located.

For the content in step 1002, reference may be made to the relevant description in step 602, which is not repeated in this embodiment of the present application.

Step 1003: Based on the one or more sub-image sequences and the coordinates of one or more designated positions, render and display the image area where the one or more moving objects are located in the first frame of image to obtain a dynamic image.

The process of rendering and displaying the image area where each moving object is located in the first frame of image is the same. Therefore, in some embodiments, a moving object may be selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operation, until the image area where each moving object is located has been rendered and displayed: according to the coordinates in the dynamic image of the specified position in the image area where the selected moving object is located, render and display the image area included in the sub-image sequence corresponding to the selected moving object in the first frame of image.

The one or more moving objects mentioned in the above steps 1001-1003 may be all of the moving objects among the multiple objects included in the dynamic image. Of course, the one or more moving objects may also be only some of the moving objects among the multiple objects. That is, for the moving objects in a dynamic image, the decoding end can decide whether all of these moving objects are to be shown in a moving state, or whether only a selected part of them is to be shown in a moving state.

For other content in step 1003, reference may be made to the relevant description in step 603, which is not repeated in this embodiment of the present application.

In the embodiment of the present application, after the first frame of image is parsed from the code stream, the image area where the one or more moving objects are located can be rendered and displayed in the first frame of image according to the coordinates in the dynamic image of the specified position in the image area where each moving object is located. That is, when decoding a dynamic image, after the first frame of image is decoded, only the image area where the moving object is located needs to be decoded for subsequent images, and there is no need to decode the image area where the stationary object is located, which effectively reduces decoding complexity and power consumption. Moreover, in the process of displaying the dynamic image, it is only necessary to render and refresh the image area where the moving object is located on the basis of the first frame of image, thereby effectively reducing the power consumption of the display.

Next, with reference to FIG. 11 , an exemplary description will be given of the dynamic image encoding and decoding methods provided by the embodiments shown in FIG. 9 and FIG. 10 . Wherein, in this embodiment, the image segmentation mask does not need to be encoded, and only the starting position of each moving object needs to be encoded.

Encoding side steps:

1) Use the image segmentation mask to determine the position of each moving object in the image. The pixel value of object K in the image segmentation mask is denoted as Mk, and the number of objects num_sub_sequences is set to 0. The specific way of determining the position of the moving object is as follows:

Loop over the moving objects; assume that the object currently being extracted is object K.

Scan the image segmentation mask line by line, and record the coordinates of the pixels whose value in the image segmentation mask is Mk. These coordinates form a set: {(x_{k,1}, y_{k,1}), (x_{k,2}, y_{k,2}), …, (x_{k,N}, y_{k,N})};

If the above coordinate set is not empty, num_sub_sequences=num_sub_sequences+1;

Find the minimum and maximum values in the above coordinates:

min_Xk = min{x_{k,1}, x_{k,2}, …, x_{k,N}}

min_Yk = min{y_{k,1}, y_{k,2}, …, y_{k,N}}

max_Xk = max{x_{k,1}, x_{k,2}, …, x_{k,N}}

max_Yk = max{y_{k,1}, y_{k,2}, …, y_{k,N}}

The area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the area where object K is located. Extract this rectangular area to obtain the sub-image sequence of object K;

The width of the sub-image sequence is max_Xk-min_Xk, and the height is max_Yk-min_Yk;

Add min_Xk to position_top_left_x_list;

Add min_Yk to position_top_left_y_list.

2) The system layer encodes the following information:

num_sub_sequences: the number of sub-image sequences

position_top_left_x_list: list of horizontal coordinates of the upper left corners

position_top_left_y_list: list of vertical coordinates of the upper left corners

3) Call the corresponding encoder according to image_codec_type to encode the first frame image; an efficient image encoder can be used for the first frame image;

4) Call the corresponding video encoder according to video_codec_type to encode each sub-image sequence;

5) Splice, encapsulate (and transmit) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard.
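As an illustration only, step 1) of the encoding side above might be sketched in Python as follows. The representation of the mask as a 2-D list of integer labels, the function names, and the tuple layout of a box are assumptions of this sketch rather than part of the embodiment. The crop keeps both bounding coordinates, which is also a choice of this sketch.

```python
def moving_object_boxes(mask, moving_values):
    """Locate each moving object in a segmentation mask.

    mask is a 2-D list of integer labels indexed as mask[y][x], and
    moving_values holds the label Mk of each moving object. Both
    representations are assumptions made for this sketch.
    """
    num_sub_sequences = 0
    position_top_left_x_list = []
    position_top_left_y_list = []
    boxes = []
    for m_k in moving_values:
        # Scan the mask line by line and record the coordinates labelled Mk.
        coords = [(x, y)
                  for y, row in enumerate(mask)
                  for x, value in enumerate(row)
                  if value == m_k]
        if not coords:
            continue  # empty coordinate set: no sub-image sequence for this object
        num_sub_sequences += 1
        xs = [x for x, _ in coords]
        ys = [y for _, y in coords]
        min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
        position_top_left_x_list.append(min_x)
        position_top_left_y_list.append(min_y)
        boxes.append((min_x, min_y, max_x, max_y))
    return (num_sub_sequences, position_top_left_x_list,
            position_top_left_y_list, boxes)


def crop_sub_image_sequence(frames, box):
    """Cut the same rectangle out of every frame (bounds kept inclusive)."""
    min_x, min_y, max_x, max_y = box
    return [[row[min_x:max_x + 1] for row in frame[min_y:max_y + 1]]
            for frame in frames]
```

The two position lists returned by `moving_object_boxes` correspond to the values signalled at the system layer in step 2), and each cropped sequence would then be handed to the video encoder selected in step 4).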

Decoding side steps:

1) Decode the following information at the system layer:

image_codec_type

video_codec_type

num_sub_sequences

position_top_left_x_list

position_top_left_y_list

3) According to video_codec_type, select the corresponding video decoder to decode the sub-image sequences. Specifically, each sub-image j in the K-th frame image is processed as follows:

Decode the corresponding code stream to obtain the reconstructed object;

Get the upper left position of the object: position_top_left_x_list[j], position_top_left_y_list[j];

4) Render each moving object at its corresponding position and refresh the display. The objects are rendered in the same order as their code streams appear in the entire code stream.
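The rendering in step 4) can be pictured with the following minimal Python sketch, which pastes each decoded sub-image onto a copy of the first frame image at its signalled upper left position. Frames are modelled as 2-D lists of pixel values and the function name is hypothetical; a real decoder would write into a display buffer instead.

```python
def render_frame(first_frame, sub_images, position_top_left_x_list,
                 position_top_left_y_list):
    """Overlay the decoded sub-images of one frame on the first frame image.

    sub_images[j] is the reconstruction of object j for the current frame,
    given in code stream order. Only the covered rectangles are rewritten.
    """
    canvas = [row[:] for row in first_frame]  # keep the first frame intact
    for j, sub_image in enumerate(sub_images):
        left = position_top_left_x_list[j]
        top = position_top_left_y_list[j]
        for dy, row in enumerate(sub_image):
            # Replace one row segment of the canvas with the sub-image row.
            canvas[top + dy][left:left + len(row)] = row
    return canvas
```

Because the loop follows the order of `sub_images`, objects that overlap are drawn in code stream order, which matches the rendering order stated in step 4).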

Optionally, in this embodiment of the present application, the HEIF (ISO/IEC 23008-12) format may be extended to encapsulate the foregoing code streams. For example, a derived image sequence of type sovl is added on the basis of the HEIF format, indicating that the derived image sequence is obtained by superimposing one or more sub-image sequences on the first frame image. The one or more sub-image sequences and the first frame image are specified by a sequence reference box (SequenceReferenceBox). The one or more sub-image sequences are encapsulated in tracks specified by the HEIF standard, and the first frame image is encapsulated in an item specified by the HEIF standard.

The syntax for this derived image sequence is as follows:

On the basis of FIG. 11, the user's interactivity can be increased, that is, the user at the decoding end can choose to make a specific object move, while other areas remain stationary. Next, with reference to FIG. 12 , an exemplary description will be given of the dynamic image encoding and decoding methods provided by the embodiments shown in FIG. 9 and FIG. 10 .

Encoding side steps:

Loop over the moving objects; assume that the object currently being extracted is object K.

Find the minimum and maximum values in the above coordinates:

min_Xk = min{x_{k,1}, x_{k,2}, …, x_{k,N}}

min_Yk = min{y_{k,1}, y_{k,2}, …, y_{k,N}}

max_Xk = max{x_{k,1}, x_{k,2}, …, x_{k,N}}

max_Yk = max{y_{k,1}, y_{k,2}, …, y_{k,N}}

Add min_Xk to position_top_left_x_list;

Add min_Yk to position_top_left_y_list.

2) The system layer encodes the following information:

num_sub_sequences: the number of sub-image sequences

position_top_left_x_list: list of horizontal coordinates of the upper left corners

Decoding side steps:

1) Decode the following information at the system layer:

image_codec_type

video_codec_type

num_sub_sequences

position_top_left_x_list

position_top_left_y_list

3) The user issues an instruction to choose to move a specific object or all objects;

4) According to the instruction issued by the user, the sub-code stream of the selected object is decoded to obtain the reconstruction of the selected object;

5) According to video_codec_type, select the corresponding video decoder to decode the sub-image sequence. Specifically, the sub-image j corresponding to the target object in the K-th frame image is processed as follows:

Decode the corresponding code stream to obtain the reconstructed object;
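A minimal Python sketch of the user-driven selection in steps 3) to 5) is given below. The `decode` callable stands in for whichever video decoder video_codec_type selects, and the function name and the use of object indices to express the user's choice are assumptions of this sketch.

```python
def decode_selected_objects(sub_streams, selected, decode):
    """Decode only the sub-code streams of the objects the user selected.

    sub_streams: one encoded sub-code stream per moving object, in stream order.
    selected: iterable of object indices chosen by the user, or None for all.
    decode: callable standing in for the decoder chosen via video_codec_type.
    """
    if selected is None:
        indices = list(range(len(sub_streams)))
    else:
        indices = sorted(set(selected))
    reconstructions = {}
    for j in indices:
        if not 0 <= j < len(sub_streams):
            raise IndexError(f"no sub-code stream for object {j}")
        # Sub-code streams of unselected objects are never passed to the decoder.
        reconstructions[j] = decode(sub_streams[j])
    return reconstructions
```

Objects that are not selected keep the appearance they have in the first frame image, since their sub-code streams are simply not decoded.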

Optionally, the embodiment of the present application may extend the HEIF (ISO/IEC 23008-12) format by adding syntax for a derived image sequence, and use it to encapsulate the first frame image and the sub-image sequences. For details, refer to the description above.

Please refer to FIG. 13 . FIG. 13 is a flowchart of a third dynamic image encoding method provided by an embodiment of the present application. In this method, the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask. The encoding method includes the following steps.

Step 1301: Perform semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask, where the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects.

For the content of step 1301, reference may be made to the relevant description in step 501, which is not repeated in this embodiment of the present application.

Step 1302: Encode the image segmentation mask and the dynamic image into a code stream, where the image segmentation mask is used to indicate the position of the image area where the one or more moving objects are located.

For the image segmentation mask, an image encoder can be used to encode it into the code stream. For the dynamic image, a video encoder can be used to encode it into the code stream. For convenience of description, the image encoder used for the image segmentation mask is called the second image encoder, and the video encoder used for the dynamic image is called the second video encoder. The second video encoder and the above-mentioned first video encoder may be the same or different.

It should be noted that, the encoding end and the decoding end may make an agreement on the second image encoder and the second video encoder in advance. Of course, the second image encoder and the second video encoder can also be selected by the user. When the user selects the second image encoder and the second video encoder, the type of the second image encoder and the type of the second video encoder also need to be encoded into the code stream. Moreover, these image encoders and video encoders may be encoders included in the encoding end itself.

The code streams obtained by the above encoding also need to be encapsulated to obtain a combined code stream, which is then transmitted to the decoding end. The embodiments of the present application may use the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard to encapsulate the above-mentioned code streams, which is not limited in the embodiments of the present application.

In the embodiment of the present application, only the image area where the moving object is located in the dynamic image changes, the image area where the stationary object is located does not change, and the image segmentation mask is used to indicate where the one or more moving objects are located. Therefore, the image segmentation mask and the entire dynamic image are encoded into the code stream, so that after subsequent decoding, the image area where the moving object is located can be extracted from the dynamic image based on the image segmentation mask and then rendered and displayed on the basis of the first frame image, without rendering and displaying the image area where the stationary object is located again, which reduces the display power consumption. In addition, since the embodiment of the present application can directly reuse the encoders included in the encoding end itself, it is only necessary to encapsulate the code streams obtained by encoding, and there is no need to design a corresponding encoder separately.

Please refer to FIG. 14 . FIG. 14 is a flowchart of a third dynamic image decoding method provided by an embodiment of the present application, and the decoding method corresponds to the encoding method shown in FIG. 13 . The decoding method includes the following steps.

Step 1401: Parse the first frame of image from the code stream.

For the content in step 1401, reference may be made to the relevant description in step 601, which is not repeated in this embodiment of the present application.

Step 1402: Parse out the image segmentation mask and the dynamic image from the code stream. The image segmentation mask includes multiple image regions in one-to-one correspondence with multiple objects, the multiple objects include the one or more moving objects, and the image segmentation mask is used to indicate the position of the image area in which the one or more moving objects are located.

For the implementation process of parsing the image segmentation mask from the code stream, reference may be made to the relevant description in step 602, which is not repeated in this embodiment of the present application.

Based on the above description, the video encoder used for the dynamic image is called the second video encoder, and for convenience of description, the video decoder used for the dynamic image may also be called the second video decoder.

The second video encoder may be pre-agreed by the encoding end and the decoding end, or may be selected by the user during the encoding process. If the second video encoder is pre-agreed by the encoding end and the decoding end, the second video decoder is also pre-agreed, and the dynamic image is parsed from the code stream directly by the pre-agreed second video decoder. If the second video encoder is selected by the user, the type of the second video encoder is first parsed from the code stream, the second video decoder is then determined based on that type, and the dynamic image is parsed from the code stream by the determined second video decoder.
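The two cases described above, a pre-agreed decoder and a decoder determined from a signalled type, can be sketched as a small lookup in Python. The registry dictionary, the function name and the `parse_type` callable are hypothetical and serve only to show the control flow.

```python
def select_video_decoder(decoders, pre_agreed_type=None, parse_type=None):
    """Pick the second video decoder.

    decoders: mapping from a codec type name to a decoder object (assumed).
    pre_agreed_type: codec type fixed in advance by both ends, if any.
    parse_type: callable that reads the encoder type from the code stream;
        it is only called when no type was agreed in advance.
    """
    if pre_agreed_type is not None:
        codec_type = pre_agreed_type
    elif parse_type is not None:
        codec_type = parse_type()
    else:
        raise ValueError("no pre-agreed type and nothing to parse")
    if codec_type not in decoders:
        raise ValueError(f"unsupported codec type: {codec_type}")
    return decoders[codec_type]
```

In the pre-agreed case the code stream does not need to be consulted at all, which is why `parse_type` is left untouched in that branch.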

Step 1403: Based on the image segmentation mask and the dynamic image, render and display the image area where the one or more moving objects are located in the first frame of image to obtain a dynamic image.

The image area where each moving object is located is rendered and displayed in the first frame of image in the same way. Therefore, in some embodiments, a moving object may be selected from the one or more moving objects, and the following operations are performed on the selected moving object until the image area where each moving object is located has been rendered and displayed: determine the position of the image area where the selected moving object is located based on the image segmentation mask; based on that position, extract the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame of image; and, according to that position, render and display in the first frame of image the image area where the selected moving object is located in each of those frames.

For the implementation process of determining the position of the image region where the selected moving object is located based on the image segmentation mask, reference may be made to the relevant description in the foregoing step 603, which is not repeated in this embodiment of the present application. For the implementation process of extracting, based on that position, the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame image, reference may be made to the relevant description in the above step 502, which is likewise not repeated here.

It should be noted that the rendering sequence of the image region where the one or more moving objects are located is consistent with the sequence of the code stream of the image region in the entire code stream.

The one or more moving objects mentioned in the above steps 1401-1403 may be all of the moving objects among the multiple objects included in the dynamic image, or only some of them. That is, for the moving objects in a dynamic image, the decoding end may keep all of them in a moving state, or may select only a part of them to be in a moving state.

In this embodiment of the present application, since the image segmentation mask is used to indicate the position of the image region where the one or more moving objects are located in the dynamic image, after the first frame of image is parsed from the code stream, the position of the image area where each moving object is located in the dynamic image can be determined based on the image segmentation mask, the image area where each moving object is located can be extracted from each frame of the dynamic image except the first frame image, and the image area where the one or more moving objects are located can then be rendered and displayed in the first frame image. That is, in the process of displaying a dynamic image, it is only necessary to render and refresh the image area where the moving object is located on the basis of the first frame of image, thereby effectively reducing the power consumption of the display.
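The mask-driven refresh described above might look as follows in Python. This sketch assumes that the position of an object's image area is the bounding rectangle of its label in the mask, that frames and the mask are 2-D lists of equal size, and that the function names are free to choose; none of these details is fixed by the embodiment.

```python
def bounding_box(mask, label):
    """Inclusive bounding rectangle of a label in the mask, or None if absent."""
    coords = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row)
              if value == label]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


def refresh_moving_regions(first_frame, frame, mask, moving_labels):
    """Copy only the moving objects' areas of a decoded frame onto the first frame."""
    canvas = [row[:] for row in first_frame]
    for label in moving_labels:
        box = bounding_box(mask, label)
        if box is None:
            continue
        min_x, min_y, max_x, max_y = box
        for y in range(min_y, max_y + 1):
            # Pixels outside the rectangle keep their first-frame values.
            canvas[y][min_x:max_x + 1] = frame[y][min_x:max_x + 1]
    return canvas
```

Here the whole frame has already been decoded, so the saving is on the rendering side: only the rectangles of the moving objects are rewritten.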

Next, an exemplary description will be given of the dynamic image encoding and decoding methods provided by the embodiments shown in FIG. 13 and FIG. 14 . In this embodiment, the user can choose to make certain objects move, while other areas remain stationary.

Encoding side steps:

1) The user selects the encoder types to be used, and the following syntax elements are encoded at the system layer to indicate them:

2) Call the corresponding encoder according to mask_codec_type to encode the image segmentation mask; a general image encoder can be used for the image segmentation mask;

3) Regard the dynamic image as a complete video, and call the corresponding video encoder according to video_codec_type to encode the dynamic image;

4) Splice, encapsulate (and transmit) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard. Alternatively, transmit the encoded code stream of the image segmentation mask through an SEI message.

Decoding side steps:

1) The system layer decodes the following information:

mask_codec_type;

video_codec_type.

2) According to mask_codec_type, select the corresponding decoder to decode the image segmentation mask;

3) According to video_codec_type, select the corresponding decoder to decode and reconstruct the dynamic image;

4) The user issues an instruction and chooses to move a specific object or all objects;

5) According to the command signal issued by the user, the designated object is rendered at its corresponding position, and the display is refreshed.

Please refer to FIG. 15. FIG. 15 is a flowchart of a fourth dynamic image encoding method provided by an embodiment of the present application. In this method, the moving image sequence includes the dynamic image, and the position indication information includes an image segmentation mask. The encoding method includes the following steps.

Step 1501: Perform semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask, where the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects.

For the content of step 1501, reference may be made to the relevant description in step 501, which is not repeated in this embodiment of the present application.

Step 1502: Based on the image segmentation mask, determine a plurality of segmentation regions corresponding to the plurality of objects one-to-one.

In some embodiments, the location area where each of the plurality of objects is located may be determined based on the image segmentation mask. If the location area where any one of the plurality of objects is located does not contain an integer number of coding tree units (CTUs), the boundary of that location area is extended so that it contains an integer number of CTUs. The location areas of the plurality of objects after the extension processing are determined as the plurality of divided areas.
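The boundary extension can be illustrated with a short Python sketch that snaps a location area outward to the CTU grid. The CTU size is passed in as a parameter, the box layout (inclusive input, exclusive right and bottom edges in the output) is a convention of this sketch, and clamping to the picture size is an added assumption for areas that touch the right or bottom picture edge.

```python
def align_to_ctu(box, ctu_size, width, height):
    """Extend an inclusive box (min_x, min_y, max_x, max_y) to CTU boundaries.

    Returns (left, top, right, bottom) with right and bottom exclusive.
    The result is clamped to the picture, where the last CTU may be partial.
    """
    min_x, min_y, max_x, max_y = box
    left = (min_x // ctu_size) * ctu_size            # round down to the grid
    top = (min_y // ctu_size) * ctu_size
    right = -(-(max_x + 1) // ctu_size) * ctu_size   # round up to the grid
    bottom = -(-(max_y + 1) // ctu_size) * ctu_size
    return left, top, min(right, width), min(bottom, height)
```

For example, with a CTU size of 64, an area spanning columns 70 to 130 is widened to columns 64 to 191, so that it covers exactly two CTU columns.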

Step 1503 : Divide each frame of image in the dynamic image except the first frame of image according to the plurality of divided areas to obtain a plurality of image areas.

Since the plurality of divided areas are in one-to-one correspondence with the plurality of objects, after each frame of image in the dynamic image except the first frame of image is divided, each frame of image includes image areas in one-to-one correspondence with the plurality of divided areas, that is, image areas in one-to-one correspondence with the plurality of objects.

Step 1504: Determine an object state corresponding to each of the plurality of divided regions, where the object state includes a static state or a moving state.

Since one segmented area corresponds to one object, the state of the object that corresponds to each segmented area may be determined as the object state of that segmented area.

Step 1505: Encode the first frame image in the dynamic image, the multiple image regions, the object state corresponding to each segmented region in the multiple segmented regions, and the image segmentation mask into the code stream, where the image segmentation mask is used to indicate the position of the image area where the one or more moving objects are located.

For the content of encoding the first frame of image and the image segmentation mask in the dynamic image, reference may be made to the relevant description in step 503, which is not repeated in this embodiment of the present application.

For the multiple image areas, the implementation process of encoding them into the code stream includes: encoding each of the multiple image areas into the code stream as one coding block; or encoding the area composed of each row of CTUs in each of the multiple image areas into the code stream as one coding block. The block referenced by a coding block is located within the location area, in the reference image, that corresponds to the location area where that coding block is located.

In the embodiment of the present application, only the image area where the moving object is located in the dynamic image changes, and the image area where the stationary object is located does not change. Therefore, each frame of image in the dynamic image except the first frame image is divided into regions according to the multiple segmentation areas, and the first frame image, the divided image regions, the object state corresponding to each segmentation area, and the image segmentation mask are encoded into the code stream, so that the dynamic image can be obtained by subsequent decoding. That is, the image area where the moving object is located and the image area where the stationary object is located in the dynamic image are separated and then encoded into the code stream separately. In this way, in subsequent decoding, only the image area corresponding to the moving state needs to be decoded, and the image area corresponding to the static state does not need to be decoded, which improves the decoding efficiency. In addition, since the embodiment of the present application can directly reuse the encoders included in the encoding end itself, it is only necessary to encapsulate the code streams obtained by encoding, and there is no need to design a corresponding encoder separately.

Please refer to FIG. 16 . FIG. 16 is a flowchart of a fourth dynamic image decoding method provided by an embodiment of the present application, and the decoding method corresponds to the encoding method shown in FIG. 15 . The decoding method includes the following steps.

Step 1601: Parse the first frame of image from the code stream.

For the content in step 1601, reference may be made to the relevant description in step 601, which is not repeated in this embodiment of the present application.

Step 1602: Parse out the image segmentation mask from the code stream. The image segmentation mask includes multiple image regions in one-to-one correspondence with multiple objects, the multiple objects include the one or more moving objects, and the image segmentation mask is used to indicate the position of the image area where the one or more moving objects are located.

For the content in step 1602, reference may be made to the relevant description in step 602, which is not repeated in this embodiment of the present application.

Step 1603: Based on the image segmentation mask, determine a plurality of segmentation regions corresponding to the plurality of objects one-to-one.

For the content in step 1603, reference may be made to the relevant description in step 1502, which is not repeated in this embodiment of the present application.

Step 1604: Parse out the object state corresponding to each of the plurality of divided regions from the code stream, where the object state includes a static state or a moving state.

Step 1605: Based on the object state corresponding to each of the plurality of divided regions, parse out the image region divided by the divided region corresponding to the motion state from the code stream.

Since each segmented area corresponds to one object state, which is either a moving state or a static state, and each image area in the code stream is obtained by division according to a segmented area, the image areas divided by the segmented areas corresponding to the moving state can be parsed directly from the code stream.

Step 1606: Render and display the image area where the one or more moving objects are located in the first frame of image based on the image area divided by the segmentation area corresponding to the motion state to obtain a dynamic image.

That is, in the first frame of image, the image area divided by the segmentation area corresponding to the motion state is rendered and displayed to obtain a dynamic image.

The one or more moving objects mentioned in the above steps 1601-1606 may be all of the moving objects among the multiple objects included in the dynamic image, or only some of them. That is, for the moving objects in a dynamic image, the decoding end may keep all of them in a moving state, or may select only a part of them to be in a moving state.

In this embodiment of the present application, after the first frame of image is parsed from the code stream, the image areas divided by the segmented areas corresponding to the moving state can be parsed from the code stream according to the object state corresponding to each segmented area, without parsing the image areas divided by the segmented areas corresponding to the static state, which effectively reduces decoding complexity and power consumption. Moreover, in the process of displaying the dynamic image, it is only necessary to render and refresh, on the basis of the first frame of image, the image areas divided by the segmented areas corresponding to the moving state, thereby effectively reducing the power consumption of the display.
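As a rough Python illustration of this state-driven decoding, the sketch below walks over the per-area code streams and decodes only those whose object state is the moving state. The state constants, the list-based pairing of streams and states, and the `decode` callable are assumptions of this sketch.

```python
MOVING = "moving"
STATIC = "static"


def decode_moving_regions(region_streams, object_states, decode):
    """Decode only the image areas whose segmented area is in the moving state.

    region_streams[i] is the code stream of the image area for segmented area i,
    object_states[i] is MOVING or STATIC, and decode stands in for the decoder.
    Returns the decoded areas by index and the list of skipped indices.
    """
    if len(region_streams) != len(object_states):
        raise ValueError("one object state is required per segmented area")
    decoded = {}
    skipped = []
    for region_id, (stream, state) in enumerate(zip(region_streams, object_states)):
        if state == MOVING:
            decoded[region_id] = decode(stream)
        else:
            skipped.append(region_id)  # static area: the first frame already shows it
    return decoded, skipped
```

The decoded areas would then be rendered over the first frame image, while the skipped areas cost neither decoding nor display refresh.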

Next, with reference to FIG. 17 , an exemplary description will be given of the dynamic image encoding and decoding methods provided by the embodiments shown in FIG. 15 and FIG. 16 . In this embodiment, the image division method in the existing video coding standard is used. On the encoding side, each frame of the dynamic image is divided into several fixed-pattern slices or tiles using an image segmentation mask, and each slice/tile can be decoded independently.

Encoding side steps:

video_codec_type: The type of video encoder that encodes the moving image itself, such as H.265. Other types of encoders can also be indicated, such as H.264, which is not limited here.

2) Call the corresponding encoder according to image_codec_type to encode the first frame image; an efficient image encoder can be used for the first frame image;

3) Call the corresponding encoder to encode the image segmentation mask according to mask_codec_type, and the encoding of the image segmentation mask can use a mainstream image encoder;

4) Use the image segmentation mask to divide each frame of image in the dynamic image into slices or tiles with a fixed pattern.

First, the image segmentation mask is used to determine the position of the moving object in the image. The pixel value of object K in the image segmentation mask is denoted as Mk. The specific object position determination method is as follows:

Find the minimum and maximum values in the above coordinates:

min_Xk = min{x_{k,1}, x_{k,2}, …, x_{k,N}}

min_Yk = min{y_{k,1}, y_{k,2}, …, y_{k,N}}

max_Xk = max{x_{k,1}, x_{k,2}, …, x_{k,N}}

max_Yk = max{y_{k,1}, y_{k,2}, …, y_{k,N}}

The area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the position of object K;

Boundary processing: If the area determined by the above coordinate set does not contain an integer number of CTUs (its upper, lower, left, or right edge is not on a CTU boundary), pad rows/columns of pixels on the corresponding sides so that the current area contains an integer number of CTUs;

If tiles are used for area division, the above rectangular area can be used directly as an independent tile; if slices are used for area division, the area composed of each row of CTUs in the above rectangular area is used as a separate slice;

5) Call the corresponding video encoder according to video_codec_type to encode the divided slices or tiles individually. During encoding, the range of inter-frame prediction motion vectors must be constrained to the slice/tile at the corresponding position in the reference image; for an H.265 encoder, motion-constrained tile sets (MCTS) can be used;
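The tile/slice choice in step 4) above can be sketched in Python as follows. The function takes an area that has already been aligned to CTU boundaries, given as (left, top, right, bottom) with exclusive right and bottom edges; this layout, the `mode` strings and the function name are assumptions made for illustration and do not describe any encoder API.

```python
def split_region(region, ctu_size, mode):
    """Turn one CTU-aligned area into independently decodable units.

    mode "tile": the whole rectangle becomes one tile.
    mode "slice": every row of CTUs inside the rectangle becomes one slice.
    """
    left, top, right, bottom = region
    if mode == "tile":
        return [region]
    if mode == "slice":
        return [(left, y, right, min(y + ctu_size, bottom))
                for y in range(top, bottom, ctu_size)]
    raise ValueError(f"unknown division mode: {mode}")
```

Each returned rectangle would then be encoded on its own in step 5), with motion vectors restricted to the co-located rectangle of the reference image.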

Decoding side steps:

1) The system layer decodes the following information:

image_codec_type;

mask_codec_type;

video_codec_type.

2) The system layer extracts each sub-code stream from the code stream for subsequent decoding;

3) According to image_codec_type, call the corresponding image decoder to decode the first frame image and display it;

5) The system layer controls the decoder to only decode the slice or tile corresponding to the object in the motion state according to the image segmentation mask;

6) Use the image segmentation mask to divide the dynamic image into slices or tiles of a fixed pattern;

Find the minimum and maximum values in the above coordinates:

min_Xk = min{x_{k,1}, x_{k,2}, …, x_{k,N}}

min_Yk = min{y_{k,1}, y_{k,2}, …, y_{k,N}}

max_Xk = max{x_{k,1}, x_{k,2}, …, x_{k,N}}

max_Yk = max{y_{k,1}, y_{k,2}, …, y_{k,N}}

Boundary processing: If the area formed by the above set does not contain an integer number of CTUs (its upper, lower, left, or right edge is not on a CTU boundary), pad rows/columns of pixels on the corresponding sides so that the current area contains an integer number of CTUs;

7) The system layer uses the segmentation area and object state obtained by the image segmentation mask, skips the slice/tile corresponding to the object in the static state, and only decodes the slice/tile corresponding to the object in the moving state;

8) Render each object at its corresponding position and refresh the display. The objects are rendered in the same order as the slice/tile code streams appear in the entire code stream.

On the basis of FIG. 17 , the interactivity of the user can also be increased, that is, the user at the decoding end can choose to make a specific object move, while other areas remain stationary.

FIG. 18 is a schematic structural diagram of a dynamic image encoding apparatus provided by an embodiment of the present application. The encoding apparatus may be implemented by software, hardware, or a combination of the two to become part or all of an encoding end device, and the encoding end device may be the source device shown in FIG. 1 or the cloud server shown in FIG. 2 . Referring to FIG. 18 , the apparatus includes: a semantic segmentation module 1801 , an image sequence extraction module 1802 , a position indication information determination module 1803 and a first encoding module 1804 .

The semantic segmentation module 1801 is used to perform semantic segmentation on any frame image in the dynamic image to obtain an image segmentation mask, where the dynamic image includes multiple objects, and the image segmentation mask includes multiple image areas in one-to-one correspondence with the multiple objects. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The semantic segmentation module 1801 corresponds to the semantic segmentation module 111 in FIG. 3 .

The image sequence extraction module 1802 is configured to determine a moving image sequence based on the dynamic image, where each frame of image in the moving image sequence includes an image area where one or more moving objects among the plurality of objects are located. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The image sequence extraction module 1802 corresponds to the image sequence extraction module 112 in FIG. 3 .

The location indication information determining module 1803 is configured to determine location indication information based on the image segmentation mask, where the location indication information is used to indicate the location of the image area where the one or more moving objects are located. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The modules corresponding to the location indication information determining module 1803 are not shown in FIG. 3 .

The first encoding module 1804 is used for encoding the moving image sequence and position indication information into the code stream. For the detailed implementation process, refer to the corresponding content in each of the foregoing embodiments, which will not be repeated here. The first encoding module 1804 corresponds to the location indication information encoding module 113 and the first video encoding module 115 in FIG. 3 .

Optionally, the moving image sequence includes one or more sub-image sequences, and the position indication information is an image segmentation mask;

Image sequence extraction module 1802 includes:

The image sequence extraction sub-module is used to extract the one or more sub-image sequences based on the image segmentation mask and the dynamic image, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects.

Optionally, the moving image sequence includes one or more sub-image sequences, and the location indication information includes coordinates of one or more designated locations;

Image sequence extraction module 1802 includes:

an image sequence extraction sub-module, configured to extract the one or more sub-image sequences based on the image segmentation mask and the dynamic image, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;

The location indication information determination module 1803 includes:

The position coordinate determination sub-module is used for determining the coordinates in the dynamic image of the specified position in the image area where each moving object of the one or more moving objects is located based on the image segmentation mask.

Optionally, the image sequence extraction submodule includes:

The selection sub-module is used to select a moving object from the one or more moving objects, and the sub-image sequence corresponding to the selected moving object is determined by the following modules, until the sub-image sequence corresponding to each moving object is determined:

The location area determination sub-module is used to determine the location area where the selected moving object is located based on the image segmentation mask;

The image area extraction sub-module is used to extract the image area where the selected moving object is located from each frame of the dynamic image except the first frame image based on the position area, and obtain the sub-image sequence corresponding to the selected moving object.

Optionally, the location area determination submodule is specifically used for:

Scan each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, and the pixel coordinate set includes the coordinates of a plurality of pixel points;

The location area formed by the pixel coordinate set is determined as the location area where the selected moving object is located.
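As a non-limiting illustration, the mask scan described above can be sketched as follows. The representation of the mask as a nested list of integer labels, and the names `Mask` and `pixel_coordinate_set`, are assumptions made for this sketch and do not appear in the embodiments.

```python
from typing import List, Set, Tuple

# Assumption: mask[y][x] holds the integer label of the object covering pixel (x, y).
Mask = List[List[int]]


def pixel_coordinate_set(mask: Mask, label: int) -> Set[Tuple[int, int]]:
    """Scan every pixel of the mask and collect the (x, y) coordinates carrying `label`."""
    coords = set()
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value == label:
                coords.add((x, y))
    return coords


# A 4x3 mask in which the selected moving object carries the label 2.
mask = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 2, 0, 0],
]
area = pixel_coordinate_set(mask, label=2)
```

The returned set is the pixel coordinate set; the location area of the selected moving object is the area formed by these coordinates.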

Optionally, the image region extraction submodule is specifically used for:

Extracting the image area located in the position area from each frame of image in the dynamic image except the first frame image;

or,

The position area is expanded so that the expanded position area is a square area, and an image area located in the expanded position area is extracted from each frame of images in the dynamic image except the first frame image.
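As a non-limiting illustration, the second alternative above can be sketched as follows. The sketch assumes that the expanded area is the axis-aligned bounding rectangle of the pixel coordinate set, which is only one possible reading of "square area"; frames are modeled as nested lists, and the names `bounding_box`, `crop` and `sub_image_sequence` are hypothetical.

```python
from typing import Iterable, List, Tuple

Frame = List[List[int]]          # frame[y][x] is a pixel value
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), all inclusive


def bounding_box(coords: Iterable[Tuple[int, int]]) -> Box:
    """Expand a non-empty pixel coordinate set to its enclosing rectangle (assumed reading)."""
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


def crop(frame: Frame, box: Box) -> Frame:
    """Extract the image area of `frame` located inside `box`."""
    x0, y0, x1, y1 = box
    return [row[x0:x1 + 1] for row in frame[y0:y1 + 1]]


def sub_image_sequence(frames: List[Frame], box: Box) -> List[Frame]:
    """Crop the box from every frame except the first frame."""
    return [crop(frame, box) for frame in frames[1:]]


# Three 4x3 frames; the pixel at (x, y) of frame k has the value k*100 + y*10 + x.
frames = [[[k * 100 + y * 10 + x for x in range(4)] for y in range(3)] for k in range(3)]
box = bounding_box({(1, 1), (2, 1), (1, 2)})
sub_sequence = sub_image_sequence(frames, box)
```

In this example the sub-image sequence contains two cropped frames, one for each frame after the first.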

Optionally, the specified position is the position with the smallest coordinates, or the position with the largest coordinates.

Optionally, the device also includes:

The second encoding module is configured to encode the number of the one or more moving objects into the code stream. The modules corresponding to the second encoding module are not shown in FIG. 3 .
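As a non-limiting illustration, writing the number of moving objects together with the coordinates of the specified positions can be sketched as follows. The byte layout (a one-byte count followed by big-endian 16-bit coordinate pairs) is purely hypothetical and is not the code stream syntax of the embodiments; the names `pack_positions` and `unpack_positions` are likewise assumptions.

```python
import struct
from typing import List, Tuple


def pack_positions(positions: List[Tuple[int, int]]) -> bytes:
    """Hypothetical layout: uint8 object count, then one big-endian uint16 (x, y) pair per object."""
    out = struct.pack(">B", len(positions))
    for x, y in positions:
        out += struct.pack(">HH", x, y)
    return out


def unpack_positions(data: bytes) -> List[Tuple[int, int]]:
    """Read the count first, then exactly that many (x, y) pairs."""
    (count,) = struct.unpack_from(">B", data, 0)
    return [struct.unpack_from(">HH", data, 1 + 4 * i) for i in range(count)]


# One specified position (for example, the smallest coordinates) per moving object.
positions = [(1, 1), (300, 20)]
payload = pack_positions(positions)
decoded = unpack_positions(payload)
```

Carrying the count ahead of the coordinates lets a reader know how many coordinate pairs to parse before it reads them.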

Optionally, the moving image sequence is the dynamic image, and the position indication information is the image segmentation mask.

Optionally, the device also includes:

a segmentation area determination module, configured to determine a plurality of segmentation areas corresponding to the multiple objects one-to-one based on the image segmentation mask;

an area division module, configured to perform area division on each frame of image in the dynamic image except the first frame image according to the plurality of divided areas, so as to obtain a plurality of image areas;

an object state determination module, configured to determine an object state corresponding to each segmented region in the plurality of segmented regions, and the object state includes a static state or a motion state;

The first encoding module includes:

an image region encoding submodule, used for encoding the multiple image regions into a code stream;

The device also includes:

The third encoding module is configured to encode the object state corresponding to each of the plurality of divided regions into the code stream.

Wherein, Fig. 3 does not show the modules corresponding to the segmentation region determination module, the region division module, the object state determination module and the third encoding module. The image region coding sub-module corresponds to the first video coding module 115 in FIG. 3 .

Optionally, the segmented area determination module is specifically used for:

determining the location area where each object in the plurality of objects is located based on the image segmentation mask;

If the location area where any object among the multiple objects is located does not contain an integer number of coding tree units (CTUs), the boundary of the location area where that object is located is extended, so that the location area contains an integer number of CTUs;

The location areas where the multiple objects are located after the expansion processing are determined as the multiple divided areas.
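As a non-limiting illustration, the boundary extension can be sketched as rounding each edge of a rectangular location area outward to the nearest multiple of the CTU size. The sketch assumes square CTUs, a rectangular location area, and picture dimensions that are themselves multiples of the CTU size; the name `align_to_ctu` is hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x1 and y1 exclusive


def align_to_ctu(box: Box, ctu: int) -> Box:
    """Extend the box outward so that every edge lies on a multiple of the CTU size."""
    x0, y0, x1, y1 = box
    x0 = (x0 // ctu) * ctu       # round the left edge down
    y0 = (y0 // ctu) * ctu       # round the top edge down
    x1 = -(-x1 // ctu) * ctu     # round the right edge up (ceiling division)
    y1 = -(-y1 // ctu) * ctu     # round the bottom edge up
    return x0, y0, x1, y1


aligned = align_to_ctu((70, 10, 130, 64), ctu=64)
unchanged = align_to_ctu((0, 0, 64, 128), ctu=64)
```

A location area that already contains an integer number of CTUs is returned unchanged, which matches the condition under which the extension is applied.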

Optionally, the image region coding submodule is specifically used for:

Encoding each image area in the multiple image areas as an encoding block into the code stream respectively;

or,

Encoding the region composed of each row of CTUs in each of the multiple image regions as a coding block into the code stream;

Wherein, the location area where the reference coding block is located is located in the location area where the referenced coding block is located.
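As a non-limiting illustration, the second alternative above, in which each row of CTUs of an image area forms one coding block, can be sketched as follows. The sketch assumes a CTU-aligned rectangular image area and square CTUs; the name `ctu_row_blocks` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x1 and y1 exclusive


def ctu_row_blocks(region: Box, ctu: int) -> List[Box]:
    """Split a CTU-aligned region into one block per row of CTUs."""
    x0, y0, x1, y1 = region
    return [(x0, y, x1, y + ctu) for y in range(y0, y1, ctu)]


blocks = ctu_row_blocks((64, 0, 192, 128), ctu=64)
```

Under the first alternative, the whole image area would instead be treated as a single coding block.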

Optionally, the device also includes:

The fourth encoding module is used for encoding the first frame image of the dynamic image into the code stream. The fourth encoding module corresponds to the image encoding module 114 in FIG. 3 .

It should be noted that when the dynamic image encoding apparatus provided in the above embodiments encodes a dynamic image, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the dynamic image encoding apparatus provided in the above embodiments and the dynamic image encoding method embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiments, which will not be repeated here.

FIG. 19 is a schematic structural diagram of a dynamic image decoding apparatus provided by an embodiment of the present application. The decoding apparatus may be implemented by software, hardware, or a combination of the two as part or all of a decoding end device, and the decoding end device may be the destination device shown in FIG. 1 or the terminal device shown in FIG. 2. Referring to FIG. 19, the apparatus includes: an image decoding module 1901, a first decoding module 1902 and an image synthesis module 1903.

The image decoding module 1901 is used for parsing the first frame of image from the code stream. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The image decoding module 1901 corresponds to the image decoding module 212 in FIG. 4 .

The first decoding module 1902 is configured to parse the moving image sequence and the position indication information from the code stream, where each frame in the moving image sequence includes an image area where one or more moving objects are located, and the position indication information is used to indicate the position of the image area where the one or more moving objects are located. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The first decoding module 1902 corresponds to the location indication information decoding module 211 and the first video decoding module 213 in FIG. 4.

The image synthesis module 1903 is configured to render and display the image area where the one or more moving objects are located in the first frame image based on the moving image sequence and the position indication information, to obtain a moving image. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The image synthesis module 1903 corresponds to the image synthesis module 214 in FIG. 4 .

Optionally, the moving image sequence includes one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;

The position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions corresponding to a plurality of objects one-to-one, and the plurality of objects include the one or more moving objects.

Optionally, the image synthesis module 1903 includes:

The selection sub-module is configured to select a moving object from the one or more moving objects, and the following modules render and display the image area where the selected moving object is located, until the image area where each moving object is located has been rendered and displayed:

a position determination sub-module for determining the position of the image area where the selected moving object is located based on the image segmentation mask;

The rendering and display sub-module is configured to render and display the image area included in the sub-image sequence corresponding to the selected moving object in the first frame of image according to the position of the image area where the selected moving object is located.

Optionally, the location determination submodule is specifically used for:

The location area formed by the pixel coordinate set is determined as the position of the image area where the selected moving object is located; or the location area formed by the pixel coordinate set is expanded so that the expanded location area is a square area, and the expanded location area is determined as the position of the image area where the selected moving object is located.

The position indication information includes the coordinates in the dynamic image of a specified position within the image area where each of the one or more moving objects is located.

Optionally, the image synthesis module 1903 is specifically used for:

Select a moving object from the one or more moving objects, and perform the following operations to render and display the image area where the selected moving object is located, until the image area where each moving object is located is rendered and displayed:

According to the coordinates in the dynamic image of the specified position in the image area where the selected moving object is located, the image area included in the sub-image sequence corresponding to the selected moving object is rendered and displayed in the first frame of image.
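As a non-limiting illustration, rendering a decoded sub-image into the first frame at the specified position can be sketched as follows. The sketch assumes that the specified position is the position with the smallest coordinates (the top-left corner of the image area), that frames are nested lists of pixel values, and that the sub-image fits inside the frame; the name `render_region` is hypothetical.

```python
from typing import List

Frame = List[List[int]]  # frame[y][x] is a pixel value


def render_region(base: Frame, patch: Frame, x_min: int, y_min: int) -> Frame:
    """Return a copy of `base` with `patch` drawn so that its top-left pixel lands at (x_min, y_min)."""
    out = [row[:] for row in base]  # leave the decoded first frame untouched
    for dy, patch_row in enumerate(patch):
        out[y_min + dy][x_min:x_min + len(patch_row)] = patch_row
    return out


first_frame = [[0] * 4 for _ in range(3)]
sub_sequence = [[[7, 8], [9, 1]], [[5, 6], [3, 4]]]  # one 2x2 sub-image per later frame

# The dynamic image is the first frame followed by one composited frame per sub-image.
dynamic_image = [first_frame] + [render_region(first_frame, p, 1, 1) for p in sub_sequence]
```

Only the pixels inside the image area of the moving object change from frame to frame; every other pixel is taken from the first frame.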

Optionally, the device also includes:

The second decoding module is configured to parse out the number of the one or more moving objects from the code stream. For the detailed implementation process, refer to the corresponding contents in the foregoing embodiments, which will not be repeated here. The modules corresponding to the second decoding module are not shown in FIG. 4 .

Optionally, the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask, where the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, and the plurality of objects include the one or more moving objects.

Optionally, the image synthesis module 1903 is specifically used for:

Determine the position of the image area where the selected moving object is located based on the image segmentation mask;

Based on the position of the image area where the selected moving object is located, extracting the image area where the selected moving object is located from each frame of images in the dynamic image except the first frame image;

According to the position of the image area where the selected moving object is located, the image area where the selected moving object is located in each frame of the dynamic image is rendered and displayed in the first frame of image.

Optionally, the position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions corresponding to a plurality of objects one-to-one, and the plurality of objects include the one or more moving objects;

The first decoding module 1902 includes:

a segmentation area determination submodule, used for determining a plurality of segmentation areas corresponding to the multiple objects one-to-one based on the image segmentation mask;

an object state determination submodule, used for parsing out the object state corresponding to each of the plurality of divided regions from the code stream, and the object state includes a static state or a motion state;

The image area decoding sub-module is used for analyzing the image area divided by the divided area corresponding to the motion state from the code stream based on the object state corresponding to each divided area in the plurality of divided areas to obtain a moving image sequence.
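As a non-limiting illustration, selecting which divided regions to decode from the parsed object states can be sketched as follows. The numeric state codes, the rectangular region representation, and the names `motion_regions` and `area` are assumptions made for this sketch.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x1 and y1 exclusive

STATIC, MOTION = 0, 1  # hypothetical codes for the two object states


def motion_regions(regions: List[Box], states: List[int]) -> List[Box]:
    """Keep only the divided regions whose object state is the motion state."""
    return [box for box, state in zip(regions, states) if state == MOTION]


def area(box: Box) -> int:
    """Number of pixels covered by a region."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


regions = [(0, 0, 128, 128), (128, 0, 256, 128)]
states = [STATIC, MOTION]
to_decode = motion_regions(regions, states)
decoded_fraction = sum(area(b) for b in to_decode) / sum(area(b) for b in regions)
```

In this example only half of the pixels of each later frame fall in a region that needs to be decoded; the static region is reused from the first frame.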

Optionally, the segmented area determination submodule is specifically used for:

In the case where the location area where any object is located in the plurality of objects does not contain an integer number of CTUs, the boundary of the location area where any object is located is extended, so that the location area where any object is located contains an integer number of CTUs;

Optionally, the device also includes:

an instruction receiving module, configured to receive an object selection instruction, and the object selection instruction is used to select one or more objects from a plurality of objects included in the dynamic image;

The moving object determination module is configured to determine one or more objects selected by the object selection instruction as the one or more moving objects.

Optionally, the device also includes:

The third decoding module is used to parse out the encoder type used for encoding from the code stream;

The decoder type determination module is used to determine the corresponding decoder type according to the parsed encoder type.

The modules corresponding to the instruction receiving module, the moving object determination module, the third decoding module and the decoder type determination module are not shown in FIG. 4 .

In the dynamic image decoding method provided by the embodiments of the present application, after the first frame is decoded, only the image area where the moving object is located needs to be decoded for subsequent frames, and the image area where a still object is located does not need to be decoded, which effectively reduces decoding complexity and power consumption. Moreover, in the process of displaying the dynamic image, it is only necessary to render and refresh the image area where the moving object is located on the basis of the first frame, thereby effectively reducing the power consumption of the display.

It should be noted that when the dynamic image decoding apparatus provided in the above embodiments decodes a dynamic image, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the dynamic image decoding apparatus provided in the above embodiments and the dynamic image decoding method embodiments belong to the same concept, and the specific implementation process is detailed in the method embodiments, which will not be repeated here.

FIG. 20 is a schematic block diagram of an encoding and decoding apparatus 2000 used in an embodiment of the present application. The encoding and decoding apparatus 2000 may include a processor 2001, a memory 2002 and a bus system 2003. The processor 2001 and the memory 2002 are connected through the bus system 2003, the memory 2002 is configured to store instructions, and the processor 2001 is configured to execute the instructions stored in the memory 2002, so as to perform the various dynamic image encoding or decoding methods described in the embodiments of this application. To avoid repetition, detailed description is omitted here.

In this embodiment of the present application, the processor 2001 may be a central processing unit (CPU), or may be another general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 2002 may include a ROM device or a RAM device. Any other suitable type of storage device may also be used as memory 2002. Memory 2002 may include code and data 20021 accessed by processor 2001 using bus 2003 . The memory 2002 may further include an operating system 20023 and an application program 20022, where the application program 20022 includes at least one program that allows the processor 2001 to execute the dynamic image encoding or decoding method described in the embodiments of the present application. For example, the application 20022 may include applications 1 to N, which further include moving image encoding or decoding applications (referred to as moving image encoding and decoding applications) that execute the moving image encoding or decoding methods described in the embodiments of the present application.

In addition to the data bus, the bus system 2003 may also include a power bus, a control bus, a status signal bus, and the like. However, for the sake of clarity, the various buses are labeled as bus system 2003 in the figure.

Optionally, the codec apparatus 2000 may also include one or more output devices, such as a display 2004 . In one example, the display 2004 may be a touch-sensitive display that incorporates a display with a touch-sensitive unit operable to sense touch input. Display 2004 may be connected to processor 2001 via bus 2003 .

It should be noted that the encoding and decoding apparatus 2000 may execute the method for encoding a dynamic image in the embodiment of the present application, and may also execute the method for decoding a dynamic image in the embodiment of the present application.

Those skilled in the art will appreciate that the functions described in connection with the various illustrative logical blocks, modules, and algorithm steps described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions described by the various illustrative logical blocks, modules, and steps may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to tangible media, such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (eg, according to a communication protocol) . In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. Data storage media can be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this application. The computer program product may comprise a computer-readable medium.

By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage devices, magnetic disk storage devices or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. As used herein, disks and discs include compact discs (CDs), laser discs, optical discs, DVDs, and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Furthermore, the techniques may be fully implemented in one or more circuits or logic elements. In one example, various illustrative logical blocks, units, and modules in the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logic elements.

The techniques of the present embodiments may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chip set). Various components, modules, or units are described in the embodiments of the present application to emphasize functional aspects of means for performing the disclosed techniques, but they do not necessarily need to be implemented by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or provided by a collection of interoperating hardware units (including one or more processors as described above).

That is to say, the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general purpose computer, a special purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., digital versatile disc (DVD)) or semiconductor media (e.g., solid state disk (SSD)), etc. It should be noted that the computer-readable storage medium mentioned in the embodiments of the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.

It should be understood that "a plurality" herein means two or more. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B can mean A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean that A exists alone, A and B exist at the same time, or B exists alone. In addition, in order to clearly describe the technical solutions of the embodiments of the present application, words such as "first" and "second" are used to distinguish the same or similar items with basically the same function and effect. Those skilled in the art can understand that words such as "first" and "second" do not limit quantity or execution order, and items labeled "first" and "second" are not necessarily different.

The above-mentioned embodiments provided for this application are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included within the protection scope of this application.

Claims

A method for encoding a dynamic image, wherein the method comprises:

Perform semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask, the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions corresponding to the multiple objects one-to-one;

Based on the dynamic images, determine a sequence of motion images, where each frame of the image in the sequence of motion images includes an image area where one or more motion objects in the plurality of objects are located;

determining position indication information based on the image segmentation mask, where the position indication information is used to indicate the position of the image area where the one or more moving objects are located;

The moving image sequence and the position indication information are encoded into a code stream.
The method of claim 1, wherein the moving image sequence includes one or more sub-image sequences, and the position indication information is the image segmentation mask;

The determining a sequence of moving images based on the moving images includes:

Based on the image segmentation mask and the dynamic image, the one or more sub-image sequences are extracted, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects.
The method of claim 1, wherein the moving image sequence includes one or more sub-image sequences, and the location indication information includes coordinates of one or more designated locations;

The determining a sequence of moving images based on the moving images includes:

extracting the one or more sub-image sequences based on the image segmentation mask and the dynamic image, where the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;

The determining the position indication information based on the image segmentation mask includes:

Based on the image segmentation mask, determine the coordinates in the dynamic image of a specified position within an image area where each of the one or more moving objects is located.
The method according to claim 2 or 3, wherein the extracting the one or more sub-image sequences based on the image segmentation mask and the dynamic image comprises:

A moving object is selected from the one or more moving objects, and the sub-image sequence corresponding to the selected moving object is determined according to the following operations, until the sub-image sequence corresponding to each moving object is determined:

determining the location area where the selected moving object is located based on the image segmentation mask;

Based on the location area, an image area where the selected moving object is located is extracted from each frame of the dynamic image except the first frame of image, and a sub-image sequence corresponding to the selected moving object is obtained.
The method according to claim 4, wherein the determining, based on the image segmentation mask, the location area where the selected moving object is located, comprising:

Scan each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, where the pixel coordinate set includes coordinates of a plurality of pixel points;

The location area formed by the pixel coordinate set is determined as the location area where the selected moving object is located.
The method according to claim 4 or 5, wherein the extracting, based on the location area, the image area where the selected moving object is located from each frame of the dynamic image except the first frame image comprises:

extracting an image area located in the location area from each frame of image in the dynamic image except the first frame image;

or,

expanding the location area so that the expanded location area is a square area, and extracting the image area located in the expanded location area from each frame of image in the dynamic image except the first frame image.
The method according to claim 3, wherein the specified position is the position with the smallest coordinates or the position with the largest coordinates.
The method of claim 3, wherein the method further comprises:

The number of the one or more moving objects is encoded into the codestream.
The method of claim 1, wherein the moving image sequence is the moving image, and the position indication information is the image segmentation mask.
The method of claim 9, wherein the method further comprises:

based on the image segmentation mask, determining a plurality of segmentation regions corresponding to the plurality of objects one-to-one;

According to the plurality of divided regions, each frame of image in the dynamic image except the first frame of image is divided into regions to obtain a plurality of image regions;

determining an object state corresponding to each segmented region in the plurality of segmented regions, where the object state includes a static state or a motion state;

The encoding of the moving image sequence into the code stream includes:

encoding the plurality of image regions into a code stream;

The method also includes:

The object state corresponding to each of the plurality of divided regions is encoded into the code stream.
The method of claim 10, wherein the determining, based on the image segmentation mask, a plurality of segmentation regions corresponding to the plurality of objects one-to-one comprises:

determining, based on the image segmentation mask, a location area where each object in the plurality of objects is located;

In the case where the location area where any object in the plurality of objects is located does not contain an integer number of coding tree units (CTUs), the boundary of the location area where that object is located is extended, so that the extended location area contains an integer number of CTUs;

Determine the location regions where the multiple objects are located after the expansion process as the multiple segmented regions.
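The boundary extension can be sketched as rounding a box outward to the CTU grid. This is a minimal example assuming square CTUs of side `ctu_size` laid out from the top-left corner of the frame, with the result clamped to the frame border; the clamping and the function name `align_to_ctu` are assumptions of this sketch.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def align_to_ctu(box: Box, ctu_size: int, width: int, height: int) -> Box:
    """Extend `box` outward so that its edges fall on CTU boundaries."""
    x0, y0, x1, y1 = box
    left = (x0 // ctu_size) * ctu_size   # round the top-left corner down
    top = (y0 // ctu_size) * ctu_size
    # Round the bottom-right corner up to the next CTU edge, clamped to the frame.
    right = min((x1 // ctu_size + 1) * ctu_size, width) - 1
    bottom = min((y1 // ctu_size + 1) * ctu_size, height) - 1
    return left, top, right, bottom
```

For instance, with 64-pixel CTUs in a 256x128 frame, the box (70, 10, 130, 65) becomes (64, 0, 191, 127), which covers two CTU columns and two CTU rows.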
The method of claim 11, wherein the encoding the multiple image regions into the code stream comprises:

Encoding each image area in the plurality of image areas as an encoding block into the code stream respectively;

or,

Encoding the region composed of each row of CTUs in each of the plurality of image regions as a coding block into the code stream;

Wherein, the location area of the reference coding block lies within the location area of the referenced coding block.
The method according to any one of claims 1-12, wherein the method further comprises:

The first frame image of the dynamic image is encoded into the code stream.
A method for decoding a dynamic image, wherein the method comprises:

Parse the first frame image from the code stream;

A moving image sequence and position indication information are parsed from the code stream, each frame of image in the moving image sequence includes an image area where one or more moving objects are located, and the position indication information is used to indicate the position of the image area where the one or more moving objects are located;

Based on the moving image sequence and the position indication information, the image area where the one or more moving objects are located is rendered and displayed in the first frame of image to obtain a dynamic image.
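The composition step can be sketched as pasting each decoded image area onto a copy of the first frame. The example assumes frames are 2-D lists of pixel values and that the top-left position of each area has already been derived from the position indication information; `paste` and `reconstruct` are hypothetical names.

```python
from typing import List, Tuple

Image = List[List[int]]  # image[y][x] is one pixel value


def paste(background: Image, patch: Image, top_left: Tuple[int, int]) -> Image:
    """Return a copy of `background` with `patch` drawn at `top_left` (x, y)."""
    x0, y0 = top_left
    fits_rows = 0 <= y0 and y0 + len(patch) <= len(background)
    fits_cols = 0 <= x0 and all(x0 + len(r) <= len(background[0]) for r in patch)
    if not (fits_rows and fits_cols):
        raise ValueError("patch does not fit inside the background")
    out = [row[:] for row in background]
    for dy, patch_row in enumerate(patch):
        out[y0 + dy][x0:x0 + len(patch_row)] = patch_row
    return out


def reconstruct(first_frame: Image,
                regions_per_frame: List[List[Tuple[Image, Tuple[int, int]]]]) -> List[Image]:
    """Rebuild the dynamic image: the first frame, then one composed frame per time step."""
    frames = [first_frame]
    for regions in regions_per_frame:
        frame = first_frame  # static content is reused from the first frame
        for patch, top_left in regions:
            frame = paste(frame, patch, top_left)
        frames.append(frame)
    return frames
```

Only the moving areas are decoded per frame; everything else is taken from the first frame, which is where the reduction in decoding work comes from.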
The method of claim 14, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;

The position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions corresponding to a plurality of objects one-to-one, and the plurality of objects include the one or more moving objects.
The method according to claim 15, wherein the rendering and displaying, based on the moving image sequence and the position indication information, of the image area where the one or more moving objects are located in the first frame image comprises:

A moving object is selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operations, until the image area where each moving object is located is rendered and displayed:

determining the position of the image region where the selected moving object is located based on the image segmentation mask;

According to the position of the image area where the selected moving object is located, the image area included in the sub-image sequence corresponding to the selected moving object is rendered and displayed in the first frame of image.
The method according to claim 16, wherein the determining, based on the image segmentation mask, the position of the image area where the selected moving object is located comprises:

Scan each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, where the pixel coordinate set includes coordinates of a plurality of pixel points;

Determine the location area formed by the pixel coordinate set as the position of the image area where the selected moving object is located, or expand the location area formed by the pixel coordinate set, so that the expanded location area is a square area, and the expanded location area is determined as the location of the image area where the selected moving object is located.
The method of claim 14, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;

The position indication information includes coordinates in the dynamic image of a specified position within an image area where each of the one or more moving objects is located.
The method according to claim 18, wherein the rendering and displaying, based on the moving image sequence and the position indication information, of the image area where the one or more moving objects are located in the first frame image comprises:

A moving object is selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operations, until the image area where each moving object is located is rendered and displayed:

According to the coordinates in the dynamic image of the specified position in the image area where the selected moving object is located, the image area included in the sub-image sequence corresponding to the selected moving object is rendered and displayed in the first frame image.
The method according to claim 18 or 19, wherein the specified position is the position with the smallest coordinates or the position with the largest coordinates.
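How the specified position maps to a paste location can be sketched as below. The example assumes image coordinates with the origin at the top-left, so that the position with the smallest coordinates is the top-left corner of the image area and the position with the largest coordinates is its bottom-right corner; the function name and the `"min"`/`"max"` labels are hypothetical.

```python
from typing import Tuple


def top_left_from_anchor(anchor: Tuple[int, int],
                         patch_width: int,
                         patch_height: int,
                         kind: str) -> Tuple[int, int]:
    """Convert a signalled anchor position into the top-left corner of the image area."""
    x, y = anchor
    if kind == "min":  # anchor is already the top-left corner
        return x, y
    if kind == "max":  # anchor is the bottom-right corner, inclusive
        return x - patch_width + 1, y - patch_height + 1
    raise ValueError("kind must be 'min' or 'max'")
```

Either convention identifies the same area once the sub-image dimensions are known, so the encoder and decoder only need to agree on which corner is signalled.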
The method according to any one of claims 18-20, wherein the method further comprises:

The number of the one or more moving objects is parsed from the code stream.
The method of claim 14, wherein the moving image sequence is the dynamic image, the position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects.
The method according to claim 22, wherein the rendering and displaying, based on the moving image sequence and the position indication information, of the image area where the one or more moving objects are located in the first frame image comprises:

A moving object is selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operations, until the image area where each moving object is located is rendered and displayed:

determining the position of the image region where the selected moving object is located based on the image segmentation mask;

based on the position of the image area where the selected moving object is located, extracting the image area where the selected moving object is located from each frame of images in the dynamic image except the first frame of image;

According to the position of the image area where the selected moving object is located, the image area where the selected moving object is located in each frame of the dynamic image is rendered and displayed in the first frame of image.
The method of claim 14, wherein the position indication information is an image segmentation mask, the image segmentation mask comprising a plurality of image regions corresponding to a plurality of objects one-to-one, the plurality of objects comprising the one or more moving objects;

The parsing of the moving image sequence from the code stream includes:

based on the image segmentation mask, determining a plurality of segmentation regions corresponding to the plurality of objects one-to-one;

Parse out, from the code stream, an object state corresponding to each of the plurality of divided regions, where the object state includes a static state or a motion state;

Based on the object state corresponding to each of the plurality of divided regions, the image areas delimited by the divided regions whose object state is the motion state are parsed from the code stream to obtain the moving image sequence.
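The state-driven parsing can be sketched as a loop that decodes a region only when its object state is the motion state. This is an illustration under assumptions: `ObjectState`, `decode_moving_regions` and the `decode_region` callback are hypothetical stand-ins, and the per-region payloads are assumed to be separable in the code stream.

```python
from enum import Enum
from typing import Callable, Dict, List


class ObjectState(Enum):
    STATIC = 0
    MOTION = 1


def decode_moving_regions(states: List[ObjectState],
                          payloads: List[bytes],
                          decode_region: Callable[[bytes], bytes]) -> Dict[int, bytes]:
    """Decode only the regions flagged as moving; static regions are skipped."""
    decoded: Dict[int, bytes] = {}
    for index, state in enumerate(states):
        if state is ObjectState.MOTION:
            decoded[index] = decode_region(payloads[index])
    return decoded
```

Regions in the static state are never passed to the decoder callback, so their content can be taken from the first frame instead.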
The method of claim 24, wherein the determining, based on the image segmentation mask, a plurality of segmentation regions corresponding to the plurality of objects one-to-one comprises:

determining, based on the image segmentation mask, a location area where each object in the plurality of objects is located;

In the case where the location area where any one of the multiple objects is located does not contain an integer number of CTUs, the boundary of the location area where that object is located is extended, so that the extended location area contains an integer number of CTUs;

Determine the location regions where the multiple objects are located after the expansion process as the multiple segmented regions.
The method according to any one of claims 14-20 and 22-25, wherein before parsing the moving image sequence and position indication information from the code stream, the method further comprises:

receiving an object selection instruction, the object selection instruction being used to select one or more objects from a plurality of objects included in the dynamic image;

One or more objects selected by the object selection instruction are determined as the one or more moving objects.
The method according to any one of claims 14-26, wherein the method further comprises:

Parse out the encoder type used for encoding from the code stream;

Determine the corresponding decoder type according to the parsed encoder type.
A dynamic image encoding device, characterized in that the device comprises:

The semantic segmentation module is used to perform semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask, the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions in one-to-one correspondence with the multiple objects;

an image sequence extraction module, configured to determine a moving image sequence based on the dynamic image, where each frame of image in the moving image sequence includes an image area where one or more moving objects in the plurality of objects are located;

a location indication information determination module, configured to determine location indication information based on the image segmentation mask, where the location indication information is used to indicate the location of the image area where the one or more moving objects are located;

The first encoding module is used for encoding the moving image sequence and the position indication information into a code stream.
The apparatus of claim 28, wherein the moving image sequence includes one or more sub-image sequences, and the position indication information is the image segmentation mask;

The image sequence extraction module includes:

An image sequence extraction sub-module, configured to extract the one or more sub-image sequences based on the image segmentation mask and the dynamic image, the one or more sub-image sequences being in one-to-one correspondence with the one or more moving objects.
The apparatus of claim 28, wherein the moving image sequence includes one or more sub-image sequences, and the location indication information includes coordinates of one or more designated locations;

The image sequence extraction module includes:

An image sequence extraction sub-module, configured to extract the one or more sub-image sequences based on the image segmentation mask and the dynamic image, the one or more sub-image sequences being in one-to-one correspondence with the one or more moving objects;

The location indication information determination module includes:

A position coordinate determination sub-module is configured to determine, based on the image segmentation mask, the coordinates in the dynamic image of a specified position within the image area where each of the one or more moving objects is located.
The apparatus according to claim 29 or 30, wherein the image sequence extraction submodule comprises:

The selection sub-module is used to select a moving object from the one or more moving objects, and the sub-image sequence corresponding to the selected moving object is determined by the following modules, until the sub-image sequence corresponding to each moving object is determined:

a location area determination sub-module for determining the location area where the selected moving object is located based on the image segmentation mask;

The image area extraction sub-module is used to extract, based on the location area, the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame image, to obtain the sub-image sequence corresponding to the selected moving object.
The apparatus of claim 31, wherein the location area determination submodule is specifically used for:

Scan each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, where the pixel coordinate set includes coordinates of a plurality of pixel points;

The location area formed by the pixel coordinate set is determined as the location area where the selected moving object is located.
The device according to claim 31 or 32, wherein the image region extraction submodule is specifically used for:

extracting an image area located in the location area from each frame of image in the dynamic image except the first frame image;

or,

expanding the location area so that the expanded location area is a square area, and extracting the image area located in the expanded location area from each frame of image in the dynamic image except the first frame image.
The device according to claim 30, wherein the designated position is the position with the smallest coordinates or the position with the largest coordinates.
The apparatus of claim 30, wherein the apparatus further comprises:

The second encoding module is configured to encode the number of the one or more moving objects into the code stream.
The apparatus of claim 28, wherein the moving image sequence is the dynamic image, and the position indication information is the image segmentation mask.
The apparatus of claim 36, wherein the apparatus further comprises:

A segmented region determination module, configured to determine a plurality of segmented regions corresponding to the multiple objects one-to-one based on the image segmentation mask;

an area division module, configured to perform area division on each frame of image in the dynamic image except the first frame image according to the plurality of divided areas to obtain a plurality of image areas;

an object state determination module, configured to determine an object state corresponding to each segmented region in the plurality of segmented regions, where the object state includes a static state or a motion state;

The first encoding module includes:

an image region encoding submodule, used for encoding the multiple image regions into a code stream;

The device also includes:

The third encoding module is configured to encode the object state corresponding to each of the plurality of divided regions into the code stream.
The apparatus of claim 37, wherein the segmented region determination module is specifically configured to:

determining, based on the image segmentation mask, a location area where each object in the plurality of objects is located;

In the case where the location area where any object in the plurality of objects is located does not contain an integer number of coding tree units (CTUs), the boundary of the location area where that object is located is extended, so that the extended location area contains an integer number of CTUs;

Determine the location regions where the multiple objects are located after the expansion process as the multiple segmented regions.
The apparatus of claim 38, wherein the image region coding submodule is specifically used for:

Encoding each image area in the plurality of image areas as an encoding block into the code stream respectively;

or,

Encoding the region composed of each row of CTUs in each of the plurality of image regions as a coding block into the code stream;

Wherein, the location area of the reference coding block lies within the location area of the referenced coding block.
The device according to any one of claims 28-39, wherein the device further comprises:

The fourth encoding module is used for encoding the first frame of the dynamic image into the code stream.
A dynamic image decoding device, characterized in that the device comprises:

The image decoding module is used to parse the first frame image from the code stream;

a first decoding module, configured to parse out a moving image sequence and position indication information from the code stream, each frame of image in the moving image sequence includes an image area where one or more moving objects are located, and the position indication information is used to indicate the location of the image area in which the one or more moving objects are located;

An image synthesis module, configured to render and display the image area where the one or more moving objects are located in the first frame of image based on the moving image sequence and the position indication information, to obtain a dynamic image.
The apparatus of claim 41, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences correspond to the one or more moving objects one-to-one;

The position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions corresponding to a plurality of objects one-to-one, and the plurality of objects include the one or more moving objects.
The apparatus of claim 42, wherein the image synthesis module comprises:

The selection sub-module is used to select a moving object from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed by the following modules, until the image area where each moving object is located has been rendered and displayed:

a position determination submodule for determining the position of the image area where the selected moving object is located based on the image segmentation mask;

A rendering and display sub-module, configured to render and display the image area included in the sub-image sequence corresponding to the selected moving object in the first frame of image according to the position of the image area where the selected moving object is located.
The apparatus of claim 43, wherein the position determination submodule is specifically used for:

Scan each pixel in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, where the pixel coordinate set includes coordinates of a plurality of pixel points;

Determine the location area formed by the pixel coordinate set as the position of the image area where the selected moving object is located, or expand the location area formed by the pixel coordinate set, so that the expanded location area is a square area, and the expanded location area is determined as the location of the image area where the selected moving object is located.
The apparatus of claim 41, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences correspond to the one or more moving objects one-to-one;

The position indication information includes coordinates in the dynamic image of a specified position within an image area where each of the one or more moving objects is located.
The apparatus of claim 45, wherein the image synthesis module is specifically configured to:

A moving object is selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operations, until the image area where each moving object is located is rendered and displayed:

According to the coordinates in the dynamic image of the specified position in the image area where the selected moving object is located, the image area included in the sub-image sequence corresponding to the selected moving object is rendered and displayed in the first frame image.
The device according to claim 45 or 46, wherein the designated position is the position with the smallest coordinates or the position with the largest coordinates.
The device according to any one of claims 45-47, wherein the device further comprises:

The second decoding module is configured to parse out the quantity of the one or more moving objects from the code stream.
The apparatus of claim 41, wherein the moving image sequence is the dynamic image, the position indication information is an image segmentation mask, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects.
The apparatus of claim 49, wherein the image synthesis module is specifically used for:

A moving object is selected from the one or more moving objects, and the image area where the selected moving object is located is rendered and displayed according to the following operations, until the image area where each moving object is located is rendered and displayed:

determining the position of the image region where the selected moving object is located based on the image segmentation mask;

based on the position of the image area where the selected moving object is located, extracting the image area where the selected moving object is located from each frame of images in the dynamic image except the first frame of image;

According to the position of the image area where the selected moving object is located, the image area where the selected moving object is located in each frame of the dynamic image is rendered and displayed in the first frame of image.
The apparatus of claim 41, wherein the position indication information is an image segmentation mask, the image segmentation mask comprising a plurality of image regions corresponding to a plurality of objects one-to-one, the plurality of objects comprising the one or more moving objects;

The first decoding module includes:

a segmentation area determination submodule, configured to determine a plurality of segmentation areas corresponding to the multiple objects one-to-one based on the image segmentation mask;

an object state determination submodule, configured to parse out the object state corresponding to each of the plurality of divided regions from the code stream, and the object state includes a static state or a motion state;

An image area decoding sub-module, configured to parse, from the code stream and based on the object state corresponding to each of the plurality of divided regions, the image areas delimited by the divided regions whose object state is the motion state, to obtain the moving image sequence.
The apparatus according to claim 51, wherein the segmentation area determination submodule is specifically configured to:

determining, based on the image segmentation mask, a location area where each object in the plurality of objects is located;

In the case where the location area where any one of the multiple objects is located does not contain an integer number of CTUs, the boundary of the location area where that object is located is extended, so that the extended location area contains an integer number of CTUs;

Determine the location regions where the multiple objects are located after the expansion process as the multiple segmented regions.
The device according to any one of claims 41-47 and 49-52, wherein the device further comprises:

an instruction receiving module, configured to receive an object selection instruction, where the object selection instruction is used to select one or more objects from a plurality of objects included in the dynamic image;

A moving object determination module, configured to determine one or more objects selected by the object selection instruction as the one or more moving objects.
The device according to any one of claims 41-53, wherein the device further comprises:

A third decoding module, configured to parse out the encoder type used for encoding from the code stream;

The decoder type determination module is used to determine the corresponding decoder type according to the parsed encoder type.
A coding end device, characterized in that the coding end device comprises a memory and a processor;

The memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so as to realize the encoding method of a dynamic image according to any one of claims 1-13.
A decoding end device, characterized in that the decoding end device includes a memory and a processor;

The memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so as to realize the decoding method of a dynamic image according to any one of claims 14-27.
A computer-readable storage medium, wherein instructions are stored in the storage medium, and when the instructions are executed on a computer, the computer is caused to execute the steps of the method of any one of claims 1-27.