CN115225901A - Method, device, equipment and storage medium for encoding and decoding dynamic image - Google Patents

Method, device, equipment and storage medium for encoding and decoding dynamic image

Info

Publication number
CN115225901A
Authority
CN
China
Prior art keywords
image
area
moving
objects
frame
Prior art date
Legal status
Pending
Application number
CN202110421196.1A
Other languages
Chinese (zh)
Inventor
闫宁
陈焕浜
李照洋
马飞龙
宋星光
周建同
杨海涛
李江
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110421196.1A priority Critical patent/CN115225901A/en
Priority to PCT/CN2022/086880 priority patent/WO2022222842A1/en
Publication of CN115225901A publication Critical patent/CN115225901A/en
Pending legal-status Critical Current


Classifications

    • H: ELECTRICITY
        • H04: ELECTRIC COMMUNICATION TECHNIQUE
            • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
                    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
                        • H04N19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
                            • H04N19/167: Position within a video image, e.g. region of interest [ROI]
                        • H04N19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
                    • H04N19/20: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
                • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
                    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
                        • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                            • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments of this application disclose a method, an apparatus, a device, and a storage medium for encoding and decoding a dynamic image, belonging to the field of encoding and decoding technologies. In the method, semantic segmentation is performed on any frame of the dynamic image to obtain an image segmentation mask; the dynamic image includes a plurality of objects, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects. Based on the dynamic image, a moving image sequence is determined, where each frame of the moving image sequence includes the image areas where one or more moving objects among the plurality of objects are located. Based on the image segmentation mask, position indication information is determined, which indicates the positions of the image areas where the one or more moving objects are located. The moving image sequence and the position indication information are encoded into a code stream. The embodiments of this application improve coding efficiency and effectively reduce decoding complexity and power consumption.

Description

Method, device, equipment and storage medium for encoding and decoding dynamic image
Technical Field
The present invention relates to the field of encoding and decoding technologies, and in particular, to a method, an apparatus, a device, and a storage medium for encoding and decoding a dynamic image.
Background
A dynamic image is a media format between a still image and a video: a group of still images switched at a designated frequency to produce a dynamic effect. Compared with a still image, a dynamic image has multiple frames, and temporal correlation exists between those frames. Compared with video, a dynamic image has weaker inter-frame correlation and no fixed frame rate.
Accordingly, current dynamic image codecs are lightweight and low-power compared with video codecs, and do not use streaming transmission. The most widely used coding format for dynamic images today is the graphics interchange format (GIF), but its coding efficiency is low and its coded image quality is poor, so it is increasingly unable to meet the application requirements of today's high-resolution dynamic images.
Disclosure of Invention
The embodiments of this application provide a method, an apparatus, a device, and a storage medium for encoding and decoding a dynamic image, which can improve coding efficiency and reduce decoding complexity and power consumption. The technical solution is as follows:
In a first aspect, a method for encoding a dynamic image is provided. In the method, semantic segmentation is performed on any frame of the dynamic image to obtain an image segmentation mask; the dynamic image includes a plurality of objects, and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects. Based on the dynamic image, a moving image sequence is determined, where each frame of the moving image sequence includes the image areas where one or more moving objects among the plurality of objects are located. Based on the image segmentation mask, position indication information is determined, which indicates the positions of the image areas where the one or more moving objects are located. The moving image sequence and the position indication information are then encoded into a code stream.
In a dynamic image, only the image areas where moving objects are located change; the image areas where static objects are located do not. Each frame of the moving image sequence includes the image areas where the one or more moving objects are located, and the position indication information indicates the positions of those areas, so encoding only the moving image sequence and the position indication information into the code stream is sufficient for subsequent decoding of the dynamic image. The image areas where static objects are located need not be encoded, which improves coding efficiency.
Because the position area occupied by each object in the dynamic image is essentially unchanged while only the objects themselves vary, semantic segmentation can be performed on any frame of the dynamic image to obtain the image segmentation mask. Typically, semantic segmentation is performed on the first frame of the dynamic image.
In addition, since the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects, the regions are usually distinguished by pixel value so that objects can be told apart: image regions corresponding to different objects are represented by different pixel values, and the image region corresponding to one object is represented by a single pixel value.
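As an illustration only (the patent does not prescribe a concrete data layout), such a mask can be pictured as a label image. Below is a minimal Python/NumPy sketch; the scene, label values, and object names are assumptions:

```python
import numpy as np

# Hypothetical 6x8 segmentation mask for a scene with three objects.
# All pixels of one object share a single label value, and different
# objects use different values (here 0 = sky, 1 = lawn, 2 = user).
mask = np.zeros((6, 8), dtype=np.uint8)  # sky everywhere by default
mask[4:, :] = 1                          # lawn occupies the bottom rows
mask[1:5, 3:5] = 2                       # user occupies a central block

print(np.unique(mask))                   # [0 1 2] -> one value per object
```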
It should be noted that each object in the dynamic image is a single individual in the scene. For example, if the dynamic image includes a user, a lawn, a hill, a river, and a sky, then the plurality of objects in the dynamic image are the user, the lawn, the hill, the river, and the sky.
The plurality of objects in the dynamic image are generally divided into moving objects and static objects. A moving object is an object that itself changes and may be called an object in a moving state; for example, the water of the river varies, and the user's facial features or limbs vary, so the river and the user are moving objects. A static object is an object with no variation and may be called an object in a static state; the lawn, the hill, and the sky do not vary, so they are static objects.
It should be noted that the moving image sequence may consist of one or more sub-image sequences in one-to-one correspondence with the one or more moving objects, or it may be the dynamic image itself. The position indication information may be the image segmentation mask, or it may be the coordinates, in the dynamic image, of a specified position of the image area where each of the one or more moving objects is located. The following description is therefore divided into several cases.
In the first case, the moving image sequence includes one or more sub-image sequences, and the position indication information is the image segmentation mask.
In this case, determining the moving image sequence based on the dynamic image is implemented as follows: one or more sub-image sequences, in one-to-one correspondence with the one or more moving objects, are extracted based on the image segmentation mask and the dynamic image.
The sub-image sequence of each moving object is extracted in the same way, so one moving object can be selected from the one or more moving objects and its sub-image sequence determined by the following operations, repeated until the sub-image sequence of every moving object has been determined: determine the position area where the selected moving object is located based on the image segmentation mask, and, based on that position area, extract the image area where the selected moving object is located from each frame of the dynamic image except the first frame, obtaining the sub-image sequence corresponding to the selected moving object.
The image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects; that is, the image region of every object has already been delimited, and, as described above, regions of the same object are represented by the same pixel value while regions of different objects are represented by different pixel values. Determining the position area where the selected moving object is located based on the image segmentation mask is therefore implemented as follows: scan every pixel in the image segmentation mask to obtain the pixel coordinate set corresponding to the selected moving object, the set containing the coordinates of a plurality of pixels, and determine the position area formed by this pixel coordinate set as the position area where the selected moving object is located.
That is, by scanning every pixel in the image segmentation mask, the pixels whose value equals the pixel value corresponding to the selected moving object are found, and their coordinates are taken as the object's pixel coordinate set, which determines the position area where the selected moving object is located. This position area is where the object actually lies, and its boundary is the object's contour.
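A minimal sketch of this scan, reusing the NumPy label-image representation assumed above (the function name is illustrative, not from the patent):

```python
import numpy as np

def pixel_coordinate_set(mask: np.ndarray, label: int) -> np.ndarray:
    """Scan every pixel of the segmentation mask and collect the
    coordinates of those whose value equals the selected moving
    object's label. Returns an (N, 2) array of (row, column) pairs;
    the area they form is the object's position area."""
    rows, cols = np.nonzero(mask == label)
    return np.stack([rows, cols], axis=1)
```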
In general, the area formed by the contour of a moving object is irregular; that is, the position area where the moving object is located is not a regular area. The image area within this position area can therefore be extracted directly from each frame of the dynamic image except the first frame. Alternatively, in other embodiments, the position area can first be processed into a regular area, and the image area within the regular area then extracted from each frame except the first.
That is, based on the position area of the selected moving object, extracting the image area where it is located from each frame except the first is implemented in one of two ways: extract, from each frame of the dynamic image except the first, the image area lying within the position area where the selected moving object is located; or expand the position area where the selected moving object is located so that the expanded area is rectangular, and extract, from each frame except the first, the image area lying within the expanded position area.
There are various ways to expand the position area where a moving object is located. For example, determine the minimum abscissa, minimum ordinate, maximum abscissa, and maximum ordinate in the pixel coordinate set corresponding to the selected moving object, take the rectangular area whose abscissas lie between the minimum and maximum abscissa and whose ordinates lie between the minimum and maximum ordinate, and determine this rectangle as the expanded position area. Alternatively, draw the circumscribed rectangle of the position area directly and determine it as the expanded position area.
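A sketch of this expansion under the same assumed representation (bounds are inclusive; names are illustrative):

```python
import numpy as np

def expanded_position_area(coords: np.ndarray) -> tuple:
    """Expand the irregular position area, given as its pixel coordinate
    set, into the axis-aligned rectangle delimited by the minimum and
    maximum coordinates (i.e. the circumscribed rectangle)."""
    r0, c0 = coords.min(axis=0)
    r1, c1 = coords.max(axis=0)
    return int(r0), int(c0), int(r1), int(c1)

def extract_image_area(frame: np.ndarray, area: tuple) -> np.ndarray:
    """Cut the expanded position area out of one frame; applying this to
    every frame except the first yields the sub-image sequence of the
    selected moving object."""
    r0, c0, r1, c1 = area
    return frame[r0:r1 + 1, c0:c1 + 1]
```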
In the second case, the moving image sequence includes one or more sub-image sequences, and the position indication information includes the coordinates of one or more specified positions.
In this case, the moving image sequence is determined as in the first case: one or more sub-image sequences, in one-to-one correspondence with the one or more moving objects, are extracted based on the image segmentation mask and the dynamic image. The position indication information is then determined as follows: based on the image segmentation mask, determine the coordinates, in the dynamic image, of a specified position within the image region where each of the one or more moving objects is located.
For the rest, the second case follows the related description of the first case, which is not repeated in this embodiment of the application.
The specified position within the image region where a moving object is located may be the position with the minimum coordinates, the position with the maximum coordinates, or the geometric center point. Other positions may of course be used, which is not limited in the embodiments of this application.
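The three candidate positions named above can be computed from the pixel coordinate set of the earlier sketches; a hedged illustration:

```python
import numpy as np

def specified_position(coords: np.ndarray, kind: str = "min") -> tuple:
    """Candidate 'specified positions' of a moving object's image
    region: the minimum coordinates, the maximum coordinates, or the
    geometric center point of its pixel coordinate set."""
    if kind == "min":
        return tuple(int(v) for v in coords.min(axis=0))
    if kind == "max":
        return tuple(int(v) for v in coords.max(axis=0))
    if kind == "center":
        return tuple(int(v) for v in coords.mean(axis=0).round())
    raise ValueError(f"unknown kind: {kind}")
```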
Optionally, in the second case, the number of the one or more moving objects may also be encoded into the code stream. The decoding end can then determine, based on this number, whether any of the one or more sub-image sequences failed to be transmitted, ensuring the reliability of dynamic image decoding.
In the third case, the moving image sequence is the dynamic image itself, and the position indication information is the image segmentation mask.
In the fourth case, the moving image sequence is likewise the dynamic image itself and the position indication information is the image segmentation mask, but the method further includes: determining a plurality of segmentation areas in one-to-one correspondence with the plurality of objects based on the image segmentation mask, and dividing each frame of the dynamic image except the first frame according to these segmentation areas to obtain a plurality of image areas; and determining the object state corresponding to each segmentation area, the state being either static or moving. Encoding the moving image sequence into the code stream is then implemented by encoding the plurality of image areas into the code stream, and the method further includes encoding the object state corresponding to each segmentation area into the code stream.
Determining the plurality of segmentation areas in one-to-one correspondence with the plurality of objects based on the image segmentation mask is implemented as follows: based on the image segmentation mask, determine the position area where each of the plurality of objects is located; when the position area of any object does not cover an integer number of coding tree units (CTUs), extend the boundary of that position area so that it does; and determine the extended position areas of the plurality of objects as the plurality of segmentation areas.
That is, after the extension, the position area of every object covers an integer number of CTUs, and the extended position areas are determined as the segmentation areas; each of the plurality of segmentation areas therefore includes an integer number of CTUs.
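A minimal sketch of this boundary extension, assuming a CTU size of 64 (the patent does not fix a value) and inclusive rectangle bounds:

```python
CTU_SIZE = 64  # assumed CTU edge length; illustrative only

def align_area_to_ctus(r0: int, c0: int, r1: int, c1: int) -> tuple:
    """Extend a position area outward so that it covers an integer
    number of CTUs in both dimensions (clamping to the image boundary
    is omitted for brevity)."""
    r0 = (r0 // CTU_SIZE) * CTU_SIZE
    c0 = (c0 // CTU_SIZE) * CTU_SIZE
    r1 = (r1 // CTU_SIZE + 1) * CTU_SIZE - 1  # inclusive upper bound
    c1 = (c1 // CTU_SIZE + 1) * CTU_SIZE - 1
    return r0, c0, r1, c1
```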
In this case, encoding the plurality of image areas into the code stream is implemented as follows: encode each of the plurality of image areas into the code stream as a coding block, or encode the area formed by each row of CTUs in each image area into the code stream as a coding block, subject to the constraint that the position area of the referenced coding block lies within the position area of the coding block that references it.
Because each image area includes an integer number of CTUs, the whole image area can be independently encoded into the code stream as one coding block (a tile), or the area formed by each row of CTUs in the image area can be independently encoded as one coding block (a slice), so that decoding can later be performed independently.
In addition, decoding a given coding block may require referring to a coding block in some frame preceding the current frame; that is, decoding a coding block in the current frame depends on a coding block in a reference frame. For decoding to succeed, it must therefore be guaranteed that the position area of the coding block in the reference frame lies within the position area of the coding block in the current frame, so that the current coding block can be decoded on the basis of the reference coding block.
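The constraint can be stated as a simple containment check; a sketch using the inclusive rectangle convention of the earlier sketches:

```python
def reference_area_allowed(ref_area: tuple, cur_area: tuple) -> bool:
    """Return True when the position area of the coding block in the
    reference frame lies within the position area of the coding block
    in the current frame, so the current block remains independently
    decodable. Areas are inclusive (r0, c0, r1, c1) rectangles."""
    return (ref_area[0] >= cur_area[0] and ref_area[1] >= cur_area[1]
            and ref_area[2] <= cur_area[2] and ref_area[3] <= cur_area[3])
```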
For the above four cases, the first frame of the dynamic image may be encoded into the code stream.
In a second aspect, a method for decoding a dynamic image is provided. In the method, a first frame image is parsed from a code stream, and a moving image sequence and position indication information are also parsed from the code stream, where each frame of the moving image sequence includes the image areas where one or more moving objects are located and the position indication information indicates the positions of those image areas. Based on the moving image sequence and the position indication information, the image areas where the one or more moving objects are located are rendered and displayed on the first frame image to obtain the dynamic image.
That is, when decoding a dynamic image, once the first frame has been decoded, only the image areas where moving objects are located need to be decoded in subsequent frames; the image areas where static objects are located need not be decoded, which effectively reduces decoding complexity and power consumption. Moreover, during display, only the image areas where the moving objects are located are rendered and refreshed on top of the first frame image, which effectively reduces display power consumption.
As on the encoding side, the moving image sequence may consist of one or more sub-image sequences in one-to-one correspondence with the one or more moving objects, or it may be the dynamic image itself, and the position indication information may be the image segmentation mask or the coordinates, in the dynamic image, of a specified position of the image area where each moving object is located. The description is again divided into several cases.
In the first case, the moving image sequence includes one or more sub-image sequences in one-to-one correspondence with the one or more moving objects, and the position indication information is an image segmentation mask that includes a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects.
In this case, rendering and displaying the image areas where the one or more moving objects are located on the first frame image is implemented as follows: select one moving object from the one or more moving objects, and render and display the image area where the selected object is located by the following operations, repeated until the image area of every moving object has been rendered and displayed: determine the position of the image region where the selected moving object is located based on the image segmentation mask, and, according to that position, render and display on the first frame image the image areas included in the sub-image sequence corresponding to the selected moving object.
As described above, regions of the same object in the image segmentation mask are represented by the same pixel value and regions of different objects by different pixel values, so determining the position of the image region where the selected moving object is located based on the image segmentation mask is implemented as follows: scan every pixel in the image segmentation mask to obtain the pixel coordinate set corresponding to the selected moving object, the set containing the coordinates of a plurality of pixels. Then either determine the position area formed by this coordinate set as the position of the image region where the selected moving object is located, or expand that position area so that the expanded area is rectangular and determine the expanded area as the position of the image region where the selected moving object is located.
That is, by scanning every pixel in the image segmentation mask, the pixels whose value equals the pixel value corresponding to the selected moving object are found, their coordinates are taken as the object's pixel coordinate set, and the position of the image area where the selected moving object is located in the dynamic image can then be determined.
In general, the area formed by the contour of a moving object is irregular; that is, the position area formed by the pixel coordinate set is not a regular area. In some embodiments, this position area is directly determined as the position of the image area where the moving object is located in the dynamic image; in other embodiments, it is first processed into a regular area, and the position of the regular area is determined as that position instead.
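On the decoding side the operation is a simple composite: each frame of a moving object's sub-image sequence is pasted onto a copy of the first frame at the signalled position. A hedged sketch, again assuming NumPy frames and inclusive rectangle bounds:

```python
import numpy as np

def render_moving_area(first_frame: np.ndarray, sub_image: np.ndarray,
                       area: tuple) -> np.ndarray:
    """Paste one frame of a moving object's sub-image sequence onto a
    copy of the decoded first frame at the signalled position.
    Repeating this for every moving object at each time step reproduces
    the dynamic effect without re-decoding the static areas."""
    r0, c0, r1, c1 = area
    out = first_frame.copy()
    out[r0:r1 + 1, c0:c1 + 1] = sub_image
    return out
```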
In the second case, the moving image sequence includes one or more sub-image sequences in one-to-one correspondence with the one or more moving objects, and the position indication information includes the coordinates, in the dynamic image, of a specified position within the image area where each of the one or more moving objects is located.
In this case, rendering and display are implemented as follows: select one moving object from the one or more moving objects, and render and display the image area where the selected object is located by the following operation, repeated until the image area of every moving object has been rendered and displayed: according to the coordinates, in the dynamic image, of the specified position of the image area where the selected moving object is located, render and display on the first frame image the image areas included in the sub-image sequence corresponding to the selected moving object.
Since the encoding end directly encodes into the code stream the coordinates of the specified position of the image area where each moving object is located, the decoding end can, after parsing the coordinates corresponding to the selected moving object from the code stream, directly render and display the image areas of the corresponding sub-image sequence on the first frame image, which improves the image reconstruction speed.
The specified position within the image region where a moving object is located may be the position with the minimum coordinates, the position with the maximum coordinates, or the geometric center point. Other positions may of course be used, which is not limited in the embodiments of this application.
If the encoding end has encoded the number of the one or more moving objects into the code stream, the decoding end can also parse this number from the code stream. By comparing the number of moving objects with the number of sub-image sequences, it can determine whether any sub-image sequence failed to be transmitted, which improves the reliability of dynamic image decoding.
In the third case, the moving image sequence is the dynamic image itself, and the position indication information is an image segmentation mask that includes a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects.
In this case, rendering and display are implemented as follows: select one moving object from the one or more moving objects, and render and display the image area where the selected object is located by the following operations, repeated until the image area of every moving object has been rendered and displayed: determine the position of the image area where the selected moving object is located based on the image segmentation mask; based on that position, extract the image area where the selected moving object is located from each frame of the dynamic image except the first frame; and, according to that position, render and display the extracted image areas on the first frame image.
In the fourth case, the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects. Parsing the moving image sequence from the code stream is then implemented as follows: determine a plurality of segmentation areas in one-to-one correspondence with the plurality of objects based on the image segmentation mask, and parse from the code stream the object state corresponding to each segmentation area, the state being either static or moving; then, based on the object state of each segmentation area, parse from the code stream the image areas delimited by the segmentation areas whose state is moving, obtaining the moving image sequence.
The plurality of segmentation areas are determined as on the encoding side: based on the image segmentation mask, determine the position area where each of the plurality of objects is located; when the position area of any object does not cover an integer number of CTUs, extend its boundary so that it does; and determine the extended position areas as the plurality of segmentation areas.
That is, after the extension, the position area of every object covers an integer number of CTUs, and the extended position areas are determined as the segmentation areas; each of the plurality of segmentation areas therefore includes an integer number of CTUs.
For the above four cases, the one or more moving objects may be all of the moving objects among the plurality of objects included in the dynamic image, or only some of them. That is, for the moving objects in a dynamic image, the decoding end may treat all of them as being in a moving state, or may further select a subset.
Specifically, an object selection instruction is received, the instruction being used to select one or more objects from the plurality of objects included in the dynamic image, and the objects selected by the instruction are determined as the one or more moving objects in the above steps.
The object selection instruction may be triggered by a user on the basis of the first frame image. For example, all moving objects in the dynamic image are marked in the first frame image, the user selects some or all of them, and the selected objects are the one or more moving objects in the above steps.
In addition, the type of encoder used at the encoding end may be agreed in advance between the encoding end and the decoding end, or may be selected by a user at the encoding end. In the latter case, the encoding end must also encode the encoder type into the code stream. The decoding end then parses the encoder type from the code stream, determines the corresponding decoder type, and parses the image or image sequence from the code stream according to the determined decoder type.
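A minimal sketch of such a lookup; the codec identifiers are invented for illustration and are not taken from the patent:

```python
# Hypothetical mapping from the encoder type signalled in the code
# stream to the decoder type the receiving side should instantiate.
DECODER_FOR_ENCODER = {
    "hevc": "hevc_decoder",
    "avs3": "avs3_decoder",
    "av1": "av1_decoder",
}

def select_decoder(encoder_type: str) -> str:
    """Resolve the decoder type matching the encoder type parsed from
    the code stream."""
    if encoder_type not in DECODER_FOR_ENCODER:
        raise ValueError(f"unsupported encoder type: {encoder_type}")
    return DECODER_FOR_ENCODER[encoder_type]
```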
In a third aspect, an apparatus for encoding a dynamic image is provided, the apparatus having the function of realizing the behavior of the encoding method of the first aspect. The encoding apparatus includes at least one module configured to implement the method for encoding a dynamic image provided in the first aspect.
In a fourth aspect, an apparatus for decoding a dynamic image is provided, the apparatus having the function of realizing the behavior of the decoding method of the second aspect. The decoding apparatus includes at least one module configured to implement the method for decoding a dynamic image provided in the second aspect.
In a fifth aspect, an encoding-side device is provided, which includes a processor and a memory, the memory storing a program for executing the encoding method provided in the first aspect, and the processor being configured to execute the program stored in the memory to implement that method.
Optionally, the encoding-side device may further include a communication bus used to establish a connection between the processor and the memory.
In a sixth aspect, a decoding-side device is provided, which includes a processor and a memory, the memory storing a program for executing the decoding method provided in the second aspect, and the processor being configured to execute the program stored in the memory to implement that method.
Optionally, the decoding-side device may further include a communication bus used to establish a connection between the processor and the memory.
In a seventh aspect, a computer-readable storage medium is provided, in which instructions are stored; when the instructions run on a computer, the computer is caused to perform the steps of the encoding method of the first aspect or the decoding method of the second aspect.
In an eighth aspect, a computer program product containing instructions is provided; when the instructions run on a computer, the computer is caused to perform the steps of the encoding method of the first aspect or the decoding method of the second aspect.
The technical effects of the third, fourth, fifth, sixth, seventh, and eighth aspects are similar to those achieved by the corresponding technical means of the first or second aspect and are not repeated here.
The technical solutions provided in the embodiments of this application bring at least the following beneficial effects:
In a dynamic image, only the image areas where moving objects are located change; the image areas where static objects are located do not. Each frame of the moving image sequence includes the image areas where the one or more moving objects are located, and the position indication information indicates the positions of those areas, so encoding only the moving image sequence and the position indication information into the code stream is sufficient for subsequent decoding of the dynamic image. The image areas where static objects are located need not be encoded, which improves coding efficiency.
Drawings
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of an exemplary implementation environment provided by an embodiment of the present application;
Fig. 3 is a schematic structural block diagram of an encoder provided by an embodiment of the present application;
Fig. 4 is a schematic structural block diagram of a decoder provided by an embodiment of the present application;
Fig. 5 is a flowchart of a first method for encoding a dynamic image provided by an embodiment of the present application;
Fig. 6 is a flowchart of a first method for decoding a dynamic image provided by an embodiment of the present application;
Fig. 7 is a block diagram of a first exemplary encoding and decoding method provided by an embodiment of the present application;
Fig. 8 is a block diagram of a second exemplary encoding and decoding method provided by an embodiment of the present application;
Fig. 9 is a flowchart of a second method for encoding a dynamic image provided by an embodiment of the present application;
Fig. 10 is a flowchart of a second method for decoding a dynamic image provided by an embodiment of the present application;
Fig. 11 is a block diagram of a third exemplary encoding and decoding method provided by an embodiment of the present application;
Fig. 12 is a block diagram of a fourth exemplary encoding and decoding method provided by an embodiment of the present application;
Fig. 13 is a flowchart of a third method for encoding a dynamic image provided by an embodiment of the present application;
Fig. 14 is a flowchart of a third method for decoding a dynamic image provided by an embodiment of the present application;
Fig. 15 is a flowchart of a fourth method for encoding a dynamic image provided by an embodiment of the present application;
Fig. 16 is a flowchart of a fourth method for decoding a dynamic image provided by an embodiment of the present application;
Fig. 17 is a block diagram of a fifth exemplary encoding and decoding method provided by an embodiment of the present application;
Fig. 18 is a schematic structural diagram of an apparatus for encoding a dynamic image provided by an embodiment of the present application;
Fig. 19 is a schematic structural diagram of an apparatus for decoding a dynamic image provided by an embodiment of the present application;
Fig. 20 is a schematic block diagram of an encoding and decoding device provided by an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described below in further detail with reference to the accompanying drawings.
Before the method for encoding and decoding a dynamic image provided in the embodiments of this application is explained in detail, the terms and implementation environments involved are introduced.
For ease of understanding, the terms involved in the embodiments of this application are explained first.
Encoding: the process of compressing an image to be encoded into a code stream. Encoding is mainly divided into image encoding and video encoding: image encoding compresses a still image to be encoded into a code stream, and video encoding compresses the image sequence of a video to be encoded into a code stream.
A dynamic image is a group of still images switched at a designated frequency to produce a dynamic effect; in the embodiments of this application, the encoding of a dynamic image is divided into the encoding of still images and the encoding of video.
Note that a still image compressed into a code stream may be called an encoded still image, and a video compressed into a code stream may be called an encoded video. Similarly, a dynamic image compressed into a code stream may be called an encoded dynamic image.
Decoding: the process of restoring an encoded code stream into a reconstructed image according to specific syntax rules and processing methods. Decoding is mainly divided into decoding of an image code stream, which restores the image code stream into a reconstructed image, and decoding of a video code stream, which restores the video code stream into a reconstructed video.
Sub-image sequence: a sequence of image regions extracted from each frame of an image sequence.
Coding block: a coding region obtained by dividing an image to be encoded. One frame of image may be divided into several coding blocks, which together constitute the frame, and each coding block can be encoded independently.
A coding block may take the form of a tile or a slice: a tile includes at least one coding tree unit (CTU) and may also be composed of slices, and a slice includes multiple CTUs.
Next, an implementation environment related to the embodiments of the present application will be described.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an implementation environment according to an embodiment of the present application. The implementation environment includes a source device 10, a destination device 20, a link 30, and a storage device 40. The source device 10 can generate an encoded dynamic image and may therefore also be called a dynamic image encoding device. The destination device 20 can decode the encoded dynamic image generated by the source device 10 and may therefore also be called a dynamic image decoding device. The link 30 can receive the encoded dynamic image generated by the source device 10 and transmit it to the destination device 20. The storage device 40 can receive and store the encoded dynamic image generated by the source device 10, in which case the destination device 20 can retrieve the encoded dynamic image directly from the storage device 40. Alternatively, the storage device 40 may correspond to a file server or another intermediate storage device that holds the encoded dynamic image generated by the source device 10, in which case the destination device 20 can stream or download the stored encoded dynamic image.
Source device 10 and destination device 20 may each include one or more processors and a memory coupled to the one or more processors. The memory may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can store the desired program code in the form of instructions or data structures accessible by a computer. For example, source device 10 and destination device 20 may each be a desktop computer, a mobile computing device, a notebook (e.g., laptop) computer, a tablet computer, a set-top box, a telephone handset such as a so-called "smart" phone, a television, a camera, a display device, a digital media player, a video game console, an on-board computer, or the like.
Link 30 may include one or more media or devices capable of transmitting the encoded dynamic image from source device 10 to destination device 20. In one possible implementation, link 30 includes one or more communication media that enable source device 10 to transmit the encoded dynamic image directly to destination device 20 in real time. In the embodiments of this application, source device 10 may modulate the encoded dynamic image according to a communication standard, which may be a wireless communication protocol or the like, and transmit the modulated image to destination device 20. The one or more communication media may include wireless and/or wired communication media, for example a radio frequency (RF) spectrum or one or more physical transmission lines, and may form part of a packet-based network such as a local area network, a wide area network, or a global network (e.g., the Internet). The one or more communication media may also include routers, switches, base stations, or other devices that facilitate communication from source device 10 to destination device 20, which is not specifically limited in the embodiments of this application.
In one possible implementation, storage device 40 stores the received encoded dynamic image sent by source device 10, and destination device 20 retrieves the encoded dynamic image directly from storage device 40. In this case, storage device 40 may include any of a variety of distributed or locally accessed data storage media, such as a hard disk drive, a Blu-ray disc, a digital versatile disc (DVD), a compact disc read-only memory (CD-ROM), flash memory, volatile or non-volatile memory, or any other suitable digital storage medium for storing the encoded dynamic image.
In one possible implementation, storage device 40 corresponds to a file server or another intermediate storage device that holds the encoded dynamic image generated by source device 10, and destination device 20 streams or downloads the stored dynamic image from storage device 40. The file server may be any type of server capable of storing the encoded dynamic image and transmitting it to destination device 20, for example a network server, a file transfer protocol (FTP) server, a network attached storage (NAS) device, or a local disk drive. Destination device 20 may acquire the encoded dynamic image through any standard data connection, including an Internet connection, such as a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., a digital subscriber line (DSL) or cable modem), or a combination of both suitable for acquiring an encoded dynamic image stored on a file server. The transmission of the encoded dynamic image from storage device 40 may be a streaming transmission, a download transmission, or a combination of both.
The implementation environment shown in Fig. 1 is only one possible implementation, and the techniques of the embodiments of this application are applicable not only to the source device 10 that encodes a dynamic image and the destination device 20 that decodes the encoded dynamic image shown in Fig. 1, but also to other devices capable of encoding and decoding dynamic images, which is not specifically limited in the embodiments of this application.
In the implementation environment shown in Fig. 1, source device 10 includes a data source 120, an encoder 100, and an output interface 140. In some embodiments, output interface 140 may include a modulator/demodulator (modem) and/or a transmitter. Data source 120 may include an image capture device (e.g., a camera), an archive containing previously captured dynamic images, a feed interface for receiving dynamic images from a dynamic image content provider, and/or a computer graphics system for generating dynamic images, or a combination of these sources of dynamic images.
Data source 120 sends the dynamic image to encoder 100, which encodes the received dynamic image to obtain an encoded dynamic image and sends it to the output interface. In some embodiments, source device 10 sends the encoded dynamic image directly to destination device 20 via output interface 140; in other embodiments, the encoded dynamic image may also be stored on storage device 40 for later retrieval by destination device 20 for decoding and/or display.
In the implementation environment shown in Fig. 1, destination device 20 includes an input interface 240, a decoder 200, and a display device 220. In some embodiments, input interface 240 includes a receiver and/or a modem. Input interface 240 receives the encoded dynamic image via link 30 and/or from storage device 40 and sends it to decoder 200, which decodes the received encoded dynamic image to obtain the decoded dynamic image and sends it to display device 220. Display device 220 may be integrated with destination device 20 or external to it; in general, display device 220 displays the decoded dynamic image. Display device 220 may be any of several types of display device, for example a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or another type of display device.
Although not shown in Fig. 1, in some aspects encoder 100 and decoder 200 may each be integrated with an audio encoder and decoder, and may include appropriate multiplexer-demultiplexer (MUX-DEMUX) units or other hardware and software for encoding both audio and video in a common data stream or in separate data streams. In some embodiments, the MUX-DEMUX unit may conform to the ITU H.223 multiplexer protocol or to other protocols such as the user datagram protocol (UDP), if applicable.
Encoder 100 and decoder 200 may each be any of the following circuits: one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the techniques of the embodiments of this application are implemented partly in software, a device may store the software instructions in a suitable non-volatile computer-readable storage medium and execute them in hardware using one or more processors to implement those techniques. Any of the foregoing, including hardware, software, or a combination of hardware and software, may be regarded as one or more processors. Each of encoder 100 and decoder 200 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (codec) in the respective device.
The embodiments of this application may generally describe encoder 100 as "signaling" or "sending" certain information to another device, such as decoder 200. The terms "signaling" and "sending" generally refer to the transfer of syntax elements and/or other data used to decode a compressed dynamic image. This transfer may occur in real time or near real time. Alternatively, it may occur over a period of time, for example when, at encoding time, syntax elements are stored in the encoded bitstream on a computer-readable storage medium, from which the decoding device may retrieve them at any time after they are stored.
Referring to Fig. 2, Fig. 2 is a schematic diagram of an exemplary implementation environment provided by an embodiment of the present application. The implementation environment includes a cloud server 101 and a terminal device 201, which are communicatively connected. The connection may be wireless or wired, which is not limited in the embodiments of this application.
The cloud server 101 may be the source device 10 in the implementation environment shown in Fig. 1. The cloud server 101 is configured to encode the dynamic image and send the encoded dynamic image to the terminal device 201.
The terminal device 201 may be the destination device 20 in the implementation environment shown in Fig. 1. The terminal device 201 is configured to decode the encoded dynamic image sent by the cloud server 101 and display the decoded dynamic image.
Optionally, the terminal device 201 is further configured to capture images and transmit the captured images to the cloud server 101, which generates a dynamic image based on them, thereby providing a data source for the cloud server 101.
The terminal device 201 may be any electronic product that interacts with a user through one or more of a keyboard, a touch pad, a touch screen, a remote control, voice interaction, or a handwriting device, for example a personal computer (PC), a mobile phone, a smartphone, a personal digital assistant (PDA), a wearable device, a pocket PC (PPC), a tablet computer, a smart car, a smart television, or a smart speaker.
The cloud server 101 may be a single server, a server cluster composed of multiple servers, or a cloud computing service center.
Those skilled in the art should understand that the terminal device 201 and the cloud server 101 are only examples; other existing or future terminals or servers applicable to the embodiments of this application also fall within the scope of protection of the embodiments of this application and are hereby incorporated by reference.
Referring to fig. 3, fig. 3 is a schematic structural block diagram of an encoder 100 according to an embodiment of the present disclosure. The encoder 100 includes an encoding mode determining module 110, a semantic segmentation module 111, an image sequence extracting module 112, a position indication information encoding module 113, an image encoding module 114, a first video encoding module 115, a first code stream packaging module 116, a second video encoding module 117, and a second code stream packaging module 118.
The encoding mode determining module 110 is configured to determine an encoding mode of the moving image, that is, to determine whether the moving image is encoded in a region division encoding mode or a video encoding mode. The region division coding mode refers to a coding mode provided in the embodiments of the present application, and the video coding mode refers to a conventional coding mode. That is, the dynamic image may be encoded according to the encoding mode provided in the embodiment of the present application, or may be encoded according to a conventional video encoding mode.
When the moving picture is encoded in the region segmentation encoding mode, the encoder 100 includes a semantic segmentation module 111, an image sequence extraction module 112, a position indication information encoding module 113, an image encoding module 114, a first video encoding module 115, and a first code stream packaging module 116. When the moving picture is encoded in the video encoding mode, the encoder 100 includes a second video encoding module 117 and a second bitstream packing module 118.
The semantic segmentation module 111 is configured to perform semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask. The image sequence extracting module 112 is configured to extract a moving image sequence from the moving image, where the moving image sequence may be a sub-image sequence corresponding to a moving object in the moving image, or may be the moving image itself. The position indication information encoding module 113 is configured to encode position indication information to obtain a code stream including the encoded position indication information, where the position indication information may be the image segmentation mask or the coordinates of a specified position within the image region where a moving object is located in the dynamic image; these coordinates can be determined based on the image segmentation mask.
The image encoding module 114 is configured to encode a first frame image in the dynamic image to obtain a code stream of the encoded first frame image. It should be noted that the moving image sequence may be a sub-image sequence corresponding to a moving object in a moving image, or may be the moving image itself, and when the moving image sequence is a sub-image sequence corresponding to a moving object in a moving image, the first frame image needs to be encoded by the image encoding module 114. In the case where the moving image sequence is a moving image itself, the first frame image may not be encoded. At this time, the encoder 100 may not include the image encoding module 114.
The first video encoding module 115 is configured to encode the moving image sequence determined by the image sequence extraction module 112 to obtain a code stream of the encoded moving image sequence. The first code stream encapsulation module 116 is configured to encapsulate the code streams obtained by encoding by the position indication information encoding module 113, the image encoding module 114, and the first video encoding module 115, so as to obtain a combined code stream, and send the combined code stream to the output interface 140. The output interface 140 may send the merged code stream to the decoder 200.
It should be noted that, for the region segmentation coding mode, the present embodiment provides multiple implementations, and in different implementations, the encoder 100 may include all of the position indication information coding module 113, the image coding module 114, and the first video coding module 115, or may include some of the position indication information coding module 113, the image coding module 114, and the first video coding module 115.
The second video encoding module 117 is configured to encode the dynamic image in a video encoding manner, and obtain a code stream including the encoded dynamic image. The second code stream encapsulation module 118 is configured to encapsulate the code stream encoded by the second video encoding module 117, and send the encapsulated code stream to the output interface 140. The output interface 140 may send the merged codestream to the decoder 200.
It should be understood that the encoder 100 shown in fig. 3 is only one implementation provided for embodiments of the present application, and in other implementations, the encoder 100 may include more or fewer modules than those shown in fig. 3. The embodiment of the present application does not limit this.
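For illustration only, the division of labor among the modules of fig. 3 can be summarized in the following Python skeleton. Every class, method, and variable name here is a hypothetical choice made for this sketch, and the per-module operations are deliberately left abstract:

```python
# Hypothetical skeleton of encoder 100 (fig. 3); all names are illustrative.
class Encoder:
    def encode(self, frames):
        """frames: the images of the dynamic image, first frame at index 0."""
        if self.choose_region_segmentation_mode(frames):      # module 110
            mask = self.semantic_segment(frames[0])           # module 111
            sequences = self.extract_sequences(frames, mask)  # module 112
            parts = [self.encode_position_info(mask),         # module 113
                     self.encode_first_frame(frames[0])]      # module 114
            parts += [self.encode_video(s) for s in sequences]  # module 115
            return self.package(parts)                        # module 116
        # Conventional video coding mode.
        return self.package([self.encode_video(frames)])      # modules 117, 118
```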
Referring to fig. 4, fig. 4 is a schematic structural block diagram of a decoder 200 according to an embodiment of the present disclosure. The decoder 200 includes a decoding mode determination module 210, a position indication information decoding module 211, an image decoding module 212, a first video decoding module 213, an image synthesizing module 214, and a second video decoding module 215.
The decoding mode determining module 210 is configured to determine the decoding mode of the moving image, that is, to determine whether the moving image is decoded in the region segmentation decoding mode or in the video decoding mode. The region segmentation decoding mode refers to the decoding mode provided in the embodiments of the present application, and the video decoding mode refers to a conventional decoding mode. That is, in the case where a moving image is encoded according to the encoding mode provided in the embodiments of the present application, it can be decoded according to the decoding mode provided in the embodiments of the present application, and in the case where a moving image is encoded according to a conventional encoding mode, it can be decoded according to a conventional video decoding mode.
When a moving image is decoded in the region segmentation decoding mode, the decoder 200 includes the position indication information decoding module 211, the image decoding module 212, the first video decoding module 213, and the image synthesizing module 214. When the moving image is decoded in the video decoding mode, the decoder 200 includes the second video decoding module 215.
The position indication information decoding module 211 is configured to decode a code stream including the encoded position indication information to obtain the position indication information. The position indication information may be an image segmentation mask, or may be coordinates of a specified position in the moving image in the image region where the moving object is located.
The image decoding module 212 is configured to parse the first frame image from the code stream. It should be noted that the moving image sequence may be a sub-image sequence corresponding to a moving object in the moving image, or may be the moving image itself. In the case where the moving image sequence is a sub-image sequence corresponding to a moving object, the code stream transmitted by the encoding end includes the code stream of the encoded first frame image, and the image decoding module 212 is configured to decode that code stream to obtain the first frame image. In the case where the moving image sequence is the moving image itself, the image decoding module 212 is configured to parse the first frame image from the code stream including the encoded moving image.
The first video decoding module 213 is configured to decode a code stream including an encoded moving image sequence to obtain the moving image sequence. The moving image sequence may be a sub-image sequence corresponding to a moving object in the moving image, or may be the moving image itself; the cases are described separately in the following embodiments and are not detailed here. The image synthesizing module 214 is configured to combine the outputs of the position indication information decoding module 211, the image decoding module 212, and the first video decoding module 213 to obtain the moving image, and to transmit the moving image to the display device 220. The display device 220 may display the moving image.
It should be noted that, for the region segmentation decoding mode, this embodiment provides a plurality of implementations; in different implementations, the decoder 200 may include all of the position indication information decoding module 211, the image decoding module 212, and the first video decoding module 213, or only some of them.
The second video decoding module 215 is configured to decode a code stream including the encoded moving image to obtain the moving image. Thereafter, the moving image may be transmitted to the display device 220. The display device 220 may display the moving image.
It should be understood that the decoder 200 shown in fig. 4 is only one implementation provided for the embodiments of the present application, and in other implementations, the decoder 200 may include more or fewer modules than those shown in fig. 4. The embodiment of the present application does not limit this.
Next, a moving image encoding and decoding method provided in an embodiment of the present application will be described. It should be noted that, in conjunction with the implementation environment shown in fig. 1, any of the following methods for encoding a moving image may be performed by the encoder 100 in the source device 10. Taking fig. 2 as an example, any of the following methods for encoding a moving image may be performed by the cloud server 101 in fig. 2. Any of the following moving picture decoding methods may be performed by the decoder 200 in the destination device 20. Taking fig. 2 as an example, any of the following methods for decoding a moving image may be performed by the terminal device 201 in fig. 2.
In the encoding method of a dynamic image provided in the embodiments of the present application, semantic segmentation may be performed on any frame of image in the dynamic image to obtain an image segmentation mask, where the dynamic image includes a plurality of objects and the image segmentation mask includes a plurality of image regions corresponding to the plurality of objects one to one. Based on the dynamic image, a moving image sequence is determined, each frame of which includes the image area where one or more moving objects of the plurality of objects are located. Based on the image segmentation mask, position indication information is determined, which indicates the position of the image area where the one or more moving objects are located. The moving image sequence and the position indication information are then encoded into a code stream.
Because only the image area where a moving object is located changes in the dynamic image while the image area where a still object is located does not, each frame of image in the moving image sequence includes the image area where one or more moving objects among the plurality of objects are located, and the position indication information indicates the position of the image area where the one or more moving objects are located. Encoding the moving image sequence and the position indication information into the code stream is therefore sufficient for the dynamic image to be decoded subsequently; the image area where the still objects are located does not need to be encoded into the code stream, which improves encoding efficiency.
In the method for decoding a dynamic image provided in the embodiment of the present application, a first frame image may be parsed from a code stream, and a dynamic image sequence and position indication information may be parsed from the code stream, where each frame image in the dynamic image sequence includes an image area where one or more moving objects are located, and the position indication information is used to indicate a position of the image area where the one or more moving objects are located. And rendering and displaying an image area where the one or more moving objects are positioned in the first frame image based on the moving image sequence and the position indication information to obtain a dynamic image.
That is, when decoding a moving image, after the first frame image is decoded, only the image area where the moving object is located needs to be decoded for the subsequent image, and the image area where the still object is located does not need to be decoded, thereby effectively reducing the decoding complexity and power consumption. In addition, in the display process of the dynamic image, the image area where the moving object is located is rendered and refreshed to be displayed only on the basis of the first frame image, so that the display power consumption is effectively reduced.
It should be noted that the moving image sequence may include one or more sub-image sequences corresponding to the one or more moving objects one to one, or may be the moving image itself. The position indication information may be an image segmentation mask, or may be coordinates of a specified position of an image area where each of the one or more moving objects is located in the dynamic image. Therefore, the encoding and decoding method for moving pictures provided in the embodiments of the present application will be explained in detail below in a plurality of embodiments.
Referring to fig. 5, fig. 5 is a flowchart illustrating a first method for encoding a moving image according to an embodiment of the present disclosure. In the method, the moving image sequence includes one or more sub-image sequences, and the position indication information is an image division mask. The encoding method includes the following steps.
Step 501: and performing semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask, wherein the dynamic image comprises a plurality of objects, the image segmentation mask comprises a plurality of image areas in one-to-one correspondence with the objects, and the image segmentation mask is used for indicating the positions of the image areas where the one or more moving objects are located.
Because the position area occupied by each object in the dynamic image is essentially unchanged and only the content of the object varies, the embodiments of the present application can perform semantic segmentation on any frame of image in the dynamic image to obtain the image segmentation mask. In general, semantic segmentation may be performed on the first frame image in the moving image to obtain the image segmentation mask.
In addition, since the image segmentation mask includes a plurality of image regions corresponding to the plurality of objects one to one, in order to facilitate distinguishing the objects, the image regions corresponding to different objects are generally represented by different pixel values, and the image region corresponding to the same object is represented by the same pixel value.
It should be noted that each object in the dynamic image may be a single individual in the dynamic image. For example, in the case where the dynamic image includes a user, a lawn, a hill, a river, and a sky, the plurality of objects in the dynamic image include the user, the lawn, the hill, the river, and the sky.
In addition, the plurality of objects included in the moving image are generally divided into moving objects and still objects. A moving object is an object that itself changes and may be referred to as an object in a moving state. For example, the water of a river in the moving image varies, and the five sense organs or limbs of the user vary, so the river and the user may be referred to as moving objects. A still object is an object that does not vary and may be referred to as an object in a still state; for example, the grassland, hills, and sky in the moving image do not vary, so they may be referred to as still objects.
Step 502: one or more sub-image sequences are extracted based on the image segmentation mask and the dynamic image, and the one or more sub-image sequences are in one-to-one correspondence with one or more moving objects in the plurality of objects.
The extraction process is the same for each sub-image sequence. Therefore, in some embodiments, a moving object may be selected from the one or more moving objects, and the sub-image sequence corresponding to the selected moving object may be determined according to the following operations, repeated until the sub-image sequence corresponding to each moving object has been determined: determine the position area where the selected moving object is located based on the image segmentation mask, and extract, based on that position area, the image area where the selected moving object is located from each frame of the dynamic image other than the first frame, to obtain the sub-image sequence corresponding to the selected moving object.
Since the image segmentation mask includes a plurality of image regions corresponding to the plurality of objects one to one, the image region of each object has already been delimited; as described above, the image regions of the same object in the image segmentation mask are represented by the same pixel value, and the image regions of different objects are represented by different pixel values. The implementation process of determining the position area where the selected moving object is located based on the image segmentation mask therefore includes: scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, the set comprising the coordinates of a plurality of pixel points, and determining the position area formed by this pixel coordinate set as the position area where the selected moving object is located.
That is, by scanning each pixel point in the image segmentation mask, the pixel point whose pixel value is the pixel value corresponding to the selected moving object is determined, the coordinates of the pixel points are determined as the pixel coordinate set corresponding to the selected moving object, and the position area where the selected moving object is located can be determined, where the position area is the position where the selected moving object is actually located, and the boundary of the position area is the outline of the selected moving object.
In general, the area formed by the contour of the moving object is an irregular area, that is, the position area where the moving object is located is not a regular area, and therefore, in some embodiments, the image area in the position area where the moving object is located can be directly extracted from each frame of image except the first frame of image in the dynamic image. Of course, in other embodiments, the position area where the moving object is located may also be processed into a regular area, and then the image area in the regular area is extracted from each frame of image in the dynamic image except the first frame of image.
That is, based on the position area of the selected moving object, the implementation process of extracting the image area where the selected moving object is located from each frame of the dynamic image other than the first frame includes: extracting, from each such frame, the image area located within the position area where the selected moving object is located; or expanding the position area where the selected moving object is located so that the expanded position area is a rectangular area, and extracting, from each such frame, the image area located within the expanded position area.
It should be noted that there are various implementations for expanding the position area where the moving object is located. For example, the minimum abscissa, minimum ordinate, maximum abscissa, and maximum ordinate are determined from the set of pixel coordinates corresponding to the selected moving object; the rectangular area whose abscissas lie between the minimum and maximum abscissa and whose ordinates lie between the minimum and maximum ordinate is then determined as the expanded position area. Alternatively, a circumscribed rectangular region of the position area where the moving object is located is drawn directly and determined as the expanded position area. The embodiments of the present application do not limit the expansion manner; the expanded position area only needs to contain the position area where the moving object is located.
For example, for the moving object K, the pixel value of the moving object K in the image segmentation mask is Mk. Each pixel point in the image segmentation mask is scanned, and the coordinates of the pixel points whose pixel value is Mk are determined, so as to obtain a pixel coordinate set: {(x_k1, y_k1), (x_k2, y_k2), ..., (x_kN, y_kN)}, where N is the number of pixel points with the pixel value Mk. At this time, the minimum abscissa min_Xk, the minimum ordinate min_Yk, the maximum abscissa max_Xk, and the maximum ordinate max_Yk can be determined, that is, min_Xk = min{x_k1, x_k2, ..., x_kN}, min_Yk = min{y_k1, y_k2, ..., y_kN}, max_Xk = max{x_k1, x_k2, ..., x_kN}, max_Yk = max{y_k1, y_k2, ..., y_kN}. The rectangular area in which the coordinates in the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} are located is then determined as the expanded position area corresponding to the moving object K. Then, an image area located within the expanded position area is extracted from each frame image other than the first frame image in the dynamic image.
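Purely as an illustration of the scan-and-expand computation just described, the following sketch assumes the image segmentation mask and the frames are NumPy arrays; the function names and the use of NumPy are choices made for this sketch, not elements of the application:

```python
import numpy as np

def expanded_region(mask: np.ndarray, mk: int):
    """Bounding rectangle (min_Xk, min_Yk, max_Xk, max_Yk) of the pixels
    whose value in the segmentation mask is mk; None if there are none."""
    ys, xs = np.nonzero(mask == mk)   # row (y) and column (x) coordinates
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def extract_subimages(frames, region):
    """Crop the expanded position area from every frame except the first,
    giving the sub-image sequence of one moving object."""
    min_x, min_y, max_x, max_y = region
    return [f[min_y:max_y + 1, min_x:max_x + 1] for f in frames[1:]]
```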
Step 503: and coding the first frame image in the dynamic image, the one or more sub-image sequences and the image segmentation mask into a code stream.
The first frame image in the dynamic image and the image segmentation mask can each be encoded into the code stream with an image encoder. Each of the one or more sub-image sequences can be encoded into the code stream with a video encoder.
For ease of description, the image encoder employed for the first frame of image in the moving image is referred to as the first image encoder, the image encoder employed for the image segmentation mask is referred to as the second image encoder, and the video encoder employed for the sequence of one or more sub-images is referred to as the first video encoder. Wherein the first image encoder and the second image encoder may be the same or different.
In general, the pixel values of the pixel points within the image region of the same object differ from one another in the first frame image of the moving image, whereas in the image segmentation mask they are identical. Therefore, the first frame image may be encoded with an image encoder of higher encoding efficiency, while a general-purpose image encoder may be used for the image segmentation mask.
It should be noted that the encoding side and the decoding side may agree on the first image encoder, the second image encoder, and the first video encoder in advance. Of course, the first image encoder, the second image encoder, and the first video encoder may also be selected by the user; in that case, the type of the first image encoder, the type of the second image encoder, and the type of the first video encoder also need to be coded into the code stream. These image encoders and video encoders may be encoders already included at the encoding side.
The code streams obtained by the above encoding need to be encapsulated to obtain a combined code stream, which is then transmitted to the decoding end.
The code streams may be encapsulated by adopting the ISO base media file format (ISOBMFF, ISO/IEC 14496-12, MPEG-4 Part 12), which is not limited in the embodiments of the present application. Of course, the embodiments of the present application may also extend the HEIF (ISO/IEC 23008-12) format to encapsulate the above code streams.
For example, it is assumed that the embodiment of the present application adds a derived image sequence of type 'sovl' to the high efficiency image file format (HEIF, ISO/IEC 23008-12), where the derived image sequence is obtained by superimposing the one or more sub-image sequences on the first frame image. The one or more sub-image sequences and the first frame image are specified by a sequence reference box. The one or more sub-image sequences are encapsulated in a track specified by the HEIF standard, and the first frame image is encapsulated in an item specified by the HEIF standard.
The syntax of the derived image sequence is as follows:
(The syntax table of the derived image sequence is reproduced only as a figure in the source; its fields are described below.)
where output_width and output_height are the width and height of the output derived image sequence.
reference_count is determined by the sequence reference box and indicates the number of the one or more sub-image sequences.
horizontal_offset and vertical_offset represent the offset of a sub-image sequence relative to the upper left corner of the first frame image.
(The syntax table of the sequence reference box is likewise reproduced only as a figure in the source; its fields are described below.)
where from_track_id represents the identity of the derived image sequence, to_item_id represents the identity of the first frame image, reference_count represents the number of the one or more sub-image sequences, and to_track_id represents the identity of a sub-image sequence.
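Because both syntax tables appear only as figures in the source, the following is no more than a hypothetical in-memory rendering of the fields just listed, written in Python for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SequenceReferenceBox:
    from_track_id: int        # identity of the derived image sequence
    to_item_id: int           # identity of the first frame image
    to_track_ids: List[int]   # one identity per referenced sub-image sequence

    @property
    def reference_count(self) -> int:
        return len(self.to_track_ids)

@dataclass
class DerivedImageSequence:
    output_width: int              # width of the output derived image sequence
    output_height: int             # height of the output derived image sequence
    horizontal_offsets: List[int]  # per sub-image sequence, offsets relative to
    vertical_offsets: List[int]    # the top-left corner of the first frame image
```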
In the embodiment of the application, only the image area where a moving object is located changes in the dynamic image, while the image area where a still object is located does not, and the image segmentation mask indicates the position of the image area where the one or more moving objects are located. Therefore, after the image area where each moving object is located is extracted from each frame of the dynamic image other than the first frame, the first frame image, the image segmentation mask, and the extracted image areas are encoded into the code stream, and the dynamic image can be decoded subsequently. That is, the image areas where the moving objects are located are separated from the image areas where the still objects are located, and only the former are encoded into the code stream; the image areas where the still objects are located do not need to be encoded, which improves encoding efficiency. In addition, the encoders already included at the encoding side can be reused directly, and only the resulting code streams need to be encapsulated, so no dedicated encoder has to be designed.
Referring to fig. 6, fig. 6 is a flowchart of a first method for decoding a moving image according to an embodiment of the present application, where the decoding method corresponds to the encoding method shown in fig. 5. The decoding method includes the following steps.
Step 601: and analyzing the first frame image from the code stream.
Based on the above description, the image encoder employed for the first frame image is referred to as a first image encoder, and for convenience of description, the image decoder employed for the first frame image may also be referred to as a first image decoder.
Since the first image encoder may be agreed in advance between the encoding end and the decoding end, or selected by the user during encoding, two cases arise. In the case where the first image encoder is agreed in advance, the first image decoder is also agreed in advance, and the first frame image can be analyzed directly from the code stream with the agreed first image decoder. In the case where the first image encoder is selected by the user, the type of the first image encoder is analyzed from the code stream, the first image decoder is determined based on that type, and the first frame image is then analyzed from the code stream with the determined first image decoder.
Step 602: one or more sub-image sequences and an image segmentation mask are analyzed from the code stream, the image segmentation mask comprises a plurality of image areas which are in one-to-one correspondence with a plurality of objects, the one or more sub-image sequences are in one-to-one correspondence with one or more moving objects which are included in the plurality of objects, and the image segmentation mask is used for indicating the positions of the image areas where the one or more moving objects are located.
Based on the above description, the image encoder employed by the image segmentation mask is referred to as a second image encoder, and for convenience of description, the image decoder employed by the image segmentation mask may also be referred to as a second image decoder. Similarly, the video decoder employed by the one or more sub-picture sequences is referred to as the first video decoder.
The second image encoder may likewise be agreed in advance between the encoding end and the decoding end, or selected by the user during encoding. In the case where the second image encoder is agreed in advance, the second image decoder is also agreed in advance, and the image segmentation mask can be analyzed directly from the code stream with the agreed second image decoder. In the case where the second image encoder is selected by the user, the type of the second image encoder is analyzed from the code stream, the second image decoder is determined based on that type, and the image segmentation mask is then analyzed from the code stream with the determined second image decoder.
Similarly, the first video encoder may be agreed in advance between the encoding end and the decoding end, or selected by the user during encoding. In the case where the first video encoder is agreed in advance, the first video decoder is also agreed in advance, and each of the one or more sub-image sequences can be analyzed directly from the code stream with the agreed first video decoder. In the case where the first video encoder is selected by the user, the type of the first video encoder is analyzed from the code stream, the corresponding first video decoder is determined based on that type, and each of the one or more sub-image sequences is then analyzed from the code stream with the determined first video decoder.
Step 603: and rendering and displaying the image area where the one or more moving objects are positioned in the first frame image based on the one or more sub-image sequences and the image segmentation mask to obtain a dynamic image.
The process of rendering and displaying the image area where each moving object is located in the first frame image is the same, so in some embodiments, one moving object may be selected from the one or more moving objects, and the image area where the selected moving object is located may be rendered and displayed according to the following operations until the image area where each moving object is located is rendered and displayed: based on the image segmentation mask, the position of the image region where the selected moving object is located is determined. And according to the position of the image area where the selected moving object is located, rendering and displaying the image area included in the sub-image sequence corresponding to the selected moving object in the first frame image.
Since the image segmentation mask includes a plurality of image regions corresponding to the plurality of objects one to one, the image region of each object has already been delimited; as described above, the image regions of the same object in the image segmentation mask are represented by the same pixel value, and the image regions of different objects are represented by different pixel values. The implementation process of determining the position of the image region where the selected moving object is located based on the image segmentation mask therefore includes: scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, the set comprising the coordinates of a plurality of pixel points. The position area formed by the set of pixel coordinates is determined as the position of the image area where the selected moving object is located; or the position area formed by the set of pixel coordinates is expanded so that the expanded position area is a rectangular area, and the expanded position area is determined as the position of the image area where the selected moving object is located.
That is, by scanning each pixel point in the image segmentation mask, the pixel point whose pixel value is the pixel value corresponding to the selected moving object is determined, the coordinates of the pixel points are determined as the pixel coordinate set corresponding to the selected moving object, and the position of the image area where the selected moving object is located in the dynamic image can be determined.
In general, the area formed by the contour of the moving object is an irregular area, that is, the location area formed by the set of pixel coordinates is not a regular area, and therefore, in some embodiments, the location area formed by the set of pixel coordinates corresponding to the moving object may be directly determined as the location of the image area where the moving object is located in the dynamic image. Of course, in other embodiments, the position area formed by the pixel coordinate set may also be processed as a regular area, and then the position of the regular area is determined as the position of the image area where the moving object is located in the dynamic image.
It should be noted that, for the implementation process of performing the expansion processing on the location area formed by the pixel coordinate set, reference may be made to the related description in step 502, and details of this implementation are not described herein again.
In addition, the rendering order of the image areas where the one or more moving objects are located is consistent with the order of their code streams within the whole code stream.
In the case where the encoding end encapsulates the code streams by extending the HEIF (ISO/IEC 23008-12) format, the embodiment of the present application can obtain the code stream of the first frame image through to_item_id and decode it to obtain the first frame image, obtain the code stream of each sub-image sequence according to to_track_id and decode it to obtain the sub-image sequence, and then superimpose the one or more sub-image sequences on the first frame image according to horizontal_offset and vertical_offset, in the order in which the to_track_id values are parsed, to obtain the reconstructed image of the derived image sequence, that is, the reconstructed dynamic image.
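The superposition just described can be sketched as follows, assuming the decoded first frame and sub-images are NumPy arrays and the offsets have been read from the boxes above; all names are illustrative:

```python
import numpy as np

def compose(first_frame: np.ndarray, sub_images, offsets):
    """Superimpose one time instant's decoded sub-images onto a copy of the
    first frame, in the order in which the to_track_id values were parsed.
    offsets[i] is the (horizontal_offset, vertical_offset) of sub_images[i]."""
    out = first_frame.copy()
    for sub, (dx, dy) in zip(sub_images, offsets):
        h, w = sub.shape[:2]
        out[dy:dy + h, dx:dx + w] = sub
    return out
```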
The one or more moving objects mentioned in steps 601-603 above may be all moving objects in a plurality of objects included in the dynamic image. Of course, the one or more moving objects may also be part of the moving objects in the plurality of objects. That is, for moving objects in a moving image, it can be determined at the decoding end whether all the moving objects are in a moving state or a part of the moving objects need to be screened out again.
That is, an object selection instruction for selecting one or more objects from the plurality of objects included in the moving image is received, and the one or more objects selected by the object selection instruction are determined as the one or more moving objects in the above steps.
The object selection instruction may be triggered by a user based on the first frame image, for example, all moving objects in the dynamic image are marked in the first frame image, the user may select some or all objects from all moving objects in the first frame image, and the selected objects are one or more moving objects in the above steps.
In this embodiment of the application, since the image segmentation mask is used to indicate positions of the image areas where the one or more moving objects are located in the dynamic image, after the first frame image is parsed from the code stream, the image areas where the one or more moving objects are located in the first frame image may be rendered and displayed according to the position of the image area where each moving object is located in the dynamic image. That is, when decoding a moving image, after the first frame image is decoded, only the image area where the moving object is located needs to be decoded for the subsequent image, and the image area where the still object is located does not need to be decoded, thereby effectively reducing the decoding complexity and power consumption. In addition, in the display process of the dynamic image, the image area where the moving object is located is rendered and refreshed to be displayed only on the basis of the first frame image, so that the display power consumption is effectively reduced.
Next, with reference to fig. 7, an exemplary description will be given of a method for encoding and decoding a moving image according to the embodiments shown in fig. 5 and 6.
Encoding-side steps:
1) The user selects the encoders to use, and the following syntax elements are encoded at the system level to indicate them:
image_codec_type: the image encoder type for encoding the first frame image. For example, image_codec_type may take 0 or 1, where 0 represents the Joint Photographic Experts Group (JPEG) format and 1 represents the Portable Network Graphics (PNG) format; other types of encoders, such as Better Portable Graphics (BPG), may also be indicated, without limitation.
mask_codec_type: the image encoder type for encoding the image segmentation mask, such as JPEG or PNG; other types of encoders, such as BPG, may also be indicated, without limitation.
video_codec_type: the video encoder type for encoding the sub-image sequences, such as H.265. Other types of encoders, such as H.264, may also be indicated, without limitation.
2) Calling a corresponding encoder according to image_codec_type to encode the first frame image, where an efficient image encoder can be used;
3) Calling a corresponding encoder according to mask_codec_type to encode the image segmentation mask, where a common image encoder can be used;
4) Extracting, using the image segmentation mask, the image areas where the moving objects are located from the images other than the first frame image in the dynamic image, forming a plurality of sub-image sequences. The pixel value of object K in the mask is denoted Mk. The image area where object K is located is extracted as follows:
Looping over each moving object: assuming the image region of object K is currently being extracted, scanning the image segmentation mask line by line and recording the coordinates whose pixel value is Mk, which form the set: {(x_k1, y_k1), (x_k2, y_k2), ..., (x_kN, y_kN)}, where N is the number of coordinate points;
finding the minimum and maximum values in the coordinates:
min_Xk = min{x_k1, x_k2, ..., x_kN}
min_Yk = min{y_k1, y_k2, ..., y_kN}
max_Xk = max{x_k1, x_k2, ..., x_kN}
max_Yk = max{y_k1, y_k2, ..., y_kN}
The area in which the coordinates in the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} are located is the position of object K.
Extracting the rectangular area at the position of object K from the images other than the first frame image in the dynamic image, to be used as a sub-image sequence.
5) Calling a corresponding video encoder according to video_codec_type to encode each sub-image sequence;
6) Splicing, encapsulating (and transmitting) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard.
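Gathering the six encoding-side steps into one place, a minimal sketch could look as follows. It reuses expanded_region and extract_subimages from the earlier sketch, and the registry dictionaries as well as every function name are assumptions of this sketch rather than elements of the application:

```python
def encode_side(frames, mask, object_values,
                image_codec_type, mask_codec_type, video_codec_type,
                image_encoders, video_encoders):
    """Steps 1)-5): record the codec-type syntax elements, encode the first
    frame and the mask, then one sub-image sequence per moving object.
    The result would then be spliced and encapsulated per ISOBMFF (step 6)."""
    streams = {
        "image_codec_type": image_codec_type,   # e.g. 0 = JPEG, 1 = PNG
        "mask_codec_type": mask_codec_type,
        "video_codec_type": video_codec_type,
        "first_frame": image_encoders[image_codec_type](frames[0]),
        "mask": image_encoders[mask_codec_type](mask),
        "objects": [],
    }
    for mk in object_values:                    # loop over the moving objects
        region = expanded_region(mask, mk)      # min/max coordinates, step 4)
        if region is not None:
            sub_seq = extract_subimages(frames, region)
            streams["objects"].append(video_encoders[video_codec_type](sub_seq))
    return streams
```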
Decoding-side steps:
1) The following information is decoded at the system layer:
image_codec_type
mask_codec_type
video_codec_type
2) Calling a corresponding image decoder according to image_codec_type to decode and display the first frame image;
3) Calling a corresponding decoder according to mask_codec_type to decode the image segmentation mask;
4) The position of each moving object in the image is determined using the image segmentation mask. The pixel value of the object K in the image segmentation mask is denoted Mk. The moving object position is determined as follows:
Looping over each moving object: assuming the position of object K is currently being determined, scanning the image segmentation mask line by line and recording the coordinates whose pixel value is Mk. These coordinates constitute the set: {(x_k1, y_k1), (x_k2, y_k2), ..., (x_kN, y_kN)};
finding the minimum and maximum values in the coordinates:
min_Xk = min{x_k1, x_k2, ..., x_kN}
min_Yk = min{y_k1, y_k2, ..., y_kN}
max_Xk = max{x_k1, x_k2, ..., x_kN}
max_Yk = max{y_k1, y_k2, ..., y_kN}
The area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the position of object K.
5) Calling a corresponding decoder according to video_codec_type to decode the sub-image sequences;
6) Rendering and refreshing the display of each moving object at its corresponding position. The rendering order of the objects is consistent with the order of their code streams in the whole code stream.
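The mirror image of that sketch on the decoding side might look as follows; again, the decoder registries and all names are assumptions of this sketch:

```python
def decode_side(streams, image_decoders, video_decoders):
    """Steps 1)-5): read the codec types, decode the first frame and the mask,
    then each sub-image sequence. The positions at which the sub-images are
    rendered in step 6) are recovered from the mask as in the earlier sketch."""
    first = image_decoders[streams["image_codec_type"]](streams["first_frame"])
    mask = image_decoders[streams["mask_codec_type"]](streams["mask"])
    sub_seqs = [video_decoders[streams["video_codec_type"]](s)
                for s in streams["objects"]]   # kept in code-stream order
    return first, mask, sub_seqs
```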
User interactivity can be added on the basis of fig. 7; that is, the user at the decoding end can choose to set a specific object in motion while the other areas remain still. Next, with reference to fig. 8, an exemplary description will be given of a method for encoding and decoding a moving image according to the embodiments shown in fig. 5 and 6.
Encoding-side steps:
1) The user selects the type of encoder to use (e.g., an H.265 encoder), and the following syntax elements are encoded at the system level to indicate it:
image_codec_type: the image encoder type for encoding the first frame image. For example, image_codec_type may take 0 or 1, where 0 represents JPEG and 1 represents PNG; other types of encoders, such as BPG, may also be indicated, without limitation.
mask_codec_type: the image encoder type for encoding the image segmentation mask, such as JPEG or PNG; other types of encoders, such as BPG, may also be indicated, without limitation.
video_codec_type: the video encoder type for encoding the sub-image sequences or the dynamic image itself, such as H.265. Other types of encoders, such as H.264, may also be indicated, without limitation.
2) Calling a corresponding encoder according to image_codec_type to encode the first frame image, where an efficient image encoder can be used;
3) Calling a corresponding encoder according to mask_codec_type to encode the image segmentation mask, where a common image encoder can be used;
4) Extracting, using the image segmentation mask, the image areas where the moving objects are located from the images other than the first frame image in the dynamic image, forming a plurality of sub-image sequences. The pixel value of object K in the mask is denoted Mk. The image region in which object K is located is extracted as follows:
Looping over each moving object: assuming the image region of object K is currently being extracted, scanning the image segmentation mask line by line and recording the coordinates whose pixel value is Mk, which form the set: {(x_k1, y_k1), (x_k2, y_k2), ..., (x_kN, y_kN)}, where N is the number of coordinate points;
finding the minimum and maximum values in the coordinates:
min_Xk = min{x_k1, x_k2, ..., x_kN}
min_Yk = min{y_k1, y_k2, ..., y_kN}
max_Xk = max{x_k1, x_k2, ..., x_kN}
max_Yk = max{y_k1, y_k2, ..., y_kN}
The area in which the coordinates in the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} are located is the position of object K.
Extracting the rectangular area at the position of object K from the images other than the first frame image in the dynamic image, to be used as a sub-image sequence.
5) Calling a corresponding video encoder according to video_codec_type to encode each sub-image sequence;
6) Splicing, encapsulating (and transmitting) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard.
Decoding-side steps:
1) The system layer decodes the following information:
image_codec_type;
mask_codec_type;
video_codec_type。
2) Selecting a corresponding image decoder according to image_codec_type to decode and display the first frame image;
3) The user clicks a position in the first frame image and selects the object at that position to be set in motion, or chooses to have all objects in motion;
4) Calling a corresponding decoder according to mask_codec_type to decode the image segmentation mask;
5) Determining the coordinate range of each moving object by using the scheme in step 6), and determining, according to the position (x, y) clicked by the user, the position range and the code stream index of the moving object to which the clicked position belongs. Decoding the sub-code stream of the selected moving object to obtain the reconstruction of the selected moving object (a lookup sketch follows this list);
6) The position of the moving object in the image is determined using the image segmentation mask. The pixel value of the object K in the image segmentation mask is denoted Mk. The specific object position determination method is as follows:
Assuming that the position of object K is currently being determined, scanning the image segmentation mask line by line and recording the coordinates whose pixel value is Mk; these coordinates form the set: {(x_k1, y_k1), (x_k2, y_k2), ..., (x_kN, y_kN)};
finding the minimum and maximum values in the coordinates:
min_Xk = min{x_k1, x_k2, ..., x_kN}
min_Yk = min{y_k1, y_k2, ..., y_kN}
max_Xk = max{x_k1, x_k2, ..., x_kN}
max_Yk = max{y_k1, y_k2, ..., y_kN}
The area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the position of object K.
7) Rendering and refreshing the display of each selected moving object at its corresponding position. The rendering order of the objects is consistent with the order of their code streams in the whole code stream.
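A minimal sketch of the lookup used in step 5), assuming the per-object bounding rectangles have already been computed as in step 6); the function and variable names are illustrative:

```python
def object_at(x, y, regions):
    """Return the index of the moving object whose rectangle contains the
    clicked position (x, y), or None if the click falls in a still area.
    regions[k] = (min_Xk, min_Yk, max_Xk, max_Yk); since the sub-code
    streams are stored in the same order, the index doubles as the
    code stream index of the object."""
    for k, (min_x, min_y, max_x, max_y) in enumerate(regions):
        if min_x <= x <= max_x and min_y <= y <= max_y:
            return k
    return None
```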
Referring to fig. 9, fig. 9 is a flowchart of a second method for encoding a moving image according to an embodiment of the present disclosure. In the method, the moving image sequence includes one or more sub-image sequences, and the position indication information includes coordinates of one or more specified positions. The encoding method includes the following steps.
Step 901: and performing semantic segmentation on any frame of image in the dynamic image to obtain an image segmentation mask, wherein the dynamic image comprises a plurality of objects, and the image segmentation mask comprises a plurality of image areas which correspond to the objects one to one.
The content of step 901 may refer to the related description in step 501, which is not described again in this embodiment of the present application.
Step 902: one or more sub-image sequences are extracted based on the image segmentation mask and the dynamic image, and the one or more sub-image sequences are in one-to-one correspondence with one or more moving objects in the plurality of objects.
The content of step 902 may refer to the related description in step 502, which is not described herein again in this embodiment of the application.
Step 903: and determining the coordinates of the specified position in the dynamic image in the image area where each of the one or more moving objects is located to obtain the coordinates of the one or more specified positions.
The position of the image area where each moving object is located has already been determined in the process of extracting the sub-image sequences in step 902, that is, the position area where each moving object is located, or the rectangular area obtained by expanding that position area. Therefore, the coordinates of the specified position within the image area where each moving object is located can be determined directly.
The specified position within the image region where a moving object is located may be the position with the minimum coordinates, the position with the maximum coordinates, or the geometric center point. Of course, other positions may also be adopted, which is not limited in the embodiments of the present application.
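For illustration, each of the three candidate specified positions can be read directly off the bounding rectangle determined in step 902; the function and parameter names below are hypothetical:

```python
def specified_position(region, mode="min"):
    """Coordinates of the specified position within an object's image area:
    the position with the minimum coordinates, the position with the maximum
    coordinates, or the geometric center point of the rectangle."""
    min_x, min_y, max_x, max_y = region
    if mode == "min":
        return min_x, min_y
    if mode == "max":
        return max_x, max_y
    return (min_x + max_x) // 2, (min_y + max_y) // 2   # geometric center
```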
Step 904: and coding the first frame image in the dynamic image, the one or more sub-image sequences and the one or more coordinates of the specified positions into a code stream.
Optionally, the number of the one or more moving objects may also be coded into the code stream in the embodiment of the present application. In this way, for the decoding end, it can be determined whether the sub-image sequence with transmission failure exists in the one or more sub-image sequences based on the number of the one or more moving objects, so as to ensure the reliability of dynamic image decoding.
For other contents in step 904, reference may be made to the relevant description in step 503, which is not described again in this embodiment of the present application.
In the embodiment of the present application, only the image area where a moving object is located changes in the dynamic image, while the image area where a still object is located does not. Therefore, after the image area where each moving object is located is extracted from each frame other than the first frame, and the coordinates in the dynamic image of the specified position within each such image area are determined, the extracted image areas and those coordinates are encoded into the code stream, and the dynamic image can be decoded subsequently. That is, the image areas where the moving objects are located are separated from those where the still objects are located, and only the former are encoded into the code stream, which improves encoding efficiency. In addition, the encoders already included at the encoding side can be reused directly, and only the resulting code streams need to be encapsulated, so no dedicated encoder has to be designed.
Referring to fig. 10, fig. 10 is a flowchart illustrating a second method for decoding a moving image according to an embodiment of the present application, where the decoding method corresponds to the encoding method shown in fig. 9. The decoding method includes the following steps.
Step 1001: and analyzing the first frame image from the code stream.
The content in step 1001 may refer to the related description in step 601, which is not described again in this embodiment of the present application.
Step 1002: one or more sub-image sequences and one or more coordinates of specified positions are analyzed from the code stream, the one or more sub-image sequences correspond to one or more moving objects one to one, the one or more coordinates of the specified positions correspond to the one or more moving objects one to one, and the specified positions refer to the specified positions in the image area where the corresponding moving objects are located.
The content in step 1002 may refer to the related description in step 602, which is not described herein again in this embodiment of the application.
Step 1003: and rendering and displaying an image area where the one or more moving objects are positioned in the first frame image based on the one or more sub-image sequences and the coordinates of the one or more specified positions to obtain a dynamic image.
The rendering and displaying process of the image area where each moving object is located in the first frame image is the same, so in some embodiments, one moving object may be selected from the one or more moving objects, and the image area where the selected moving object is located may be rendered and displayed according to the following operations until the image area where each moving object is located is rendered and displayed: and rendering and displaying the image area included in the sub-image sequence corresponding to the selected moving object in the first frame image according to the coordinate of the designated position of the image area in which the selected moving object is positioned in the dynamic image.
Since the encoding end directly encodes into the code stream the coordinates in the dynamic image of the specified position within the image area where each moving object is located, the decoding end, after analyzing the coordinates of the specified position corresponding to the selected moving object from the code stream, can directly render and display in the first frame image the image areas included in the sub-image sequence corresponding to the selected moving object, which improves the reconstruction speed of the image.
The specified position within the image region where a moving object is located may be the position with the minimum coordinates, the position with the maximum coordinates, or the geometric center point. Of course, other positions may also be adopted, which is not limited in the embodiments of the present application.
In addition, the rendering order of the image areas where the one or more moving objects are located is consistent with the order of their code streams within the whole code stream.
Under the condition that the coding end codes the number of the one or more moving objects into the code stream, the embodiment of the application can also analyze the number of the one or more moving objects from the code stream. In this way, by comparing the number of the one or more moving objects with the number of the one or more sub-image sequences, it can be determined whether the one or more sub-image sequences have the sub-image sequence with failed transmission, thereby improving the reliability of dynamic image decoding.
The one or more moving objects mentioned in the above steps 1001 to 1003 may be all moving objects among a plurality of objects included in the moving image. Of course, the one or more moving objects may also be part of the moving objects in the plurality of objects. That is, for moving objects in a moving image, it can be determined at the decoding end whether all the moving objects are in a moving state or a part of the moving objects need to be screened out again.
That is, an object selection instruction for selecting one or more objects from the plurality of objects included in the moving image is received, and the one or more objects selected by the object selection instruction are determined as the one or more moving objects in the above steps.
The object selection instruction may be triggered by a user based on the first frame image, for example, all moving objects in the dynamic image are marked in the first frame image, the user may select a part or all of the objects in all moving objects in the first frame image, and the selected object is one or more moving objects in the foregoing steps.
Other contents in step 1003 may refer to the related description in step 603, and this is not described again in this embodiment of the present application.
In this embodiment of the present application, after the first frame image is parsed from the code stream, the image areas where the one or more moving objects are located can be rendered and displayed in the first frame image according to the coordinates in the dynamic image of the specified position within the image area of each moving object. That is, when decoding a dynamic image, once the first frame image is decoded, only the image areas where the moving objects are located need to be decoded for the subsequent images, and the image areas where the static objects are located do not, which effectively reduces decoding complexity and power consumption. In addition, during display of the dynamic image, only the image areas where the moving objects are located are rendered and refreshed on the basis of the first frame image, which effectively reduces display power consumption.
Next, with reference to fig. 11, an exemplary description will be given of a method for encoding and decoding a moving image according to the embodiments shown in fig. 9 and 10. In this embodiment, it is only necessary to encode the start position of each moving object without encoding an image segmentation mask.
Encoding-end steps:
1) Determine the position of each moving object in the image using the image segmentation mask. The pixel value of object K in the image segmentation mask is denoted Mk, and the number of objects num_sub_sequences is initialized to 0. The position of each moving object is determined as follows (a minimal code sketch is given after these sub-steps):
loop over each moving object; assume object K is currently being extracted.
Scan the image segmentation mask line by line and record the coordinates of the pixels whose value is Mk; these coordinates form the set {(xk,1, yk,1), (xk,2, yk,2), …, (xk,N, yk,N)};
if the coordinate set is not empty, num_sub_sequences = num_sub_sequences + 1;
find the minimum and maximum values among the coordinates:
min_Xk=min{xk,1,xk,2,…,xk,N}
min_Yk=min{yk,1,yk,2,…,yk,N}
max_Xk=max{xk,1,xk,2,…,xk,N}
max_Yk=max{yk,1,yk,2,…,yk,N}
the region represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the region where object K is located. Extract this rectangular region from each frame to obtain a sub-image sequence;
the width of the sub-image sequence is max_Xk - min_Xk, and its height is max_Yk - min_Yk;
add min_Xk to position_top_left_x_list;
add min_Yk to position_top_left_y_list.
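The scan in step 1) amounts to computing a per-object bounding box from the mask. The following is a minimal sketch, assuming the mask is a 2-D numpy array whose pixels carry per-object label values; the function name and all variables other than the syntax elements above are illustrative.

```python
import numpy as np

def extract_object_regions(mask, labels):
    """Return per-object bounding boxes and the top-left coordinate lists."""
    position_top_left_x_list, position_top_left_y_list = [], []
    regions = []
    num_sub_sequences = 0
    for mk in labels:                    # loop over each moving object K
        ys, xs = np.nonzero(mask == mk)  # coordinates of pixels whose value is Mk
        if xs.size == 0:                 # empty coordinate set: object absent
            continue
        num_sub_sequences += 1
        min_xk, max_xk = int(xs.min()), int(xs.max())
        min_yk, max_yk = int(ys.min()), int(ys.max())
        # Sub-image width is max_Xk - min_Xk and height is max_Yk - min_Yk.
        regions.append((min_xk, min_yk, max_xk, max_yk))
        position_top_left_x_list.append(min_xk)
        position_top_left_y_list.append(min_yk)
    return regions, position_top_left_x_list, position_top_left_y_list, num_sub_sequences
```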
2) The system layer encodes the following information to indicate (a serialization sketch is given after step 5):
image _ codec _ type: an image encoder type for encoding the first frame image, for example, image _ codec _ type may take 0 or 1,0 for JPEG,1 for PNG; other types of encoders, such as BPG, may also be indicated, without limitation.
video _ codec _ type: a video encoder of the type that encodes sub-image sequences or the moving images themselves, for example h.265. Other types of encoders, such as h.264, may also be indicated, without limitation.
num _ sub _ sequences: number of subimage sequences
position _ top _ left _ x _ list: top left corner horizontal position coordinates List
position _ top _ left _ y _ list: top left corner vertical position coordinate list
3) Call the encoder corresponding to image_codec_type to encode the first frame image; an efficient image encoder may be used for this;
4) Call the video encoder corresponding to video_codec_type to encode each sub-image sequence;
5) Splice and encapsulate (and transmit) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard.
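For illustration only, the system-layer information of step 2) could be serialized as in the sketch below. The byte layout here is an assumption made for demonstration; the patent text encapsulates the streams via ISOBMFF rather than a custom header.

```python
import struct

def write_system_header(image_codec_type, video_codec_type, xs, ys):
    """Pack the system-layer fields of step 2) into bytes (layout is illustrative)."""
    assert len(xs) == len(ys)
    # image_codec_type and video_codec_type each fit in one byte under the
    # 0/1 convention above; num_sub_sequences is the length of the lists.
    buf = struct.pack('<BBH', image_codec_type, video_codec_type, len(xs))
    for x, y in zip(xs, ys):  # position_top_left_x_list / position_top_left_y_list
        buf += struct.pack('<HH', x, y)
    return buf
```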
Decoding-end steps:
1) The following information is decoded at the system layer:
image_codec_type
video_codec_type
num_sub_sequences
position_top_left_x_list
position_top_left_y_list
2) Select the image decoder corresponding to image_codec_type to decode and display the first frame image;
3) Select the video decoder corresponding to video_codec_type to decode the sub-image sequences. Specifically, for each sub-image j in the K-th frame image:
decoding the corresponding code stream to obtain the object reconstruction;
obtain the top-left corner position of the object: position_top_left_x_list[j], position_top_left_y_list[j];
4) Render and refresh the display of each moving object at its corresponding position. The rendering order of the objects is consistent with the order of their code streams in the whole code stream.
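Steps 3) and 4) together amount to the following loop, sketched under the assumption that decoded images are numpy-style arrays; decode_sub_image() stands in for whatever video decoder video_codec_type selects and is hypothetical.

```python
def render_frame_k(first_frame, sub_streams, k, xs, ys):
    """Compose frame k by refreshing only the moving regions on the first frame."""
    canvas = first_frame.copy()
    # Rendering order follows the order of the sub-streams in the whole code stream.
    for j, stream in enumerate(sub_streams):
        obj = decode_sub_image(stream, k)   # hypothetical sub-image decoder call
        h, w = obj.shape[:2]
        x, y = xs[j], ys[j]                 # top-left corner of object j
        canvas[y:y + h, x:x + w] = obj      # static areas keep first-frame pixels
    return canvas
```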
Optionally, the format of HEIF (the ISO/IEC 23008-12 standard) may be extended to encapsulate the above code streams. For example, a derived image sequence of type 'sovl' is added to the HEIF format, indicating that the derived image sequence is obtained by superimposing one or more sub-image sequences on the first frame image. The one or more sub-image sequences and the first frame image are specified by a sequence reference box. The one or more sub-image sequences are encapsulated in a track specified by the HEIF standard, and the first frame image is encapsulated in an item specified by the HEIF standard.
The syntax of the derived image sequence is as follows:
[Syntax table rendered as an image in the source; it defines output_width and output_height and, for each referenced sub-image sequence, horizontal_offset and vertical_offset fields.]
where output_width and output_height are the width and height of the output derived image sequence.
reference_count is determined by the sequence reference box and indicates the number of the one or more sub-image sequences.
horizontal_offset and vertical_offset represent the offsets of a sub-image sequence relative to the top-left corner of the first frame image.
[Syntax table rendered as an image in the source; it defines from_track_id, to_item_id, reference_count and to_track_id.]
where from_track_id represents the identifier of the derived image sequence, to_item_id represents the identifier of the first frame image, reference_count represents the number of the one or more sub-image sequences, and to_track_id represents the identifier of a sub-image sequence.
When the encoding end encapsulates the code streams by extending the HEIF (ISO/IEC 23008-12) format, this embodiment of the application can obtain the code stream of the first frame image through to_item_id and decode it to obtain the first frame image, obtain the code stream of each sub-image sequence according to to_track_id and decode it to obtain the sub-image sequences, and then superimpose the one or more sub-image sequences on the first frame image according to horizontal_offset and vertical_offset, in the order in which the to_track_id values were parsed, to obtain the reconstructed image of the derived image sequence, that is, the reconstructed dynamic image.
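A minimal sketch of this reconstruction is given below. decode_item() and decode_track() are hypothetical placeholders for a HEIF/ISOBMFF parser; only the overlay logic reflects the text above.

```python
def reconstruct_derived_sequence(heif_file, to_item_id, to_track_ids, offsets, k):
    """Rebuild frame k of the 'sovl' derived image sequence."""
    base = decode_item(heif_file, to_item_id)              # first frame image
    for track_id, (dx, dy) in zip(to_track_ids, offsets):  # parse order of to_track_id
        sub = decode_track(heif_file, track_id, k)         # sub-image sequence, frame k
        h, w = sub.shape[:2]
        base[dy:dy + h, dx:dx + w] = sub                   # horizontal/vertical offset
    return base
```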
User interactivity can be added on the basis of fig. 11: the user at the decoding end can choose to set a specific object in motion while the other areas remain still. Next, with reference to fig. 12, an exemplary description is given of such a method for encoding and decoding a dynamic image according to the embodiments shown in fig. 9 and 10.
Encoding-end steps:
1) Determine the position of each moving object in the image using the image segmentation mask. The pixel value of object K in the image segmentation mask is denoted Mk, and the number of objects num_sub_sequences is initialized to 0. The position of each moving object is determined as follows:
loop over each moving object; assume object K is currently being extracted.
Scan the image segmentation mask line by line and record the coordinates of the pixels whose value is Mk; these coordinates form the set {(xk,1, yk,1), (xk,2, yk,2), …, (xk,N, yk,N)};
if the coordinate set is not empty, num_sub_sequences = num_sub_sequences + 1;
find the minimum and maximum values among the coordinates:
min_Xk=min{xk,1,xk,2,…,xk,N}
min_Yk=min{yk,1,yk,2,…,yk,N}
max_Xk=max{xk,1,xk,2,…,xk,N}
max_Yk=max{yk,1,yk,2,…,yk,N}
the area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the area where object K is located. Extract this rectangular region from each frame to obtain a sub-image sequence;
the width of the sub-image sequence is max_Xk - min_Xk, and its height is max_Yk - min_Yk;
add min_Xk to position_top_left_x_list;
add min_Yk to position_top_left_y_list.
2) The system layer encodes the following information to indicate:
image _ codec _ type: an image encoder type for encoding the first frame image, for example, image _ codec _ type may take 0 or 1,0 for JPEG,1 for PNG; other types of encoders, such as BPG, may also be indicated, without limitation.
video _ codec _ type: video encoder types that encode sequences of sub-images or the dynamic images themselves, such as h.265. Other types of encoders, such as h.264, may also be indicated, without limitation.
num _ sub _ sequences: number of subimage sequences
position _ top _ left _ x _ list: top left horizontal position coordinate list
position _ top _ left _ y _ list: top left corner vertical position coordinate list
3) Call the encoder corresponding to image_codec_type to encode the first frame image; an efficient image encoder may be used for this;
4) Call the video encoder corresponding to video_codec_type to encode each sub-image sequence;
5) Splice and encapsulate (and transmit) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard.
Decoding-end steps:
1) The following information is decoded at the system layer:
image_codec_type
video_codec_type
num_sub_sequences
position_top_left_x_list
position_top_left_y_list
2) Select the image decoder corresponding to image_codec_type to decode and display the first frame image;
3) The user issues an instruction selecting whether a specific object or all objects should move;
4) Decode the sub-code stream of each selected object according to the instruction signal issued by the user, to obtain the reconstruction of the selected object;
5) Select the video decoder corresponding to video_codec_type to decode the sub-image sequences. Specifically, for the sub-image j corresponding to a target object in the K-th frame image:
decoding the corresponding code stream to obtain the object reconstruction;
obtain the top-left corner position of the object: position_top_left_x_list[j], position_top_left_y_list[j];
6) Render, display, and refresh each moving object at its corresponding position. The rendering order of the objects is consistent with the order of their code streams in the whole code stream.
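A minimal sketch of the selective decoding in steps 3) to 6), assuming decoded images are array-like; decode_sub_image() is a hypothetical placeholder, and `selected` holds the indices chosen by the user's instruction.

```python
def render_selected(first_frame, sub_streams, selected, xs, ys, k):
    """Refresh only the regions of the objects the user selected; others stay static."""
    canvas = first_frame.copy()
    for j, stream in enumerate(sub_streams):
        if j not in selected:              # object not selected: region stays as frame 1
            continue
        obj = decode_sub_image(stream, k)  # hypothetical sub-image decoder call
        h, w = obj.shape[:2]
        canvas[ys[j]:ys[j] + h, xs[j]:xs[j] + w] = obj
    return canvas
```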
Optionally, the format of HEIF (the ISO/IEC 23008-12 standard) may be extended by adding the syntax of the derived image sequence, and the first frame image and the sub-image sequences may be encapsulated accordingly; see the description above for details.
Referring to fig. 13, fig. 13 is a flowchart of a third method for encoding a moving image according to an embodiment of the present disclosure. In this method, the moving image sequence is a moving image, and the position indication information is an image division mask. The encoding method includes the following steps.
Step 1301: perform semantic segmentation on any frame image of the dynamic image to obtain an image segmentation mask, where the dynamic image includes a plurality of objects and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects.
The content of step 1301 may refer to the related description in step 501, and this is not described again in this embodiment of the present application.
Step 1302: encode the image segmentation mask and the dynamic image into the code stream, where the image segmentation mask is used to indicate the positions of the image areas where the one or more moving objects are located.
The image segmentation mask may be encoded into the code stream with an image encoder, and the dynamic image with a video encoder. For convenience of description, the image encoder used for the image segmentation mask is referred to as the second image encoder, and the video encoder used for the dynamic image is referred to as the second video encoder. The second video encoder may be the same as or different from the first video encoder described above.
It should be noted that the encoding end and the decoding end may agree on the second image encoder and the second video encoder in advance. Of course, the second image encoder and the second video encoder may also be selected by the user; in that case, their types also need to be encoded into the code stream. These image and video encoders may be encoders included in the encoding end itself.
Each code stream obtained by the above encoding needs to be encapsulated to obtain a combined code stream, which is then transmitted to the decoding end. The code streams may be encapsulated using the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard, which is not limited in this embodiment of the application.
In this embodiment of the application, only the image area where a moving object is located changes in the dynamic image, while the image area where a static object is located does not, and the image segmentation mask indicates the positions of the image areas where the one or more moving objects are located. Therefore, the image segmentation mask and the whole dynamic image are encoded into the code stream; after subsequent decoding, the image areas where the moving objects are located can be extracted from the dynamic image based on the image segmentation mask and rendered and displayed on the basis of the first frame image, without rendering and displaying the image areas where the static objects are located again, which reduces display power consumption. In addition, encoders already included in the encoding end can be directly reused, and only the resulting code streams need to be encapsulated, so no dedicated encoder needs to be designed.
Referring to fig. 14, fig. 14 is a flowchart of a third method for decoding a moving image according to an embodiment of the present disclosure, where the decoding method corresponds to the encoding method shown in fig. 13. The decoding method includes the following steps.
Step 1401: parse the first frame image from the code stream.
The content in step 1401 may refer to the related description in step 601, and is not described again in this embodiment of the present application.
Step 1402: parse an image segmentation mask and the dynamic image from the code stream, where the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects include the one or more moving objects, and the image segmentation mask is used to indicate the positions of the image areas where the one or more moving objects are located.
The implementation process of parsing the image segmentation mask from the code stream may refer to the related description in step 602, which is not described in detail in this embodiment of the application.
Based on the above description, the video encoder employed for the moving picture is referred to as a second video encoder, and for convenience of description, the video decoder employed for the moving picture may also be referred to as a second video decoder.
The second video encoder may be agreed on in advance by the encoding end and the decoding end, or selected by the user during encoding. When it is agreed on in advance, the second video decoder is likewise agreed on in advance, and the dynamic image can be parsed from the code stream directly using the agreed second video decoder. When the second video encoder is selected by the user, the type of the second video encoder needs to be parsed from the code stream, the second video decoder is determined based on that type, and the dynamic image is then parsed from the code stream using the determined second video decoder.
Step 1403: render and display, in the first frame image, the image areas where the one or more moving objects are located, based on the image segmentation mask and the decoded dynamic image, to obtain the dynamic image.
The rendering and display process is the same for the image area of each moving object. Therefore, in some embodiments, one moving object may be selected from the one or more moving objects and the image area where it is located rendered and displayed according to the following operations, repeated until the image area of every moving object has been rendered and displayed: determine the position of the image area where the selected moving object is located based on the image segmentation mask; extract, based on that position, the image area where the selected moving object is located from each frame image of the dynamic image except the first frame image; and render and display, in the first frame image and according to that position, the image area where the selected moving object is located in each frame image of the dynamic image.
The implementation process of determining the position of the image region where the selected moving object is located based on the image segmentation mask may refer to the related description in step 603, which is not described in detail in this embodiment of the present application. The implementation process of extracting the image area where the selected moving object is located from each frame of image in the dynamic image except the first frame of image based on the position of the image area where the selected moving object is located may refer to the related description in step 502 above, and details of this embodiment of the present application are also omitted here.
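A minimal sketch of these operations for one selected moving object, assuming the mask and decoded frames are numpy arrays and the object's pixels carry the label value mk:

```python
import numpy as np

def overlay_object(first_frame, decoded_frames, mask, mk):
    """Locate object mk via the mask, then refresh only its region on the first frame."""
    ys, xs = np.nonzero(mask == mk)
    x0, x1 = int(xs.min()), int(xs.max()) + 1  # max coordinates are inclusive
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    rendered = []
    for frame in decoded_frames[1:]:           # every frame except the first
        canvas = first_frame.copy()
        canvas[y0:y1, x0:x1] = frame[y0:y1, x0:x1]  # render the region in place
        rendered.append(canvas)
    return rendered
```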
It should be noted that the rendering order of the image areas where the one or more moving objects are located is consistent with the order in which the code streams of those image areas appear in the whole code stream.
The one or more moving objects mentioned in steps 1401 to 1403 above may be all of the moving objects among the plurality of objects included in the dynamic image, or only some of them. That is, for the moving objects in a dynamic image, the decoding end can determine whether all of them should be set in motion or whether a subset should first be selected.
That is, an object selection instruction is received for selecting one or more objects from the plurality of objects included in the dynamic image, and the one or more objects selected by the instruction are determined as the one or more moving objects in the above steps.
The object selection instruction may be triggered by a user based on the first frame image. For example, all moving objects in the dynamic image are marked in the first frame image; the user may select some or all of those moving objects, and the selected objects are the one or more moving objects in the foregoing steps.
In this embodiment of the application, since the image segmentation mask indicates the positions of the image areas where the one or more moving objects are located in the dynamic image, after the first frame image is parsed from the code stream, the image area where each moving object is located can be extracted, according to its position, from each frame image of the dynamic image except the first frame image, and the image areas where the one or more moving objects are located can then be rendered and displayed in the first frame image. That is, during display of the dynamic image, only the image areas where the moving objects are located are rendered and refreshed on the basis of the first frame image, which effectively reduces display power consumption.
Next, the method for encoding and decoding a dynamic image provided by the embodiments shown in fig. 13 and 14 is described by way of example. In this embodiment, the user may choose to set a specific object in motion while the other areas remain still.
Encoding-end steps:
1) The user selects the encoder types to use, and the following syntax elements are encoded at the system layer to indicate them:
mask _ codec _ type: image encoder types that encode image segmentation masks, such as JPEG or PNG; other types of encoders, such as BPG, may also be indicated, without limitation.
video _ codec _ type: video encoder types that encode sequences of sub-images or the dynamic images themselves, such as h.265. Other types of encoders, such as h.264, may also be indicated, without limitation.
2) Call the encoder corresponding to mask_codec_type to encode the image segmentation mask; a common image encoder may be used for this;
3) Treat the dynamic image as a complete video and call the video encoder corresponding to video_codec_type to encode it;
4) Splice and encapsulate (and transmit) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard, or transmit the encoded code stream of the image segmentation mask in an SEI message.
Decoding-end steps:
1) The system layer decodes the following information:
mask_codec_type;
video_codec_type。
2) Select the decoder corresponding to mask_codec_type to decode the image segmentation mask;
3) Select the decoder corresponding to video_codec_type to decode and reconstruct the dynamic image;
4) The user issues an instruction selecting whether a specific object or all objects should move;
5) Render, display, and refresh the specified objects at their corresponding positions according to the instruction signal issued by the user.
Referring to fig. 15, fig. 15 is a flowchart of a fourth method for encoding a dynamic image according to an embodiment of the present application. In this method, the moving image sequence is the dynamic image, and the position indication information includes an image segmentation mask. The encoding method includes the following steps.
Step 1501: perform semantic segmentation on any frame image of the dynamic image to obtain an image segmentation mask, where the dynamic image includes a plurality of objects and the image segmentation mask includes a plurality of image regions in one-to-one correspondence with the plurality of objects.
The content of step 1501 may refer to the related description in step 501, which is not described again in this embodiment of the present application.
Step 1502: determine, based on the image segmentation mask, a plurality of segmentation regions in one-to-one correspondence with the plurality of objects.
In some embodiments, the position area where each of the plurality of objects is located may be determined based on the image segmentation mask. When the position area of any of the objects does not contain an integer number of coding tree units (CTUs), the boundary of that position area is expanded so that it contains an integer number of CTUs. The position areas of the plurality of objects after the expansion processing are determined as the plurality of segmentation regions.
That is, after the expansion processing, the position area where each object is located contains an integer number of CTUs, and the expanded position areas are determined as the segmentation regions. Each of the plurality of segmentation regions therefore contains an integer number of CTUs.
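A minimal sketch of the expansion, assuming a CTU size of 64 and an inclusive input bounding box; edges are snapped outward to the CTU grid and clipped to the image, where partial CTUs can remain only at the picture border.

```python
def align_to_ctu(x0, y0, x1, y1, width, height, ctu=64):
    """Expand the inclusive box (x0, y0)-(x1, y1) outward until every edge
    lies on a CTU boundary; returns exclusive (left, top, right, bottom)."""
    left = (x0 // ctu) * ctu                      # snap left/top edges down
    top = (y0 // ctu) * ctu
    right = min(((x1 // ctu) + 1) * ctu, width)   # snap right/bottom edges up
    bottom = min(((y1 // ctu) + 1) * ctu, height)
    return left, top, right, bottom
```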
Step 1503: divide each frame image of the dynamic image except the first frame image into regions according to the plurality of segmentation regions, to obtain a plurality of image regions.
Since the plurality of segmentation regions are in one-to-one correspondence with the plurality of objects, after each frame image of the dynamic image except the first frame image is divided into regions, each frame image includes image regions in one-to-one correspondence with the plurality of segmentation regions, that is, with the plurality of objects.
Step 1504: determine the object state corresponding to each of the plurality of segmentation regions, where the object state is either a static state or a motion state.
Since each segmentation region corresponds to one object, the state of the object corresponding to a segmentation region can be determined as the object state of that segmentation region.
Step 1505: encode the first frame image of the dynamic image, the plurality of image regions, the object state corresponding to each of the plurality of segmentation regions, and the image segmentation mask into the code stream, where the image segmentation mask is used to indicate the positions of the image areas where the one or more moving objects are located.
The content of encoding the first frame image and the image segmentation mask in the dynamic image may refer to the related description in step 503, which is not described in detail in this embodiment of the present application.
For the plurality of image regions, the encoding process is as follows: encode each of the plurality of image regions into the code stream as a coding block, or encode the area formed by each row of CTUs in each image region into the code stream as a coding block. In either case, the position area of the coding block that is referenced must lie within the position area of the coding block that references it.
Because each image region contains an integer number of CTUs, the whole image region can be encoded into the code stream independently as a coding block (a tile), or the area formed by each row of CTUs in the image region can be encoded independently as a coding block (a slice), so that subsequent decoding can be performed independently.
In addition, decoding a given coding block may require referring to a coding block in an earlier frame; that is, the decoding of a coding block in the current frame depends on a coding block in the reference frame. For decoding to succeed, the position area of the coding block in the reference frame must therefore be constrained to lie within the position area of the coding block in the current frame, so that the current coding block can be decoded on the basis of the referenced coding block.
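The constraint can be expressed as a simple containment check, sketched below; this mirrors the idea behind motion-constrained tile sets rather than any particular encoder's API.

```python
def mv_is_allowed(block, mv, ref_tile):
    """Check that the samples referenced by motion vector mv stay inside the
    slice/tile at the corresponding position of the reference frame."""
    bx, by, bw, bh = block       # current coding block: x, y, width, height
    tx, ty, tw, th = ref_tile    # slice/tile region allowed as a reference
    rx, ry = bx + mv[0], by + mv[1]
    return (tx <= rx and ty <= ry and
            rx + bw <= tx + tw and ry + bh <= ty + th)
```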
Each code stream obtained by the above encoding needs to be encapsulated to obtain a combined code stream, which is then transmitted to the decoding end. The code streams may be encapsulated using the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard, which is not limited in this embodiment of the application.
In this embodiment of the application, only the image area where a moving object is located can change in the dynamic image, while the image area where a static object is located cannot. After each frame image of the dynamic image except the first frame image is divided into regions by the plurality of segmentation regions, the first frame image, the plurality of divided image regions, the object state corresponding to each segmentation region, and the image segmentation mask are encoded into the code stream, so that the dynamic image can be decoded subsequently. That is, the image areas where the moving objects are located and the image areas where the static objects are located are divided and then encoded into the code stream separately, so subsequent decoding only needs to decode the image regions corresponding to the motion state, not those corresponding to the static state, which improves decoding efficiency. In addition, encoders already included in the encoding end can be directly reused, and only the resulting code streams need to be encapsulated, so no dedicated encoder needs to be designed.
Referring to fig. 16, fig. 16 is a flowchart of a fourth decoding method for a moving picture according to an embodiment of the present application, where the decoding method corresponds to the encoding method shown in fig. 15. The decoding method includes the following steps.
Step 1601: parse the first frame image from the code stream.
The content in step 1601 may refer to the relevant description in step 601, which is not described again in this embodiment of the application.
Step 1602: parse an image segmentation mask from the code stream, where the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects include the one or more moving objects, and the image segmentation mask is used to indicate the positions of the image areas where the one or more moving objects are located.
The content in step 1602 may refer to the related description in step 602, and is not described again in this embodiment of the application.
Step 1603: determine, based on the image segmentation mask, a plurality of segmentation regions in one-to-one correspondence with the plurality of objects.
The content of step 1603 may refer to the related description in step 1502, which is not described in detail in this embodiment of the application.
Step 1604: parse, from the code stream, the object state corresponding to each of the plurality of segmentation regions, where the object state is either a static state or a motion state.
Step 1605: parse, from the code stream, the image regions divided by the segmentation regions corresponding to the motion state, based on the object state corresponding to each of the plurality of segmentation regions.
Since each segmentation region corresponds to one object state, the object state can be either a motion state or a static state, and each image region in the code stream is divided by a segmentation region, the image regions divided by the segmentation regions corresponding to the motion state can be parsed from the code stream directly.
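A minimal sketch of steps 1604 and 1605, where parse_object_states() and decode_region() are hypothetical placeholders for the system layer and the video decoder:

```python
MOVING, STATIC = 1, 0

def decode_moving_regions(stream, regions):
    """Decode only the image regions whose segmentation region is in the motion state."""
    states = parse_object_states(stream, len(regions))  # hypothetical: one state per region
    decoded = {}
    for idx, (region, state) in enumerate(zip(regions, states)):
        if state == MOVING:                               # static regions are skipped
            decoded[idx] = decode_region(stream, region)  # hypothetical region decoder
    return decoded
```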
Step 1606: render and display, in the first frame image, the image areas where the one or more moving objects are located, based on the image regions divided by the segmentation regions corresponding to the motion state, to obtain the dynamic image.
That is, the image regions divided by the segmentation regions corresponding to the motion state are rendered and displayed in the first frame image, so that the dynamic image is obtained.
It should be noted that the rendering order of the image areas where the one or more moving objects are located is consistent with the order in which the code streams of those image areas appear in the whole code stream.
The one or more moving objects mentioned in steps 1601 to 1606 above may be all of the moving objects among the plurality of objects included in the dynamic image, or only some of them. That is, for the moving objects in a dynamic image, the decoding end can determine whether all of them should be set in motion or whether a subset should first be selected.
That is, an object selection instruction is received for selecting one or more objects from the plurality of objects included in the dynamic image, and the one or more objects selected by the instruction are determined as the one or more moving objects in the above steps.
The object selection instruction may be triggered by a user based on the first frame image. For example, all moving objects in the dynamic image are marked in the first frame image; the user may select some or all of those moving objects, and the selected objects are the one or more moving objects in the above steps.
In this embodiment of the application, after the first frame image is parsed from the code stream, the image regions divided by the segmentation regions corresponding to the motion state can be parsed from the code stream according to the object state corresponding to each segmentation region, and the image regions divided by the segmentation regions corresponding to the static state do not need to be parsed, which effectively reduces decoding complexity and power consumption. In addition, during display of the dynamic image, only the image regions divided by the segmentation regions corresponding to the motion state are rendered and refreshed on the basis of the first frame image, which effectively reduces display power consumption.
Next, with reference to fig. 17, an exemplary description is given of the dynamic image encoding and decoding method according to the embodiments shown in fig. 15 and 16. This embodiment uses the image partitioning of existing video coding standards: at the encoding end, each frame image of the dynamic image is divided into several slices or tiles in a fixed pattern using the image segmentation mask, and each slice/tile can be decoded independently.
Encoding-end steps:
1) The user selects the encoder types to use, and the following syntax elements are encoded at the system layer to indicate them:
image_codec_type: the type of image encoder used to encode the first frame image. For example, image_codec_type may take the value 0 or 1, where 0 indicates JPEG and 1 indicates PNG; other encoder types, such as BPG, may also be indicated, which is not limited here.
mask_codec_type: the type of image encoder used to encode the image segmentation mask, such as JPEG or PNG; other encoder types, such as BPG, may also be indicated, which is not limited here.
video _ codec _ type: the type of video encoder that encodes the moving picture itself, such as h.265. Other types of encoders, such as h.264, may also be indicated, without limitation.
2) Call the encoder corresponding to image_codec_type to encode the first frame image; an efficient image encoder may be used for this;
3) Call the encoder corresponding to mask_codec_type to encode the image segmentation mask; a mainstream image encoder may be used for this;
4) Divide each frame image of the dynamic image into slices or tiles in a fixed pattern using the image segmentation mask.
First determine the position of each moving object in the image using the image segmentation mask. The pixel value of object K in the image segmentation mask is denoted Mk. The position of an object is determined as follows:
assume the position of object K is currently being determined: scan the image segmentation mask line by line and record the coordinates of the pixels whose value is Mk; these coordinates form the set {(xk,1, yk,1), (xk,2, yk,2), …, (xk,N, yk,N)};
find the minimum and maximum values among the coordinates:
min_Xk=min{xk,1,xk,2,…,xk,N}
min_Yk=min{yk,1,yk,2,…,yk,N}
max_Xk=max{xk,1,xk,2,…,xk,N}
max_Yk=max{yk,1,yk,2,…,yk,N}
the area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the position where object K is located;
boundary processing: if the area determined by the coordinate set does not contain an integer number of CTUs (its top, bottom, left, and right edges are not on CTU boundaries), pad several rows/columns of pixels upward, downward, leftward, and rightward respectively, so that the area contains an integer number of CTUs;
if tiles are used for area division, the rectangular area can directly serve as an independent tile; if slices are used, the area formed by each row of CTUs in the rectangular area serves as an independent slice;
5) Call the video encoder corresponding to video_codec_type to encode each divided slice or tile independently. During encoding, the inter-prediction motion vector range must be constrained so that references fall within the slice/tile at the corresponding position of the reference image; for an H.265 encoder, MCTS (motion-constrained tile sets) can be used;
6) Splice and encapsulate (and transmit) the code streams obtained in the above steps according to the ISOBMFF (ISO/IEC 14496-12, MPEG-4 Part 12) standard.
Decoding-end steps:
1) The system layer decodes the following information:
image_codec_type;
mask_codec_type;
video_codec_type。
2) The system layer extracts each sub-code stream from the code stream for subsequent decoding;
3) Call the image decoder corresponding to image_codec_type to decode and display the first frame image;
4) Call the decoder corresponding to mask_codec_type to decode the image segmentation mask;
5) According to the image segmentation mask, the system layer controls the decoder to decode only the slices or tiles corresponding to objects in the motion state;
6) Divide the dynamic image into slices or tiles in the fixed pattern using the image segmentation mask;
first determine the position of each moving object in the image using the image segmentation mask. The pixel value of object K in the image segmentation mask is denoted Mk. The position of an object is determined as follows:
assume the position of object K is currently being determined: scan the image segmentation mask line by line and record the coordinates of the pixels whose value is Mk; these coordinates form the set {(xk,1, yk,1), (xk,2, yk,2), …, (xk,N, yk,N)};
find the minimum and maximum values among the coordinates:
min_Xk=min{xk,1,xk,2,…,xk,N}
min_Yk=min{yk,1,yk,2,…,yk,N}
max_Xk=max{xk,1,xk,2,…,xk,N}
max_Yk=max{yk,1,yk,2,…,yk,N}
the area represented by the set {(x, y) | min_Xk <= x <= max_Xk, min_Yk <= y <= max_Yk} is the position where object K is located;
boundary processing: if the area formed by the set does not contain an integer number of CTUs (its top, bottom, left, and right edges are not on CTU boundaries), pad several rows/columns of pixels upward, downward, leftward, and rightward respectively, so that the area contains an integer number of CTUs;
if tiles are used for area division, the rectangular area can directly serve as an independent tile; if slices are used, the area formed by each row of CTUs in the rectangular area serves as an independent slice;
7) Using the segmentation regions and object states obtained from the image segmentation mask, the system layer skips the slices/tiles corresponding to objects in the static state and decodes only the slices/tiles corresponding to objects in the motion state;
8) Render, display, and refresh each object at its corresponding position. The rendering order of the objects is consistent with the order of the slice/tile code streams in the whole code stream.
In addition, user interactivity can be added on the basis of fig. 17: the user at the decoding end can choose to set a specific object in motion while the other areas remain still.
Fig. 18 is a schematic structural diagram of a dynamic image encoding apparatus according to an embodiment of the present disclosure, where the encoding apparatus may be implemented as part or all of an encoding end device by software, hardware, or a combination of the two, and the encoding end device may be the source device shown in fig. 1 or the cloud server shown in fig. 2. Referring to fig. 18, the apparatus includes: a semantic segmentation module 1801, an image sequence extraction module 1802, a position indication information determination module 1803, and a first encoding module 1804.
The semantic segmentation module 1801 is configured to perform semantic segmentation on any frame of image in a dynamic image to obtain an image segmentation mask, where the dynamic image includes multiple objects, and the image segmentation mask includes multiple image regions corresponding to the multiple objects one to one. For the detailed implementation process, reference is made to corresponding contents in the foregoing embodiments, and details are not repeated here. Among them, the semantic segmentation module 1801 corresponds to the semantic segmentation module 111 in fig. 3.
An image sequence extraction module 1802 for determining a moving image sequence based on the moving images, each frame of image in the moving image sequence including an image area where one or more moving objects of the plurality of objects are located. For the detailed implementation process, reference is made to corresponding contents in the foregoing embodiments, and details are not repeated here. Wherein the image sequence extraction module 1802 corresponds to the image sequence extraction module 112 in fig. 3.
A position indication information determining module 1803, configured to determine, based on the image segmentation mask, position indication information indicating a position of an image region where the one or more moving objects are located. For the detailed implementation process, reference is made to corresponding contents in the above embodiments, and details are not repeated here. A module corresponding to the location indication information determining module 1803 is not shown in fig. 3.
A first coding module 1804, configured to code the motion image sequence and the position indication information into the code stream. For the detailed implementation process, reference is made to corresponding contents in the above embodiments, and details are not repeated here. Among them, the first encoding module 1804 corresponds to the position indication information encoding module 113 and the first video encoding module 115 in fig. 3.
Optionally, the motion image sequence comprises one or more sub-image sequences, and the position indication information is an image segmentation mask;
the image sequence extraction module 1802 includes:
and the image sequence extraction sub-module is used for extracting the one or more sub-image sequences based on the image segmentation mask and the dynamic image, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects.
Optionally, the moving image sequence comprises one or more sub-image sequences, and the position indication information comprises coordinates of one or more specified positions;
the image sequence extraction module 1802 includes:
an image sequence extraction sub-module, configured to extract the one or more sub-image sequences based on an image segmentation mask and a dynamic image, where the one or more sub-image sequences correspond to the one or more moving objects one to one;
the location indication information determining module 1803 includes:
and the position coordinate determination sub-module is used for determining the coordinates of the specified position in the dynamic image in the image area where each moving object in the one or more moving objects is positioned based on the image segmentation mask.
Optionally, the image sequence extraction sub-module includes:
a selection sub-module, configured to select a moving object from the one or more moving objects, determine a sub-image sequence corresponding to the selected moving object by the following modules until the sub-image sequence corresponding to each moving object is determined:
the position area determining submodule is used for determining the position area where the selected moving object is located based on the image segmentation mask;
and the image area extraction submodule is used for extracting the image area where the selected moving object is located from each frame of image except the first frame of image in the dynamic image based on the position area to obtain a sub-image sequence corresponding to the selected moving object.
Optionally, the location area determining submodule is specifically configured to:
scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, wherein the pixel coordinate set comprises coordinates of a plurality of pixel points;
and determining the position area formed by the pixel coordinate set as the position area where the selected moving object is positioned.
Optionally, the image region extraction sub-module is specifically configured to:
extracting an image area located in the position area from each frame image except the first frame image in the dynamic image;
or,
or, expand the position area so that the expanded position area is a rectangular area, and extract the image area located in the expanded position area from each frame image of the dynamic image except the first frame image.
Alternatively, the designated position is a position with the smallest coordinates, or a position with the largest coordinates.
Optionally, the apparatus further comprises:
and the second coding module is used for coding the number of the one or more moving objects into a code stream. Wherein, the corresponding modules of the second encoding module are not shown in fig. 3.
Alternatively, the moving image sequence is a moving image, and the position indication information is an image division mask.
Optionally, the apparatus further comprises:
a divided region determining module for determining a plurality of divided regions corresponding to the plurality of objects one to one based on the image division mask;
the region dividing module is used for dividing the regions of each frame of image except the first frame of image in the dynamic image according to the plurality of divided regions to obtain a plurality of image regions;
the object state determining module is used for determining an object state corresponding to each of the plurality of divided areas, and the object state comprises a static state or a motion state;
the first encoding module includes:
the image area coding submodule is used for coding the plurality of image areas into a code stream;
the device also includes:
and the third encoding module is used for encoding the object state corresponding to each partition area in the plurality of partition areas into the code stream.
The modules corresponding to the segmentation region determining module, the region dividing module, the object state determining module and the third encoding module are not shown in fig. 3. The image region encoding sub-module corresponds to the first video encoding module 115 in fig. 3.
Optionally, the segmentation area determination module is specifically configured to:
determining a position area where each object in the plurality of objects is located based on the image segmentation mask;
when the position area of any of the plurality of objects does not contain an integer number of coding tree units (CTUs), expand the boundary of that position area so that it contains an integer number of CTUs;
and determining the position areas where the plurality of objects are positioned after the expansion processing as the plurality of segmentation areas.
Optionally, the image region encoding sub-module is specifically configured to:
coding each image area in the plurality of image areas into a code stream as a coding block;
or,
coding an area formed by each row of CTUs in each image area in the plurality of image areas into a code stream as a coding block;
where the position area of the coding block that is referenced lies within the position area of the coding block that references it.
Optionally, the apparatus further comprises:
and the fourth encoding module is used for encoding the first frame image of the dynamic image into a code stream. Wherein the fourth encoding module corresponds to the image encoding module 114 in fig. 3.
Because only the image area where a moving object is located can change in the dynamic image, while the image area where a static object is located cannot, each frame image of the moving image sequence includes the image areas where one or more of the plurality of objects are located, and the position indication information indicates the positions of those image areas. Encoding the moving image sequence and the position indication information into the code stream therefore allows the dynamic image to be decoded subsequently without encoding the image areas where the static objects are located into the code stream, which improves coding efficiency.
It should be noted that: in the encoding apparatus for moving pictures provided in the above embodiments, only the division of the functional modules is illustrated when encoding moving pictures, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the above described functions. In addition, the coding apparatus for a dynamic image and the coding method for a dynamic image provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments, and are not described herein again.
Fig. 19 is a schematic structural diagram of a moving picture decoding apparatus according to an embodiment of the present application, where the decoding apparatus may be implemented by software, hardware, or a combination of the two as part or all of a decoding-side device, and the decoding-side device may be the destination device shown in fig. 1 or the terminal device shown in fig. 2. Referring to fig. 19, the apparatus includes: an image decoding module 1901, a first decoding module 1902, and an image composition module 1903.
The image decoding module 1901 is configured to parse a first frame image from the code stream. For the detailed implementation process, reference is made to corresponding contents in the above embodiments, and details are not repeated here. Among them, the image decoding module 1901 corresponds to the image decoding module 212 in fig. 4.
The first decoding module 1902 is configured to parse a moving image sequence and position indication information from the code stream, where each frame of image in the moving image sequence includes an image area where one or more moving objects are located, and the position indication information is used to indicate positions of the image areas where the one or more moving objects are located. For the detailed implementation process, reference is made to corresponding contents in the above embodiments, and details are not repeated here. The first decoding module 1902 corresponds to the position indication information decoding module 211 and the first video decoding module 213 in fig. 4.
An image composition module 1903, configured to render and display an image area where the one or more moving objects are located in the first frame image based on the moving image sequence and the position indication information, so as to obtain a dynamic image. For the detailed implementation process, reference is made to corresponding contents in the above embodiments, and details are not repeated here. Among other things, the image synthesis module 1903 corresponds to the image synthesis module 214 in fig. 4.
Optionally, the moving image sequence comprises one or more sub-image sequences, the one or more sub-image sequences corresponding to the one or more moving objects one to one;
the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects including the one or more moving objects.
Optionally, the image synthesis module 1903 comprises:
the selection sub-module is used for selecting one moving object from the one or more moving objects, and rendering and displaying the image area where the selected moving object is located through the following modules until the image area where each moving object is located is rendered and displayed:
the position determining submodule is used for determining the position of an image area where the selected moving object is located based on the image segmentation mask;
and the rendering and displaying sub-module is used for rendering and displaying the image area included in the sub-image sequence corresponding to the selected moving object in the first frame image according to the position of the image area where the selected moving object is located.
Optionally, the position determination submodule is specifically configured to:
scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, wherein the pixel coordinate set comprises coordinates of a plurality of pixel points;
the position area constituted by the pixel coordinate set is determined as the position of the image area where the selected moving object is located; or, the position area constituted by the pixel coordinate set is expanded so that the expanded position area is a rectangular area, and the expanded position area is determined as the position of the image area where the selected moving object is located.
Optionally, the moving image sequence comprises one or more sub-image sequences, the one or more sub-image sequences corresponding to the one or more moving objects one to one;
the position indication information includes coordinates in the dynamic image of a specified position within an image area where each of the one or more moving objects is located.
Optionally, the image synthesis module 1903 is specifically configured to:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
and according to the coordinates of the designated position in the image area where the selected moving object is positioned in the dynamic image, rendering and displaying the image area included in the sub-image sequence corresponding to the selected moving object in the first frame image.
Alternatively, the designated position is a position with the smallest coordinates, or a position with the largest coordinates.
Optionally, the apparatus further comprises:
and the second decoding module is used for analyzing the number of the one or more moving objects from the code stream. For the detailed implementation process, reference is made to corresponding contents in the above embodiments, and details are not repeated here. Wherein, the corresponding modules of the second decoding module are not shown in fig. 4.
Optionally, the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects, where the plurality of objects include the one or more moving objects.
Optionally, the image synthesis module 1903 is specifically configured to:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
determining the position of an image area where the selected moving object is located based on the image segmentation mask;
extracting the image area where the selected moving object is located from each frame image of the dynamic image except the first frame image, based on the position of that image area;
rendering and displaying, in the first frame image, the image area where the selected moving object is located in each frame image of the dynamic image, according to the position of that image area.
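The per-frame refresh described above can be sketched as follows, reusing the hypothetical locate_object helper from the earlier sketch; `show` stands in for whatever display routine the platform provides, the frames are assumed to be NumPy arrays of equal size, and every object id is assumed to appear in the mask:

def refresh_moving_areas(frames, mask, object_ids, show):
    # Keep the decoded first frame as the static background and re-render
    # only the moving-object areas from each subsequent frame.
    base = frames[0].copy()
    regions = {oid: locate_object(mask, oid) for oid in object_ids}
    show(base)
    for frame in frames[1:]:
        for top, left, bottom, right in regions.values():
            base[top:bottom + 1, left:right + 1] = \
                frame[top:bottom + 1, left:right + 1]
        show(base)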
Optionally, the position indication information is an image segmentation mask, where the image segmentation mask includes a plurality of image regions in one-to-one correspondence with a plurality of objects, and the plurality of objects include the one or more moving objects;
the first decoding module 1902 includes:
A segmentation region determining sub-module, configured to determine, based on the image segmentation mask, a plurality of segmentation regions in one-to-one correspondence with the plurality of objects;
An object state determining sub-module, configured to parse, from the code stream, an object state corresponding to each of the plurality of segmentation regions, where the object state is a static state or a motion state;
An image region decoding sub-module, configured to parse, from the code stream and based on the object state corresponding to each of the plurality of segmentation regions, the image regions into which the segmentation regions in the motion state are divided, so as to obtain the moving image sequence.
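A hedged sketch of this selective decoding, where parse_state and decode_region stand in for entropy-decoding primitives that this application does not spell out, and locate_object is the hypothetical helper sketched earlier:

def decode_moving_image_sequence(bitstream, mask, object_ids,
                                 parse_state, decode_region):
    # Determine the segmentation regions from the mask, parse each
    # region's object state, and decode only the regions in motion.
    sequence = []
    for oid in object_ids:
        region = locate_object(mask, oid)
        state = parse_state(bitstream, oid)    # 'static' or 'moving'
        if state == 'moving':
            sequence.append(decode_region(bitstream, region))
    return sequence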
Optionally, the segmentation region determination sub-module is specifically configured to:
determining a position area where each object in the plurality of objects is located based on the image segmentation mask;
when the position area where any object of the plurality of objects is located does not contain an integer number of CTUs, expanding the boundary of that position area so that it contains an integer number of CTUs;
determining the position areas where the plurality of objects are located after the expansion processing as the plurality of segmentation regions.
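As a sketch, such a boundary expansion can snap the position area outward to the CTU grid; the 64-pixel CTU size below is an assumption (codecs such as HEVC and VVC use CTUs of 64 or 128 pixels):

def align_to_ctu_grid(top, left, bottom, right, ctu_size=64):
    # Round the top-left corner down and the bottom-right corner up so
    # that the expanded area covers an integer number of CTUs.
    top = (top // ctu_size) * ctu_size
    left = (left // ctu_size) * ctu_size
    bottom = (bottom // ctu_size + 1) * ctu_size - 1
    right = (right // ctu_size + 1) * ctu_size - 1
    return top, left, bottom, right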
Optionally, the apparatus further comprises:
An instruction receiving module, configured to receive an object selection instruction for selecting one or more objects from the plurality of objects included in the dynamic image;
A moving object determining module, configured to determine the one or more objects selected by the object selection instruction as the one or more moving objects.
Optionally, the apparatus further comprises:
A third decoding module, configured to parse, from the code stream, the encoder type used for encoding;
A decoder type determining module, configured to determine the corresponding decoder type according to the parsed encoder type (a sketch of this mapping follows the note below).
The modules corresponding to the instruction receiving module, the moving object determining module, the third decoding module, and the decoder type determining module are not shown in Fig. 4.
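A minimal sketch of the encoder-to-decoder mapping; the codec names in the table are illustrative assumptions, not specified by this application:

def select_decoder(encoder_type: str) -> str:
    # Map the encoder type parsed from the code stream to a decoder type
    # that can decode the resulting bitstream.
    decoders = {
        'H.264/AVC': 'avc-decoder',
        'H.265/HEVC': 'hevc-decoder',
        'H.266/VVC': 'vvc-decoder',
    }
    if encoder_type not in decoders:
        raise ValueError(f'unsupported encoder type: {encoder_type}')
    return decoders[encoder_type]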
In the method for decoding a dynamic image provided in the embodiments of this application, after the first frame image is decoded, only the image areas where the moving objects are located need to be decoded for the subsequent images; the image areas where static objects are located do not need to be decoded, which effectively reduces decoding complexity and power consumption. In addition, during display of the dynamic image, only the image areas where the moving objects are located are rendered and refreshed on the basis of the first frame image, which effectively reduces display power consumption.
It should be noted that the decoding apparatus for dynamic images provided in the above embodiments is described using only the above division of functional modules as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the decoding apparatus for dynamic images and the decoding method for dynamic images provided in the above embodiments belong to the same concept; their specific implementation processes are described in the method embodiments and are not repeated here.
Fig. 20 is a schematic block diagram of a codec device 2000 used in the embodiments of this application. The codec device 2000 may include a processor 2001, a memory 2002, and a bus system 2003, among other components. The processor 2001 and the memory 2002 are connected by the bus system 2003; the memory 2002 is configured to store instructions, and the processor 2001 is configured to execute the instructions stored in the memory 2002 so as to perform the various dynamic image encoding or decoding methods described in the embodiments of this application. To avoid repetition, details are not described here.
In this embodiment, the processor 2001 may be a Central Processing Unit (CPU), or the processor 2001 may be another general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 2002 may comprise a ROM device or a RAM device. Any other suitable type of memory device may also be used as the memory 2002. The memory 2002 may include code and data 20021 that are accessed by the processor 2001 using the bus 2003. The memory 2002 may further include an operating system 20023 and applications 20022, the applications 20022 including at least one program that allows the processor 2001 to perform the method of encoding or decoding a dynamic image described in the embodiments of this application. For example, the applications 20022 may include applications 1 to N, which further include a dynamic image encoding or decoding application (referred to simply as a dynamic image codec application) that performs the method of encoding or decoding a dynamic image described in the embodiments of this application.
The bus system 2003 may include a power bus, a control bus, a status signal bus, and the like, in addition to the data bus. For clarity of illustration, however, the various buses are identified in the figure as the bus system 2003.
Optionally, the codec device 2000 may also include one or more output devices, such as a display 2004. In one example, the display 2004 may be a touch-sensitive display that combines a display with a touch-sensing unit operable to sense touch input. The display 2004 may be connected to the processor 2001 via the bus 2003.
It should be noted that the encoding/decoding apparatus 2000 may execute the method for encoding a moving image in the embodiment of the present application, and may also execute the method for decoding a moving image in the embodiment of the present application.
Those of skill in the art will appreciate that the functions described in connection with the various illustrative logical blocks, modules, and algorithm steps disclosed herein may be implemented as hardware, software, firmware, or any combination thereof. If implemented in software, the functions described in the various illustrative logical blocks, modules, and steps may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium, which corresponds to a tangible medium such as a data storage medium, or any communication medium that facilitates transfer of a computer program from one place to another (e.g., according to a communication protocol). In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementing the techniques described herein. A computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, DVD, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general-purpose microprocessors, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor", as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Additionally, in some aspects, the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques may be fully implemented in one or more circuits or logic elements. In one example, the various illustrative logical blocks, units, and modules within the encoder 100 and the decoder 200 may be understood as corresponding circuit devices or logic elements.
The techniques of embodiments of the present application may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in embodiments of the application to emphasize functional aspects of means for performing the disclosed techniques, but do not necessarily require realization by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit, in conjunction with suitable software and/or firmware, or provided by an interoperating hardware unit (including one or more processors as described above).
That is, the above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a Digital Versatile Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others. It should be noted that the computer-readable storage medium mentioned in the embodiments of this application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that "a plurality" herein means two or more. In the description of the embodiments of this application, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, A and B exist simultaneously, or B exists alone. In addition, to clearly describe the technical solutions of the embodiments of this application, words such as "first" and "second" are used to distinguish between identical or similar items whose functions and effects are substantially the same. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity, order, or importance.
The above-mentioned embodiments are provided by way of example and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (57)

1. A method for encoding a moving picture, the method comprising:
performing semantic segmentation on any frame image in a dynamic image to obtain an image segmentation mask, wherein the dynamic image comprises a plurality of objects, and the image segmentation mask comprises a plurality of image areas in one-to-one correspondence with the plurality of objects;
determining a moving image sequence based on the dynamic images, wherein each frame of image in the moving image sequence comprises an image area where one or more moving objects in the plurality of objects are located;
determining position indication information based on the image segmentation mask, the position indication information indicating a position of an image region in which the one or more moving objects are located;
and coding the moving image sequence and the position indication information into a code stream.
2. The method of claim 1, wherein the moving image sequence comprises one or more sub-image sequences, and the position indication information is the image segmentation mask;
the determining a moving image sequence based on the moving image comprises:
extracting the one or more sub-image sequences based on the image segmentation mask and the dynamic image, the one or more sub-image sequences corresponding to the one or more moving objects one to one.
3. The method according to claim 1, wherein the moving image sequence comprises one or more sub-image sequences, and the position indication information comprises coordinates of one or more specified positions;
the determining a moving image sequence based on the moving image comprises:
extracting the one or more sub-image sequences based on the image segmentation mask and the dynamic image, wherein the one or more sub-image sequences correspond to the one or more moving objects one to one;
the determining position indication information based on the image segmentation mask comprises:
based on the image segmentation mask, coordinates in the dynamic image of a specified location within an image region in which each of the one or more moving objects is located are determined.
4. The method of claim 2 or 3, wherein said extracting the one or more sequences of sub-images based on the image segmentation mask and the dynamic image comprises:
selecting one moving object from the one or more moving objects, and determining a sub-image sequence corresponding to the selected moving object according to the following operations until the sub-image sequence corresponding to each moving object is determined:
determining a position area where the selected moving object is located based on the image segmentation mask;
and based on the position area, extracting an image area where the selected moving object is located from each frame of image except the first frame of image in the dynamic image to obtain a sub-image sequence corresponding to the selected moving object.
5. The method of claim 4, wherein said determining a location area where the selected moving object is located based on the image segmentation mask comprises:
scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, wherein the pixel coordinate set comprises coordinates of a plurality of pixel points;
and determining a position area formed by the pixel coordinate set as the position area where the selected moving object is located.
6. The method according to claim 4 or 5, wherein the extracting, based on the position area, an image area where the selected moving object is located from each frame image of the dynamic image except for the first frame image comprises:
extracting an image area positioned in the position area from each frame of image except the first frame of image in the dynamic image;
or,
and expanding the position area to enable the expanded position area to be a square area, and extracting an image area positioned in the expanded position area from each frame of image except the first frame of image in the dynamic image.
7. The method of claim 3, wherein the designated location is a location with a minimum coordinate or a location with a maximum coordinate.
8. The method of claim 3, wherein the method further comprises:
and coding the number of the one or more moving objects into a code stream.
9. The method according to claim 1, wherein the moving image sequence is the dynamic image, and the position indication information is the image segmentation mask.
10. The method of claim 9, wherein the method further comprises:
determining a plurality of segmentation regions corresponding to the plurality of objects one to one based on the image segmentation mask;
according to the plurality of segmentation areas, carrying out area division on each frame of image except the first frame of image in the dynamic image to obtain a plurality of image areas;
determining an object state corresponding to each of the plurality of segmented regions, wherein the object state comprises a static state or a motion state;
the coding the moving image sequence into a code stream comprises the following steps:
encoding the plurality of image areas into a code stream;
the method further comprises the following steps:
and coding the object state corresponding to each partition area in the plurality of partition areas into a code stream.
11. The method of claim 10, wherein determining a plurality of segmented regions in one-to-one correspondence with the plurality of objects based on the image segmentation mask comprises:
determining a position area where each object in the plurality of objects is located based on the image segmentation mask;
under the condition that the position area where any object in the plurality of objects is located does not contain an integer number of Coding Tree Units (CTUs), expanding the boundary of the position area where that object is located so that it contains an integer number of CTUs;
and determining the position areas of the plurality of objects after the expansion processing as the plurality of segmentation areas.
12. The method of claim 11, wherein said encoding the plurality of image regions into a codestream comprises:
coding each image area in the plurality of image areas into a code stream as a coding block;
or,
coding an area formed by each row of CTUs in each image area in the plurality of image areas into a code stream as a coding block;
wherein, the position area of the reference coding block is positioned in the position area of the referenced coding block.
13. The method of any one of claims 1-12, further comprising:
and coding the first frame image of the dynamic image into a code stream.
14. A method for decoding a moving picture, the method comprising:
analyzing a first frame image from the code stream;
analyzing a moving image sequence and position indication information from the code stream, wherein each frame of image in the moving image sequence comprises image areas where one or more moving objects are located, and the position indication information is used for indicating the positions of the image areas where the one or more moving objects are located;
and rendering and displaying an image area where the one or more moving objects are located in the first frame image based on the moving image sequence and the position indication information to obtain a dynamic image.
15. The method of claim 14, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;
the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects including the one or more moving objects.
16. The method according to claim 15, wherein the rendering and displaying, in the first frame image, an image area in which the one or more moving objects are located based on the moving image sequence and the position indication information, comprises:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
determining the position of an image area where the selected moving object is located based on the image segmentation mask;
and according to the position of the image area where the selected moving object is located, rendering and displaying the image area included in the sub-image sequence corresponding to the selected moving object in the first frame image.
17. The method of claim 16, wherein determining the location of the image region in which the selected moving object is located based on the image segmentation mask comprises:
scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, wherein the pixel coordinate set comprises coordinates of a plurality of pixel points;
and determining the position area formed by the pixel coordinate set as the position of the image area where the selected moving object is located, or expanding the position area formed by the pixel coordinate set to enable the expanded position area to be a square area, and determining the expanded position area as the position of the image area where the selected moving object is located.
18. The method of claim 14, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;
the position indication information includes coordinates of a specified position in the dynamic image within an image area in which each of the one or more moving objects is located.
19. The method according to claim 18, wherein the rendering and displaying, in the first frame image, an image area in which the one or more moving objects are located based on the moving image sequence and the position indication information, comprises:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
and rendering and displaying an image area included in the sub-image sequence corresponding to the selected moving object in the first frame image according to the coordinate of the designated position in the dynamic image in the image area where the selected moving object is located.
20. The method of claim 18 or 19, wherein the designated position is a position with a minimum coordinate or a position with a maximum coordinate.
21. The method of any one of claims 18-20, further comprising:
and analyzing the number of the one or more moving objects from the code stream.
22. The method according to claim 14, wherein the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects.
23. The method according to claim 22, wherein the rendering and displaying, in the first frame image, an image area in which the one or more moving objects are located based on the moving image sequence and the position indication information, comprises:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
determining the position of an image area where the selected moving object is located based on the image segmentation mask;
extracting an image area where the selected moving object is located from each frame image except for a first frame image in the dynamic image based on the position of the image area where the selected moving object is located;
and rendering and displaying the image area of the selected moving object in each frame image of the dynamic image in the first frame image according to the position of the image area of the selected moving object.
24. The method of claim 14, wherein the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects;
the analyzing of the motion image sequence from the code stream includes:
determining a plurality of segmentation regions corresponding to the plurality of objects one to one based on the image segmentation mask;
analyzing an object state corresponding to each partition area in the plurality of partition areas from the code stream, wherein the object state comprises a static state or a motion state;
and analyzing the image area divided by the segmentation area corresponding to the motion state from the code stream based on the object state corresponding to each segmentation area in the plurality of segmentation areas to obtain the motion image sequence.
25. The method of claim 24, wherein determining a plurality of segmented regions in one-to-one correspondence with the plurality of objects based on the image segmentation mask comprises:
determining a position area where each object in the plurality of objects is located based on the image segmentation mask;
when the position area of any object in the plurality of objects does not contain an integer number of CTUs, expanding the boundary of the position area of any object so that the position area of any object contains an integer number of CTUs;
and determining the position areas where the plurality of objects are positioned after the expansion processing as the plurality of segmentation areas.
26. The method according to any one of claims 14 to 20 and 22 to 25, wherein before parsing out the moving image sequence and the position indication information from the code stream, the method further comprises:
receiving an object selection instruction for selecting one or more objects from a plurality of objects included in the dynamic image;
determining one or more objects selected by the object selection instruction as the one or more moving objects.
27. The method of any of claims 14-26, further comprising:
analyzing the type of an encoder used for encoding from the code stream;
and determining the corresponding decoder type according to the analyzed encoder type.
28. An apparatus for encoding a moving picture, the apparatus comprising:
the semantic segmentation module is used for performing semantic segmentation on any frame of image in a dynamic image to obtain an image segmentation mask, wherein the dynamic image comprises a plurality of objects, and the image segmentation mask comprises a plurality of image areas which are in one-to-one correspondence with the objects;
an image sequence extraction module, configured to determine a moving image sequence based on the dynamic images, where each frame of image in the moving image sequence includes an image area where one or more moving objects in the plurality of objects are located;
a position indication information determination module, configured to determine position indication information based on the image segmentation mask, where the position indication information is used to indicate a position of an image region where the one or more moving objects are located;
and the first coding module is used for coding the motion image sequence and the position indication information into a code stream.
29. The apparatus of claim 28, wherein the moving image sequence comprises one or more sub-image sequences, and the position indication information is the image segmentation mask;
the image sequence extraction module comprises:
and the image sequence extraction sub-module is used for extracting the one or more sub-image sequences based on the image segmentation mask and the dynamic image, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects.
30. The apparatus according to claim 28, wherein the moving image sequence comprises one or more sub-image sequences, the position indication information comprises coordinates of one or more specified positions;
the image sequence extraction module comprises:
an image sequence extraction sub-module, configured to extract the one or more sub-image sequences based on the image segmentation mask and the dynamic image, where the one or more sub-image sequences correspond to the one or more moving objects one to one;
the position indication information determination module includes:
and the position coordinate determination submodule is used for determining the coordinates of the specified position in the dynamic image in the image area where each moving object in the one or more moving objects is located based on the image segmentation mask.
31. The apparatus of claim 29 or 30, wherein the image sequence extraction sub-module comprises:
a selection sub-module, configured to select a moving object from the one or more moving objects, determine a sub-image sequence corresponding to the selected moving object through the following modules until the sub-image sequence corresponding to each moving object is determined:
a position area determination submodule for determining a position area where the selected moving object is located based on the image segmentation mask;
and the image area extraction submodule is used for extracting the image area where the selected moving object is located from each frame of image except the first frame of image in the dynamic image based on the position area to obtain a sub-image sequence corresponding to the selected moving object.
32. The apparatus of claim 31, wherein the location area determination submodule is specifically configured to:
scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, wherein the pixel coordinate set comprises coordinates of a plurality of pixel points;
and determining a position area formed by the pixel coordinate set as the position area where the selected moving object is located.
33. The apparatus of claim 31 or 32, wherein the image region extraction sub-module is specifically configured to:
extracting an image area positioned in the position area from each frame of image except the first frame of image in the dynamic image;
or,
and expanding the position area to enable the expanded position area to be a square area, and extracting an image area positioned in the expanded position area from each frame of image except the first frame of image in the dynamic image.
34. The apparatus of claim 30, wherein the designated location is a location with a minimum coordinate or a location with a maximum coordinate.
35. The apparatus of claim 30, wherein the apparatus further comprises:
and the second coding module is used for coding the number of the one or more moving objects into a code stream.
36. The apparatus according to claim 28, wherein the moving image sequence is the dynamic image, and the position indication information is the image segmentation mask.
37. The apparatus of claim 36, wherein the apparatus further comprises:
a segmentation region determination module for determining a plurality of segmentation regions corresponding to the plurality of objects one to one based on the image segmentation mask;
the region dividing module is used for dividing each frame of image except the first frame of image in the dynamic image into a plurality of image regions according to the plurality of divided regions;
an object state determination module, configured to determine an object state corresponding to each of the plurality of segmented regions, where the object state includes a static state or a motion state;
the first encoding module comprises:
the image area coding submodule is used for coding the plurality of image areas into a code stream;
the device further comprises:
and the third coding module is used for coding the object state corresponding to each partition area in the plurality of partition areas into the code stream.
38. The apparatus of claim 37, wherein the split region determination module is specifically configured to:
determining a position area where each object in the plurality of objects is located based on the image segmentation mask;
under the condition that the position area where any object in the plurality of objects is located does not contain an integer number of Coding Tree Units (CTUs), expanding the boundary of the position area where that object is located so that it contains an integer number of CTUs;
and determining the position areas of the plurality of objects after the expansion processing as the plurality of segmentation areas.
39. The apparatus of claim 38, wherein the image region encoding sub-module is specifically configured to:
coding each image area in the plurality of image areas into a code stream as a coding block;
or,
coding an area formed by each row of CTUs in each image area in the plurality of image areas into a code stream as a coding block;
and the position area where the reference coding block is located is positioned in the position area where the referenced coding block is located.
40. The apparatus of any of claims 28-39, wherein the apparatus further comprises:
and the fourth encoding module is used for encoding the first frame image of the dynamic image into a code stream.
41. An apparatus for decoding a moving picture, the apparatus comprising:
the image decoding module is used for analyzing a first frame image from the code stream;
the first decoding module is used for analyzing a moving image sequence and position indication information from the code stream, wherein each frame of image in the moving image sequence comprises image areas where one or more moving objects are located, and the position indication information is used for indicating the positions of the image areas where the one or more moving objects are located;
and the image synthesis module is used for rendering and displaying an image area where the one or more moving objects are located in the first frame image based on the moving image sequence and the position indication information to obtain a dynamic image.
42. The apparatus of claim 41, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;
the position indication information is an image segmentation mask including a plurality of image regions in one-to-one correspondence with a plurality of objects including the one or more moving objects.
43. The apparatus of claim 42, wherein the image synthesis module comprises:
the selection sub-module is used for selecting one moving object from the one or more moving objects, and rendering and displaying the image area where the selected moving object is located through the following modules until the image area where each moving object is located is rendered and displayed:
a position determination submodule for determining a position of an image region where the selected moving object is located, based on the image segmentation mask;
and the rendering and displaying submodule is used for rendering and displaying the image area included in the sub-image sequence corresponding to the selected moving object in the first frame image according to the position of the image area where the selected moving object is located.
44. The apparatus of claim 43, wherein the location determination submodule is specifically configured to:
scanning each pixel point in the image segmentation mask to obtain a pixel coordinate set corresponding to the selected moving object, wherein the pixel coordinate set comprises coordinates of a plurality of pixel points;
and determining the position area formed by the pixel coordinate set as the position of the image area where the selected moving object is located, or expanding the position area formed by the pixel coordinate set to enable the expanded position area to be a square area, and determining the expanded position area as the position of the image area where the selected moving object is located.
45. The apparatus of claim 41, wherein the moving image sequence comprises one or more sub-image sequences, and the one or more sub-image sequences are in one-to-one correspondence with the one or more moving objects;
the position indication information includes coordinates of a specified position in the dynamic image within an image area in which each of the one or more moving objects is located.
46. The apparatus of claim 45, wherein the image synthesis module is specifically configured to:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
and rendering and displaying an image area included in the sub-image sequence corresponding to the selected moving object in the first frame image according to the coordinate of the designated position in the dynamic image in the image area where the selected moving object is located.
47. The apparatus of claim 45 or 46, wherein the designated location is a location with a minimum coordinate or a location with a maximum coordinate.
48. The apparatus of any one of claims 45-47, wherein the apparatus further comprises:
and the second decoding module is used for analyzing the number of the one or more moving objects from the code stream.
49. The apparatus according to claim 41, wherein the moving image sequence is the dynamic image, and the position indication information is an image segmentation mask including a plurality of image areas in one-to-one correspondence with a plurality of objects, the plurality of objects including the one or more moving objects.
50. The apparatus of claim 49, wherein the image synthesis module is specifically to:
selecting one moving object from the one or more moving objects, and rendering and displaying an image area where the selected moving object is located according to the following operations until the image area where each moving object is located is rendered and displayed:
determining the position of an image area where the selected moving object is located based on the image segmentation mask;
extracting an image area where the selected moving object is located from each frame image except for a first frame image in the dynamic image based on the position of the image area where the selected moving object is located;
and rendering and displaying the image area of the selected moving object in each frame image of the dynamic image in the first frame image according to the position of the image area of the selected moving object.
51. The apparatus of claim 41, wherein the position indication information is an image segmentation mask comprising a plurality of image regions in one-to-one correspondence with a plurality of objects, the plurality of objects comprising the one or more moving objects;
the first decoding module includes:
a divided region determining submodule for determining a plurality of divided regions corresponding to the plurality of objects one to one based on the image division mask;
the object state determining submodule is used for analyzing an object state corresponding to each partition area in the plurality of partition areas from the code stream, and the object state comprises a static state or a motion state;
and the image area decoding submodule is used for analyzing the image areas divided by the segmentation areas corresponding to the motion states from the code stream based on the object states corresponding to the segmentation areas in the plurality of segmentation areas to obtain the motion image sequence.
52. The apparatus of claim 51, wherein the split region determination submodule is specifically configured to:
determining a position area where each object in the plurality of objects is located based on the image segmentation mask;
when the position area of any object in the plurality of objects does not contain an integer number of CTUs, expanding the boundary of the position area of any object so that the position area of any object contains an integer number of CTUs;
and determining the position areas where the plurality of objects are positioned after the expansion processing as the plurality of segmentation areas.
53. The apparatus of any of claims 41-47, 49-52, wherein the apparatus further comprises:
an instruction receiving module configured to receive an object selection instruction, the object selection instruction being configured to select one or more objects from a plurality of objects included in the dynamic image;
a moving object determination module for determining one or more objects selected by the object selection instruction as the one or more moving objects.
54. The apparatus of any one of claims 41-53, wherein the apparatus further comprises:
the third decoding module is used for analyzing the encoder type used for encoding from the code stream;
and the decoder type determining module is used for determining the corresponding decoder type according to the analyzed encoder type.
55. An encoding end device is characterized by comprising a memory and a processor;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory to realize the encoding method of the dynamic image according to any one of claims 1 to 13.
56. A decoding-side device, characterized in that the decoding-side device comprises a memory and a processor;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory to realize the dynamic image decoding method according to any one of claims 14 to 27.
57. A computer-readable storage medium having instructions stored therein which, when executed on a computer, cause the computer to perform the steps of the method of any one of claims 1-27.
CN202110421196.1A 2021-04-19 2021-04-19 Method, device, equipment and storage medium for encoding and decoding dynamic image Pending CN115225901A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110421196.1A CN115225901A (en) 2021-04-19 2021-04-19 Method, device, equipment and storage medium for encoding and decoding dynamic image
PCT/CN2022/086880 WO2022222842A1 (en) 2021-04-19 2022-04-14 Dynamic image encoding and decoding methods, apparatus and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110421196.1A CN115225901A (en) 2021-04-19 2021-04-19 Method, device, equipment and storage medium for encoding and decoding dynamic image

Publications (1)

Publication Number Publication Date
CN115225901A 2022-10-21

Family

ID=83604064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110421196.1A Pending CN115225901A (en) 2021-04-19 2021-04-19 Method, device, equipment and storage medium for encoding and decoding dynamic image

Country Status (2)

Country Link
CN (1) CN115225901A (en)
WO (1) WO2022222842A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6587596B1 (en) * 2000-04-28 2003-07-01 Shutterfly, Inc. System and method of cropping an image
CN103997687B (en) * 2013-02-20 2017-07-28 英特尔公司 For increasing the method and device of interaction feature to video
US10242710B2 (en) * 2016-04-07 2019-03-26 Intel Corporation Automatic cinemagraph

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861330A (en) * 2023-03-03 2023-03-28 深圳市小辉智驾智能有限公司 Camera image data transmission method, camera image data identification method and camera image data identification device

Also Published As

Publication number Publication date
WO2022222842A1 (en) 2022-10-27

Similar Documents

Publication Publication Date Title
US20210029381A1 (en) Method and apparatus for obtaining global matched patch
US11373338B2 (en) Image padding in video-based point-cloud compression CODEC
US11704837B2 (en) Point cloud encoding method, point cloud decoding method, encoder, and decoder
US20180192063A1 (en) Method and System for Virtual Reality (VR) Video Transcode By Extracting Residual From Different Resolutions
US20190273929A1 (en) De-Blocking Filtering Method and Terminal
RU2650181C2 (en) Image processing device and method
US11388442B2 (en) Point cloud encoding method, point cloud decoding method, encoder, and decoder
KR102585498B1 (en) Point cloud data transmission device, point cloud data transmission method, point cloud data reception device, and point cloud data reception method
CN110944187B (en) Point cloud encoding method and encoder
CN109074161A (en) Mixed graph and pixel domain framework for 360 degree of videos
CN114697668B (en) Encoding and decoding method of point cloud media and related products
CN112533059A (en) Image rendering method and device, electronic equipment and storage medium
US11706450B2 (en) Partial decoding and reconstruction of a video-based point cloud compression bitstream
CN115514972A (en) Video encoding and decoding method and device, electronic equipment and storage medium
CN112019842A (en) Method and device for point cloud compression, computer equipment and storage medium
WO2022222842A1 (en) Dynamic image encoding and decoding methods, apparatus and device and storage medium
CA3121402C (en) Picture encoding and decoding method and apparatus for video sequence
CN113615201A (en) Method and device for point cloud compression
KR20230144620A (en) Point cloud encoding and decoding method, point cloud encoding and decoding device, computer-readable medium, and electronic device
CN112437312B (en) Video decoding method, encoding method, device, equipment and storage medium
CN116456166A (en) Data processing method of media data and related equipment
US20240137558A1 (en) Vertex motion vector predictor coding for vertex mesh (v-mesh)
KR102677294B1 (en) An apparatus of transmitting point cloud data, a method of transmitting point cloud data, an apparatus of receiving point cloud data, and a method of receiving point cloud data
US20230386086A1 (en) Video compression
KR20230054434A (en) Group-based patch packing of pictures for video-based point cloud coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination