CN113453006B

CN113453006B - Picture packaging method, device and storage medium

Info

Publication number: CN113453006B
Application number: CN202110234142.4A
Authority: CN
Inventors: 文格尔史蒂芬; 崔秉斗; 赵帅
Original assignee: Tencent America LLC
Current assignee: Tencent America LLC
Priority date: 2020-03-25
Filing date: 2021-03-03
Publication date: 2024-04-16
Anticipated expiration: 2041-03-03
Also published as: CN113453006A

Abstract

The embodiment of the disclosure discloses a picture packaging method, a device and a storage medium, wherein the method comprises the following steps: obtaining a plurality of NAL units, the plurality of NAL units including a first NAL unit of a picture and a last NAL unit of the picture; dividing a first NAL unit of a picture into a plurality of first fragments, and dividing a last NAL unit of the picture into a plurality of last fragments; encapsulating the plurality of first fragments into a plurality of first Fragment Unit (FU) packets and encapsulating the plurality of last fragments into a plurality of last FU packets; and transmitting the plurality of first FU packets and the plurality of last FU packets, wherein a last FU packet of the plurality of last FU packets comprises a last FU header comprising a last R-bit, and wherein the last R-bit is set to 1.

Description

Picture packaging method, device and storage medium

Cross reference

The present application claims priority from U.S. Provisional Application No. 62/994,563, filed on March 25, 2020, and U.S. Application No. 17/077,546, filed on October 22, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

The disclosed subject matter relates to video encoding and decoding, and in particular to signaling of picture boundary information for supporting individual access of pictures in video payload formats.

Background

The Real-time Transport Protocol (RTP) is a network protocol for delivering video over IP networks and has been used in communication systems that rely on streaming media, such as video conferencing applications. An RTP payload format has recently received attention for carrying video data conforming to ITU-T Recommendation [H.266] and the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) international standard [ISO23090-3], both also known as Versatile Video Coding (VVC) and developed by the Joint Video Experts Team (JVET). The RTP payload format allows at least one network abstraction layer (NAL) unit to be encapsulated in each RTP packet payload, and allows a NAL unit to be fragmented across multiple RTP packets.

At least some video coding standards recognize the concept of Access Units (AU). In the case of a single layer, the access unit may consist of a single coded picture. In other cases, particularly those involving layered coding and multiview coding, an AU may include multiple coded pictures that share some timing information, e.g., have the same presentation time.

The RTP header may include a so-called "marker" bit (M bit). By convention, in almost all RTP payload formats that recognize the concept of an AU, the M bit is specified to be equal to one for the RTP packet carrying the last bit string of the AU, and is otherwise set to zero. When a receiver receives an RTP packet with the M bit set, it knows that the packet is the last packet of the AU and can process it accordingly. Some details of this processing can be found in the RTP specification.

At least some video coding standards further recognize the concept of a coded picture, which may differ from an AU. For example, the two differ when an AU consists of several coded pictures, as may be the case when spatial or SNR scalability is used, or in the case of redundant pictures.

If the transmitting endpoint obtains the video bitstream it transmits from a storage device/hard drive, such a file may not include readily accessible meta information about access unit or coded picture boundaries, for example because the bitstream is stored in a format commonly referred to as an "Annex B bitstream". In such a scenario, no application programming interface (API) information may be available from the encoder to the RTP encapsulator to signal that a bit string of the bitstream is the final bit string of an AU or of a coded picture. Instead, the RTP encapsulator may have to identify the bit string that includes the end of an AU or coded picture without the side information that is typically available from the encoder.

Disclosure of Invention

In an embodiment, a method of encapsulating a plurality of Network Abstraction Layer (NAL) units of a picture is provided, the method comprising: obtaining the plurality of NAL units, the plurality of NAL units including a first NAL unit of the picture and a last NAL unit of the picture; dividing the first NAL unit of the picture into a plurality of first fragments and dividing the last NAL unit of the picture into a plurality of last fragments; encapsulating the plurality of first fragments into a plurality of first fragmentation unit (FU) packets, and encapsulating the plurality of last fragments into a plurality of last FU packets; and transmitting the plurality of first FU packets and the plurality of last FU packets, wherein a last FU packet of the plurality of last FU packets includes a last FU header, the last FU header includes a last R bit, and the last R bit is set to 1.

In an embodiment, there is provided an apparatus for encapsulating a plurality of Network Abstraction Layer (NAL) units of a picture, the apparatus comprising: an acquisition module for causing at least one processor to acquire the plurality of NAL units, the plurality of NAL units including a first NAL unit of the picture and a last NAL unit of the picture; a partitioning module for causing the at least one processor to divide the first NAL unit of the picture into a plurality of first fragments and divide the last NAL unit of the picture into a plurality of last fragments; an encapsulation module for causing the at least one processor to encapsulate the plurality of first fragments into a plurality of first fragmentation unit (FU) packets and encapsulate the plurality of last fragments into a plurality of last FU packets; and a transmitting module for causing the at least one processor to transmit the plurality of first FU packets and the plurality of last FU packets, wherein a last FU packet of the plurality of last FU packets includes a last FU header, the last FU header includes a last R bit, and the last R bit is set to 1.

In an embodiment, a computing device is provided that includes a processor and a memory; the memory stores a computer program that, when executed by the processor, causes the processor to perform the methods described in the embodiments of the present disclosure.

In an embodiment, a non-transitory computer-readable medium storing instructions is provided, the instructions comprising at least one instruction that, when executed by at least one processor of a device for encapsulating a plurality of Network Abstraction Layer (NAL) units of a picture, cause the at least one processor to perform a method as described in embodiments of the disclosure.

In the above technical solution of the embodiments of the present disclosure, by encapsulating multiple network abstraction layer NAL units of a picture and setting a flag bit in a header of a last FU packet of multiple last FU packets of a last NAL unit of the picture, boundary information of the picture can be conveniently identified, and further, individual access of the picture can be efficiently performed.

Drawings

Further features, properties and various advantages of the disclosed subject matter will become more apparent from the following detailed description and drawings, in which:

Fig. 1 is a simplified block diagram of a communication system according to an embodiment.

Fig. 2 is a schematic diagram of an RTP header according to an embodiment.

Fig. 3 is a schematic diagram of an RTP packet including a payload header and an actual payload, according to an embodiment.

Fig. 4 is a schematic diagram of a NAL unit header in a VVC with bit boundaries, according to an embodiment.

Fig. 5 is a schematic diagram of a Fragment Unit (FU) payload format according to an embodiment.

Fig. 6 is a schematic diagram of an FU header for Versatile Video Coding (VVC) according to an embodiment.

Fig. 7 is a schematic diagram of a VCL NAL unit header with two FU structures according to an embodiment.

Fig. 8 is a flowchart of an example method for encapsulating multiple Network Abstraction Layer (NAL) units of a picture, according to an embodiment.

Fig. 9 is a schematic diagram of a computer system according to an embodiment.

Detailed Description

In an embodiment, a method for signaling and identifying picture boundaries in a Real-time Transport Protocol (RTP) payload format is described for Versatile Video Coding (VVC) and for other protocols and codecs. The indication of picture boundaries may allow efficient play-out buffer handling.

Referring to fig. 1, a communication system may include one or more endpoints (11, 12, 13) that communicate with each other over an IP network (14), such as the Internet, using real-time media such as voice, video, and/or other media. The system may further include at least one media-aware network element (15) configured to manipulate media sent by one endpoint before forwarding the media to other endpoints.

In some such system designs, an endpoint and/or a media-aware network element (MANE) may include an RTP encapsulator that transmits RTP packets over the network to an RTP receiver located, for example, in another endpoint or MANE. In some cases, a sending endpoint may include a video camera functionally coupled to a video encoder, which in turn is coupled to an encapsulator, such that video captured by the video camera is transmitted over the network (14) from the sending endpoint, e.g., endpoint (11), to a receiving endpoint, e.g., endpoint (12), using RTP packets.

In some cases, the transmitting endpoint may not include a video encoder. Instead, video may be retrieved from a file stored on a hard disk drive or the like (16), the hard disk drive or the like (16) being coupled to the endpoint (11).

Some real-time communication technologies for video over the Internet and other IP networks rely on RTP as specified in RFC 3550. In some cases, RTP packets are transmitted from one endpoint or MANE to another endpoint or MANE over UDP over IP. Each RTP packet starts with an RTP header. Fig. 2 illustrates the format of the RTP header as specified in RFC 3550.

The Version (V) field (201) identifies the version of RTP and is equal to 2. The Padding (P) field (202) indicates whether the packet contains at least one additional padding octet at its end. The Extension (X) field (203) indicates whether the fixed header is followed by exactly one header extension. The CSRC Count (CC) field (204) contains the number of CSRC identifiers (210) that follow the fixed header. The Marker (M) field (205) allows significant events, such as access unit boundaries, to be marked in the packet stream. The Payload Type (PT) field (206) indicates the payload type, i.e., the type of media in use, e.g., video encoded according to ITU-T Recommendation H.264 using the RTP payload format RFC 6184 with a certain set of RFC 3984 parameters. In many cases, the PT is selected/negotiated by a call control protocol. The RTP sequence number (207) is incremented by one for each RTP packet transmitted, until it wraps around. The RTP timestamp (208) indicates the time at which the first sample represented in the packet was sampled (the acquisition time) and is typically used as the presentation time. At least some video codecs use a timestamp clock rate of 90 kHz, while for many audio codecs the timestamp clock rate is equal to the sampling rate, e.g., 8 kHz, 44.1 kHz, or 48 kHz. The synchronization source (209) and the contributing sources (210) are described below only to the extent necessary.
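The header fields described above can be illustrated with a minimal Python sketch that parses the fixed RTP header laid out in RFC 3550. The names `RtpHeader` and `parse_rtp_header`, and the example packet bytes, are hypothetical and introduced only for illustration; header extensions and padding are not handled.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class RtpHeader:
    version: int          # V (201)
    padding: bool         # P (202)
    extension: bool       # X (203)
    csrc_count: int       # CC (204)
    marker: bool          # M (205)
    payload_type: int     # PT (206)
    sequence_number: int  # (207)
    timestamp: int        # (208)
    ssrc: int             # synchronization source (209)
    csrc: List[int]       # contributing sources (210)


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header plus the CSRC list."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header needs at least 12 bytes")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet too short for its CSRC list")
    csrc = list(struct.unpack("!%dI" % cc, packet[12:end]))
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrc=csrc,
    )


# Hypothetical packet: V=2, no padding/extension, CC=0, M=1, PT=96.
example = bytes([0x80, 0xE0]) + struct.pack("!HII", 4660, 90000, 0xDEADBEEF)
hdr = parse_rtp_header(example)
```

A receiver following the convention described here would inspect `hdr.marker` to learn whether the packet ends an access unit.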

RTP follows the general approach of application layer framing, so the adaptation to certain payloads, such as coded video formats specified by certain video coding standards, may be specified by auxiliary specifications outside the main RTP specification, called RTP payload formats. Some RTP payload formats reuse the bits of the Network Abstraction Layer (NAL) unit header, which is present in some video coding standards such as H.264 or H.265, as their payload header. In such RTP payload formats and video coding standards, a network abstraction layer unit (NAL unit or NALU) may be a bit string of limited size covering one coded picture or a well-defined part thereof, such as a slice, tile, GOB, etc.

The bit string may include, at its beginning, a relatively short header-like data structure, e.g., 8 or 16 bits in length, containing minimal information about the type of the bit string and, in some scenarios, layering information.

As described above, the RTP header may include a so-called "marker" bit (M bit) (205). By convention, in almost all RTP payload formats that recognize the concept of an AU, the M bit is specified to be equal to one for the RTP packet carrying the last bit string of an AU, and is otherwise set to zero. When a receiver receives an RTP packet with the M bit set, it knows that the packet is the last packet of the AU and can process it accordingly. Some details of this processing can be found in the RTP specification. Referring again to fig. 1, assuming that the transmitting endpoint (11) obtains the video bitstream it transmits from the storage device/hard drive (16), the file may not include readily accessible meta information about access unit or coded picture boundaries, for example because the bitstream is stored in a format commonly referred to as an "Annex B bitstream". In such a scenario, no application programming interface (API) information may be available from the encoder to the RTP encapsulator to signal that a bit string of the bitstream is the final bit string of an AU or coded picture. Instead, the RTP encapsulator may have to identify the bit string that includes the end of an AU or coded picture without the side information commonly available from the encoder.

In one embodiment, the transport layer may use RTP packets to communicate media data including video and audio. Referring to fig. 3, each RTP packet starts with an RTP header. The RTP header fields have been described above. In the same or another embodiment, these RTP header fields may be set in accordance with RFC 3550 and applicable RTP payload specifications.

In the same or another embodiment, the RTP packet may further include an RTP payload header (302). For example, the RTP payload header format may be specified in the RTP payload specification that applies to a given payload. The given payload may be, for example, video encoded according to the VVC specification (also referred to as ITU-T Rec. H.266). The purposes of the RTP payload header may include, for example:

a) Providing control information that is related to the payload and useful for decapsulators (depacketizers), jitter buffer management, etc., to the extent that this information is not available in the RTP header (301) and/or is not available, or not readily available, from the payload (303) itself. For example, the payload (303) may be coded using complex variable length codes, arithmetic coding, etc., which may be adequate for decoding purposes but too heavyweight for a decapsulator located in a MANE;

b) Providing additional functionality. Examples include fragmentation of video units (e.g., coded pictures, coded slices, NAL units, etc.); aggregation of video units; redundant copies of certain syntax elements to enable easy access and/or redundancy in case of packet loss; and so on.

The RTP payload header (302) may be followed by an RTP payload (303). The RTP payload may be encoded according to a media codec specification, such as an audio codec or video codec specification, and may include, for example, at least one compressed or uncompressed audio sample, a compressed or uncompressed picture or portion thereof, or the like.

Hereinafter, embodiments are described with reference to video encoded according to the VVC specification and the corresponding RTP payload format.

VVC uses a NAL unit based video bitstream structure. A NAL unit may be a bit string representing either control data (a non-video coding layer (non-VCL) NAL unit) or coded bits of compressed video data related to a picture, slice, tile, or similar structure (a VCL NAL unit). According to some RTP payload formats, one RTP packet may carry in its payload (303) a single NAL unit (in which case the NAL unit header also serves as the RTP payload header), multiple NAL units (an aggregation packet, which has its own NAL-unit-like structure as an RTP payload header followed by two or more NAL units), or a fragment of a NAL unit (in which case the RTP payload header carries control information for fragmentation and is followed by the fragment of the NAL unit).

Regardless of how many NAL units (or fragments thereof) an RTP packet carries, it is advantageous for the decapsulator to be able to identify the last packet of a given coded picture. In some non-layered environments, this may be achieved through the marker (M) bit (205) of the RTP header, as specified by certain RTP profiles and RTP payload formats.

In the same or another embodiment, the marker bit, when set equal to 1, indicates that the current packet may be the last packet of an access unit in the current RTP stream. When set equal to 0, the marker bit indicates that the current packet may not be the last packet of the access unit. Since, in some non-layered environments, AU boundaries coincide with coded picture boundaries, the marker bit may then also serve as an indication of picture boundaries. However, in layered environments, as well as in some non-layered environments involving, e.g., redundant pictures, a marker bit set at AU boundaries cannot also indicate coded picture boundaries, because there may be more picture boundaries than AU boundaries.

Referring to fig. 4, in the same or another embodiment, the VVC NAL unit header may consist of two bytes (16 bits). The forbidden zero bit F (401) is always zero. Five bits represent the NAL unit type (Type) (404), meaning that there may be up to 32 types of NAL units or NAL-unit-like structures. VCL NAL unit types range from 0 to 12, and non-VCL NAL unit types range from 13 to 31. The Z bit (402), LayerID (403), and Temporal ID (405) are used to manage spatial/SNR and temporal layering, respectively, and are not described in detail herein.
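As an illustration of the two-byte header just described, the following Python sketch splits a NAL unit header into its fields. The text gives only the widths of F (1 bit) and Type (5 bits); the widths assumed here for Z (1 bit), LayerID (6 bits), and Temporal ID (3 bits), and the field order F, Z, LayerID, Type, Temporal ID, follow the H.266 NAL unit header syntax and are an assumption relative to this document. The names `NalHeader` and `parse_nal_header` are hypothetical.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero: int  # F (401), always 0
    z: int               # Z (402)
    layer_id: int        # LayerID (403), assumed 6 bits
    nal_type: int        # Type (404), 5 bits
    temporal_id: int     # Temporal ID field (405), assumed 3 bits


def parse_nal_header(nalu: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into header fields."""
    if len(nalu) < 2:
        raise ValueError("a NAL unit header is two bytes long")
    b0, b1 = nalu[0], nalu[1]
    return NalHeader(
        forbidden_zero=b0 >> 7,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_type=b1 >> 3,
        temporal_id=b1 & 0x07,
    )


# Hypothetical header bytes: layer 0, type 19, temporal ID field 1.
nal_hdr = parse_nal_header(bytes([0x00, 0x99]))
```

Only `nal_type` is needed by the picture boundary logic discussed later; the layering fields are parsed here purely for completeness.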

In the VVC RTP payload format, three different types of RTP packet payload structures are specified. A receiver can identify the type of an RTP packet payload through the type field in the payload header. A single NAL unit packet contains a single NAL unit in its payload, and the NAL unit header of that NAL unit also serves as the payload header. An aggregation packet (AP) contains more than one NAL unit within one access unit and is not further described herein. A fragmentation packet contains a fragmentation unit (FU), which in turn contains a subset of a single NAL unit.

A fragmentation unit (FU) enables a single NAL unit to be fragmented into multiple RTP packets. A fragment of a NAL unit may consist of an integer number of consecutive octets of the NAL unit. Fragments of the same NAL unit may be sent in consecutive order with ascending RTP sequence numbers. A NAL unit that is fragmented and conveyed within FUs is referred to as a fragmented NAL unit.

Referring to fig. 5, in the same or another embodiment, an FU packet may include a payload header (501) in the form of a NAL unit header, which indicates that the packet is a fragmentation packet, together with the following fields: an FU header (502); conditionally, a Decoding Order Number Difference (DONL) field (504) coded in network byte order; an FU payload (505); and, optionally, RTP padding (506).

In the same or another embodiment, referring to fig. 6, the NAL unit type of the NAL unit whose fragment is carried in the FU is signaled in a 5-bit FuType field (604). The FU header may further include an S bit, an E bit, and an R bit. The S bit (601) is set for the first fragment of the NAL unit and cleared otherwise; the E bit (602) is set for the last fragment of the NAL unit and cleared otherwise.

In the same or another embodiment, the R bit (603) may be reserved for future use; in that case, the R bit (603) is set to, for example, 0 by the encapsulator and ignored by the decapsulator.

In the same or another embodiment, the R bit (603) may indicate the first fragment of the first NAL unit, in decoding order, of a coded picture. The bit may be set to 1 if the fragment is the first fragment of the first NAL unit in decoding order of the coded picture, and set to 0 otherwise. The RTP payload specification may also reverse these semantics, so that the bit is set to 0 if the fragment is the first fragment of the first NAL unit in decoding order of the coded picture, and to 1 otherwise.

In the same or another embodiment, the R bit (603) may indicate the last fragment of the last NAL unit, in decoding order, of a coded picture. The bit may be set to 1 if the fragment is the last fragment of the last NAL unit in decoding order of the coded picture, and set to 0 otherwise. The RTP payload specification may also reverse these semantics, so that the bit is set to 0 if the fragment is the last fragment of the last NAL unit in decoding order of the coded picture, and to 1 otherwise.
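The FU header fields can be sketched as a single octet in Python. The bit order assumed here (S, E, R from the most significant bit, followed by the 5-bit FuType) matches the field numbering of fig. 6 but is an assumption, since the text does not spell out bit positions. The function names `pack_fu_header` and `unpack_fu_header` are hypothetical, and how the R bit is interpreted depends on which of the alternatives above is in force.

```python
from typing import Tuple


def pack_fu_header(s: bool, e: bool, r: bool, fu_type: int) -> int:
    """Build a one-octet FU header: S | E | R | FuType (5 bits)."""
    if not 0 <= fu_type <= 31:
        raise ValueError("FuType must fit in 5 bits")
    return (int(s) << 7) | (int(e) << 6) | (int(r) << 5) | fu_type


def unpack_fu_header(octet: int) -> Tuple[bool, bool, bool, int]:
    """Return (S, E, R, FuType) from a one-octet FU header."""
    if not 0 <= octet <= 255:
        raise ValueError("an FU header is a single octet")
    return bool(octet & 0x80), bool(octet & 0x40), bool(octet & 0x20), octet & 0x1F


# Last fragment (E=1) of a type-1 NAL unit that also ends the picture (R=1),
# using the alternative in which R marks the last fragment of the last NAL unit.
last_header = pack_fu_header(s=False, e=True, r=True, fu_type=1)
```

Under the reversed semantics that the text also allows, the same helpers apply and only the value passed for `r` changes.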

In the same or another embodiment, a NAL unit may be determined to be the last NAL unit of a picture if it is the last NAL unit of the bitstream. A NAL unit naluX may also be determined to be the last NAL unit of a picture if one of the following conditions is true for the next VCL NAL unit naluY in decoding order: 1) naluY has nal_unit_type equal to 19 (i.e., PH_NUT), or 2) the high-order bit of the first byte after the NAL unit header of naluY (i.e., picture_header_in_slice_header_flag) is equal to 1.
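The picture boundary test can be sketched in Python as follows. This sketch assumes a two-byte NAL unit header with the 5-bit type in the upper bits of the second byte, and it leaves the selection of naluY (the next NAL unit to examine) to the caller, who passes `None` when naluX is the last NAL unit of the bitstream. The function name `is_last_nal_unit_of_picture` is hypothetical.

```python
from typing import Optional

PH_NUT = 19  # nal_unit_type value named in the text


def is_last_nal_unit_of_picture(next_nalu: Optional[bytes]) -> bool:
    """Decide whether naluX ends a picture by inspecting naluY (next_nalu)."""
    if next_nalu is None:
        # naluX is the last NAL unit of the bitstream.
        return True
    if len(next_nalu) < 2:
        raise ValueError("naluY is shorter than a NAL unit header")
    # Condition 1: naluY has nal_unit_type equal to 19.
    if (next_nalu[1] >> 3) == PH_NUT:
        return True
    if len(next_nalu) < 3:
        raise ValueError("naluY has no byte after its NAL unit header")
    # Condition 2: high-order bit of the first byte after the header is 1.
    return bool(next_nalu[2] & 0x80)
```

An encapsulator reading an Annex B file could call this once per NAL unit, looking one unit ahead, to decide whether the R bit of the final fragment should be set.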

In the same or another embodiment, syntax elements or bits similar to R bits may not be placed in the FU header, but in another appropriate syntax structure of the RTP payload header; for example, in the payload header itself, in the aggregate packet header, in the aggregate unit header, etc.

Referring to fig. 7, in the same or another embodiment, a NAL unit (713) is shown that has been divided into two RTP packets to illustrate the use of FUs. When transmitted over an IP network using RTP, fragments of the same NAL unit may be transmitted in consecutive order with ascending RTP sequence numbers.

The NAL unit (713) may be divided into two fragments, each of which may be carried in its own RTP packet. Alternatively, the NAL unit may be divided into more than two fragments and packets.

For example, the NAL unit (713) may contain n bits and be divided into two fragments that are carried as a first FU payload (710) of k bits and a second FU payload (712) of n-k bits. Each of the two FU payloads follows its respective FU header, e.g., FU payload (710) follows FU header (709) and FU payload (712) follows FU header (711).

In an embodiment, within the first FU header (709), the S bit (701) may be set and the E bit (702) may be cleared to indicate that this is the first fragment of the NAL unit. The Type field (704) is set to the type of the NAL unit. The R bit (703) may be set as described in one of the alternatives above. For example, if the NAL unit (713) is the first NAL unit of a picture, the R bit (703) may be set to indicate that the fragment included in the FU payload (710) is the first fragment of the first NAL unit of the picture.

In the second FU header (711), the S bit (705) is cleared and the E bit (706) is set to indicate that this is the last fragment of the NAL unit. The Type field (708) is set to the type of the NAL unit. The R bit (707) is set as described in one of the alternatives above. For example, if the NAL unit (713) is the last NAL unit of a picture, the R bit (707) may be set to indicate that the fragment included in the FU payload (712) is the last fragment of the last NAL unit of the picture.

In an embodiment, a method of encapsulating a NAL unit into a plurality of RTP packets by an encapsulator according to at least one RTP payload specification may comprise: dividing the NAL unit into a plurality of fragments; and encapsulating each fragment into an RTP packet that includes an FU header containing an R bit. In an embodiment, the R bit may be set by the encapsulator if the NAL unit is the last NAL unit of the coded picture, and cleared otherwise.
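A minimal Python sketch of this encapsulation step is given below. It is not the patented implementation: it assumes a one-octet FU header with S, E, R in the three high-order bits followed by a 5-bit type, assumes that the two-byte NAL unit header is not repeated in the fragments (its type is carried in the FU header instead), and sets the R bit only on the final fragment, as in the alternative where R marks the last fragment of the last NAL unit of a picture. The payload header, the conditional DONL field, and the RTP header are omitted. The name `fragment_nal_unit` and the parameter `max_fragment_size` are hypothetical.

```python
from typing import List


def fragment_nal_unit(nalu: bytes, max_fragment_size: int,
                      last_nal_of_picture: bool) -> List[bytes]:
    """Return one 'FU header + FU payload' byte string per fragment."""
    if max_fragment_size < 1:
        raise ValueError("max_fragment_size must be positive")
    if len(nalu) < 3:
        raise ValueError("NAL unit needs a 2-byte header and a payload")
    nal_type = nalu[1] >> 3  # 5-bit type from the NAL unit header
    body = nalu[2:]          # assumption: header is not copied into fragments
    chunks = [body[i:i + max_fragment_size]
              for i in range(0, len(body), max_fragment_size)]
    if len(chunks) < 2:
        raise ValueError("NAL unit fits in one packet; no fragmentation needed")
    packets = []
    for index, chunk in enumerate(chunks):
        s_bit = int(index == 0)                # first fragment of the NAL unit
        e_bit = int(index == len(chunks) - 1)  # last fragment of the NAL unit
        r_bit = int(bool(e_bit) and last_nal_of_picture)
        fu_header = (s_bit << 7) | (e_bit << 6) | (r_bit << 5) | nal_type
        packets.append(bytes([fu_header]) + chunk)
    return packets


# Hypothetical type-1 NAL unit with a 10-byte body, split into 4-byte fragments.
nalu = bytes([0x00, 0x09]) + bytes(range(10))
fu_packets = fragment_nal_unit(nalu, 4, last_nal_of_picture=True)
```

In this example the three FU headers are 0x81 (S set), 0x01 (middle fragment), and 0x61 (E and R set), so only the final packet signals the end of the picture.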

In an embodiment, a method of decapsulating a NAL unit from a plurality of RTP packets by a decapsulator according to at least one RTP payload specification may comprise: decapsulating each fragment from an RTP packet that includes an FU header containing an R bit; and assembling the plurality of fragments into the NAL unit, which may then be decoded. In an embodiment, the R bit observed by the decapsulator may be equal to one if the NAL unit is the last NAL unit of the coded picture, and equal to zero otherwise.
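The receiving side can be sketched in Python under the same assumptions as on the sending side: a one-octet FU header with S, E, R in the three high-order bits and a 5-bit type, and R set on the last fragment of the last NAL unit of a picture. The sketch further assumes that the fragments have already been ordered by RTP sequence number and that none were lost; rebuilding the NAL unit header from the payload header is omitted. The name `reassemble_fragments` is hypothetical.

```python
from typing import List, Tuple


def reassemble_fragments(fu_packets: List[bytes]) -> Tuple[int, bytes, bool]:
    """Return (nal_unit_type, reassembled fragment bytes, end_of_picture)."""
    if not fu_packets or any(len(p) < 1 for p in fu_packets):
        raise ValueError("every FU packet needs at least an FU header")
    first_header, last_header = fu_packets[0][0], fu_packets[-1][0]
    if not first_header & 0x80:
        raise ValueError("first fragment must have the S bit set")
    if not last_header & 0x40:
        raise ValueError("last fragment must have the E bit set")
    fu_type = first_header & 0x1F
    body = bytearray()
    for packet in fu_packets:
        if packet[0] & 0x1F != fu_type:
            raise ValueError("fragments carry different NAL unit types")
        body += packet[1:]  # strip the FU header, keep the fragment
    end_of_picture = bool(last_header & 0x20)  # R bit of the last fragment
    return fu_type, bytes(body), end_of_picture


# Hypothetical fragments of a type-1 NAL unit that ends a picture.
received = [bytes([0x81, 1, 2]), bytes([0x01, 3]), bytes([0x61, 4])]
nal_type, nal_body, picture_done = reassemble_fragments(received)
```

When `picture_done` is true, a play-out buffer could hand the completed picture to the decoder without waiting for the next packet to arrive.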

Fig. 8 is a flow diagram of an example method 800 for encapsulating multiple NAL units of a picture using at least one processor. In some implementations, at least one of the method blocks in fig. 8 may be performed by an encapsulator or decapsulator, such as those discussed above. In some implementations, at least one of the method blocks in fig. 8 may be performed by another device or a group of devices, such as the endpoints and MANEs discussed above.

As shown in fig. 8, method 800 may include obtaining a plurality of NAL units including a first NAL unit of a picture and a last NAL unit of the picture (block 810).

As further shown in fig. 8, method 800 may include dividing a first NAL unit of a picture into a plurality of first fragments and dividing a last NAL unit of the picture into a plurality of last fragments (block 820).

As further shown in fig. 8, method 800 may include encapsulating a plurality of first fragments into a plurality of first Fragment Unit (FU) packets and encapsulating a plurality of last fragments into a plurality of last FU packets. In an embodiment, a last FU packet of the plurality of last FU packets may comprise a last FU header comprising a last R-bit and the last R-bit may be set, e.g., to 1 (block 830).

As further shown in fig. 8, method 800 may include transmitting a plurality of first FU packets and a plurality of last FU packets (block 840).

In an embodiment, the plurality of first FU packets and the plurality of last FU packets may comprise Real-time Transport Protocol (RTP) packets.

In an embodiment, a first FU packet of the plurality of first FU packets may comprise a first FU header comprising a first R bit and the first R bit may be set to 0.

In an embodiment, a first FU packet of the plurality of first FU packets may comprise a first FU header comprising a first S-bit and a last FU header may comprise a last S-bit.

In an embodiment, the first S bit may be set to 1 and the last S bit may be set to 0.

In an embodiment, the plurality of NAL units may include an intermediate NAL unit between the first NAL unit and the last NAL unit, the intermediate NAL unit may be partitioned into a plurality of intermediate fragments, and the plurality of intermediate fragments may be encapsulated into a plurality of intermediate FU packets.

In an embodiment, a first FU packet of the plurality of first FU packets may comprise a first FU header comprising first E bits, a last FU packet of the plurality of intermediate FU packets may comprise an intermediate FU header comprising intermediate E bits, and a last FU header may comprise last E bits.

In an embodiment, the first E bit may be set to 0, the intermediate E bit may be set to 1, and the last E bit may be set to 0.

While fig. 8 shows example blocks of the method 800, in some embodiments, the method 800 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in fig. 8. Additionally or alternatively, two or more of the blocks of method 800 may be performed in parallel.

Further, the proposed methods may be implemented by processing circuitry (e.g., at least one processor or at least one integrated circuit). In one example, at least one processor executes a program stored in a non-transitory computer-readable medium to perform at least one of the proposed methods.

Corresponding to the method 800 described above, the embodiment of the present disclosure further provides an apparatus for encapsulating a plurality of network abstraction layer NAL units of a picture, where the apparatus includes:

an acquisition module configured to acquire the plurality of NAL units, where the plurality of NAL units includes a first NAL unit of the picture and a last NAL unit of the picture;

a partitioning module configured to divide the first NAL unit of the picture into a plurality of first fragments and divide the last NAL unit of the picture into a plurality of last fragments;

an encapsulation module configured to encapsulate the plurality of first fragments into a plurality of first fragmentation unit (FU) packets and encapsulate the plurality of last fragments into a plurality of last FU packets; and

a transmitting module configured to transmit the plurality of first FU packets and the plurality of last FU packets,

wherein a last FU packet of the plurality of last FU packets comprises a last FU header, the last FU header comprising a last R-bit, and the last R-bit being set to 1.

In some embodiments, the plurality of first FU packets and the plurality of last FU packets comprise Real-time Transport Protocol (RTP) packets.

In some embodiments, a first FU packet of the plurality of first FU packets comprises a first FU header, the first FU header comprises a first R bit, and the first R bit is set to 0.

In some embodiments, a first FU packet of the plurality of first FU packets comprises a first FU header comprising a first S-bit, wherein the last FU header comprises a last S-bit.

In some embodiments, the first S bit is set to 1 and the last S bit is set to 0.

In some embodiments, the plurality of NAL units includes an intermediate NAL unit between the first NAL unit and the last NAL unit, wherein the intermediate NAL unit is partitioned into a plurality of intermediate fragments, wherein the plurality of intermediate fragments are encapsulated into a plurality of intermediate FU packets.

In some embodiments, a first FU packet of the plurality of first FU packets comprises a first FU header comprising a first E-bit, wherein a last FU packet of the plurality of intermediate FU packets comprises an intermediate FU header comprising an intermediate E-bit, wherein the last FU header comprises a last E-bit.

In some embodiments, the first E bit is set to 0, wherein the intermediate E bit is set to 1, and wherein the last E bit is set to 0.
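On the receiving side, the R bit allows a picture boundary to be identified from the FU headers alone. The following is a minimal sketch under the same assumed header layout (S, E, and R in the three most significant bits of a one-byte FU header, followed by a 5-bit FU type); the names `FuHeader`, `parse_fu_header`, and `group_by_picture` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class FuHeader(NamedTuple):
    start: bool            # S bit
    end: bool              # E bit
    last_of_picture: bool  # R bit
    fu_type: int           # low 5 bits


def parse_fu_header(byte: int) -> FuHeader:
    """Decode one FU header byte under the assumed S|E|R|type layout."""
    return FuHeader(bool(byte & 0x80), bool(byte & 0x40),
                    bool(byte & 0x20), byte & 0x1F)


def group_by_picture(
    header_bytes: Iterable[int],
) -> Tuple[List[List[FuHeader]], List[FuHeader]]:
    """Group FU headers, given in transmission order, into pictures.

    A picture is closed when a header with R = 1 is seen. Headers that
    follow the last closed picture are returned separately as pending.
    """
    pictures: List[List[FuHeader]] = []
    current: List[FuHeader] = []
    for byte in header_bytes:
        header = parse_fu_header(byte)
        current.append(header)
        if header.last_of_picture:
            pictures.append(current)
            current = []
    return pictures, current
```

The sketch ignores packet loss and reordering, which a practical receiver would have to handle, for example by means of RTP sequence numbers.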

The techniques described above for signaling and identifying picture boundaries in a video payload format over an IP network may be implemented as computer software using computer-readable instructions and physically stored in one or more computer-readable media. For example, FIG. 9 illustrates a computer system (900) suitable for implementing certain embodiments of the disclosed subject matter.

The computer software may be coded in any suitable machine code or computer language, and may be subject to assembly, compilation, linking, or similar mechanisms to create code comprising instructions. The instructions may be executed directly by one or more computer central processing units (CPUs), graphics processing units (GPUs), and the like, or executed through interpretation, microcode execution, and the like.

The instructions may be executed on various types of computers or components thereof, including, for example, personal computers, tablet computers, servers, smartphones, gaming devices, internet of things devices, and the like.

The components shown in FIG. 9 for the computer system (900) are exemplary in nature and are not intended to suggest any limitation as to the scope of use or functionality of the computer software implementing embodiments of the present application. Nor should the configuration of components be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary embodiment of the computer system (900).

The computer system (900) may include certain human interface input devices. Such human interface input devices may be responsive to input from one or more human users through tactile input (e.g., keystrokes, swipes, data glove movements), audio input (e.g., voice, clapping), visual input (e.g., gestures), or olfactory input (not shown). The human interface devices may also be used to capture certain media not necessarily directly related to conscious input by a human, such as audio (e.g., speech, music, ambient sound), images (e.g., scanned images, photographic images obtained from a still-image camera), and video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).

The human interface input devices may include one or more of the following (only one of each is depicted): a keyboard (901), a mouse (902), a touch pad (903), a touch screen (910), a data glove (not shown), a joystick (905), a microphone (906), a scanner (907), and a camera (908).

The computer system (900) may also include certain human interface output devices. Such human interface output devices may stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human interface output devices may include haptic output devices (e.g., haptic feedback via the touch screen (910), data glove (904), or joystick (905), although there may also be haptic feedback devices that do not serve as input devices), audio output devices (e.g., speakers (909), headphones (not shown)), visual output devices (e.g., screens (910), including cathode ray tube screens, liquid crystal screens, plasma screens, and organic light-emitting diode screens, each with or without touch-screen input capability and each with or without haptic feedback capability, some of which may be capable of two-dimensional visual output or more than three-dimensional output through means such as stereoscopic output; virtual reality glasses (not shown); holographic displays; and smoke tanks (not shown)), and printers (not shown).

The computer system (900) may also include human-accessible storage devices and their associated media, such as optical media including CD/DVD ROM/RW drives (920) with CD/DVD or similar media (921), a thumb drive (922), a removable hard drive or solid state drive (923), legacy magnetic media such as magnetic tape and floppy disks (not shown), specialized ROM/ASIC/PLD-based devices such as security dongles (not shown), and so forth.

It should also be appreciated by those skilled in the art that the term "computer-readable medium" as used in connection with the disclosed subject matter does not include transmission media, carrier waves or other transitory signals.

The computer system (900) may also include an interface to one or more communication networks. A network may be, for example, wireless, wired, or optical. A network may further be local, wide-area, metropolitan, vehicular and industrial, real-time, delay-tolerant, and so forth. Examples of networks include local area networks such as Ethernet, wireless local area networks, cellular networks (GSM, 3G, 4G, 5G, LTE, etc.), wired or wireless wide-area digital television networks (including cable television, satellite television, and terrestrial broadcast television), and vehicular and industrial networks (including CANBus). Some networks typically require an external network interface adapter attached to a general-purpose data port or peripheral bus (949) (e.g., a USB port of the computer system (900)); others are typically integrated into the core of the computer system (900) by attachment to a system bus as described below (e.g., an Ethernet interface in a PC computer system or a cellular network interface in a smartphone computer system). Using any of these networks, the computer system (900) may communicate with other entities. Such communication may be unidirectional and receive-only (e.g., broadcast television), unidirectional and send-only (e.g., CANBus to certain CANBus devices), or bidirectional (e.g., to other computer systems over local or wide-area digital networks). Each of the networks and network interfaces described above may use certain protocols and protocol stacks.

The human interface device, the human accessible storage device, and the network interface described above may be connected to a core (940) of the computer system (900).

The core (940) may include one or more central processing units (CPUs) (941), graphics processing units (GPUs) (942), special-purpose programmable processing units in the form of field programmable gate arrays (FPGAs) (943), hardware accelerators (944) for specific tasks, and the like. These devices, together with read-only memory (ROM) (945), random access memory (RAM) (946), internal mass storage (e.g., internal non-user-accessible hard disk drives, solid state drives, etc.) (947), and the like, may be connected via a system bus (948). In some computer systems, the system bus (948) may be accessible in the form of one or more physical plugs so that it can be extended with additional CPUs, GPUs, and the like. Peripheral devices may be attached directly to the system bus (948) of the core or connected through a peripheral bus (949). Architectures for the peripheral bus include Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), and the like.

The CPU (941), GPU (942), FPGA (943), and accelerator (944) may execute certain instructions that, in combination, may constitute the computer code described above. The computer code may be stored in ROM (945) or RAM (946). Transitional data may also be stored in RAM (946), while permanent data may be stored in, for example, the internal mass storage (947). Fast storage to and retrieval from any of the memory devices may be enabled through the use of cache memory, which may be closely associated with one or more of the CPU (941), GPU (942), mass storage (947), ROM (945), RAM (946), and the like.

The computer readable medium may have computer code embodied thereon for performing various computer implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present application, or they may be of the kind well known and available to those having skill in the computer software arts.

By way of example, and not limitation, the computer system having the architecture (900), and in particular the core (940), may provide functionality as a result of one or more processors (including CPUs, GPUs, FPGAs, accelerators, and the like) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be media associated with the user-accessible mass storage described above, as well as certain non-transitory storage of the core (940), such as the core-internal mass storage (947) or ROM (945). Software implementing various embodiments of the present application may be stored in such devices and executed by the core (940). The computer-readable medium may include one or more storage devices or chips, according to particular needs. The software may cause the core (940), and in particular the processors therein (including CPUs, GPUs, FPGAs, and the like), to perform certain processes or certain portions of certain processes described herein, including defining data structures stored in RAM (946) and modifying such data structures according to the processes defined by the software. Additionally or alternatively, the computer system may provide functionality as a result of logic hardwired or otherwise embodied in circuitry (e.g., the accelerator (944)), which may operate in place of or together with software to perform certain processes or certain portions of certain processes described herein. References to software may include logic, and vice versa, where appropriate. References to a computer-readable medium may include circuitry (e.g., an integrated circuit (IC)) storing executable software, circuitry embodying executable logic, or both, where appropriate. This application includes any suitable combination of hardware and software.

While this application has described a number of exemplary embodiments, various modifications, arrangements, and equivalents of the embodiments are within the scope of this application. It will thus be appreciated that those skilled in the art will be able to devise various arrangements and methods which, although not explicitly shown or described herein, embody the principles of the application and are thus within its spirit and scope.

Claims

1. A method of encapsulating a plurality of network abstraction layer NAL units of a picture, the method comprising:

obtaining the plurality of NAL units, the plurality of NAL units including a first NAL unit of the picture and a last NAL unit of the picture;

splitting the first NAL unit of the picture into a plurality of first fragments and splitting the last NAL unit of the picture into a plurality of last fragments;

encapsulating the plurality of first fragments into a plurality of first fragment unit FU packets, and encapsulating the plurality of last fragments into a plurality of last FU packets; and

transmitting the plurality of first FU packets and the plurality of last FU packets,

wherein a last FU packet of the plurality of last FU packets includes a last FU header, the last FU header including a last E bit and a last R bit, the last E bit being used to indicate that a fragment of the last FU packet is a last fragment of a corresponding NAL unit, and

the last R bit is set to 1;

wherein the last R bit is set to 1 for indicating that the fragment corresponding to the last FU packet of the plurality of last FU packets is the last fragment of the last NAL unit of the picture.

2. The method of claim 1, wherein the plurality of first FU packets and the plurality of last FU packets comprise real-time transport protocol RTP packets.

3. The method of claim 1, wherein a first FU packet of the plurality of first FU packets comprises a first FU header, the first FU header comprising a first R-bit, and

wherein the first R bit is set to 0.

4. The method of claim 1, wherein a first FU packet of the plurality of first FU packets comprises a first FU header, the first FU header comprising a first S-bit,

wherein the last FU header comprises a last S bit.

5. The method of claim 4, wherein the first S bit is set to 1 and the last S bit is set to 0.

6. The method of any of claims 1-5, wherein the plurality of NAL units includes an intermediate NAL unit between the first NAL unit and the last NAL unit,

wherein the intermediate NAL unit is partitioned into a plurality of intermediate fragments,

wherein the plurality of intermediate fragments are encapsulated into a plurality of intermediate FU packets.

7. The method of claim 6, wherein a first FU packet of the plurality of first FU packets comprises a first FU header, the first FU header comprising a first E-bit,

wherein a last FU packet of the plurality of intermediate FU packets comprises an intermediate FU header, the intermediate FU header comprising an intermediate E bit,

wherein the last FU header comprises the last E bit.

8. The method of claim 7, wherein the first E bit is set to 0,

wherein the intermediate E bit is set to 1, and

wherein the last E bit is set to 0.

9. An apparatus for encapsulating a plurality of network abstraction layer NAL units for a picture, the apparatus comprising:

an acquisition module for causing at least one processor to acquire the plurality of NAL units, the plurality of NAL units including a first NAL unit of the picture and a last NAL unit of the picture;

a partitioning module for causing the at least one processor to partition the first NAL unit of the picture into a plurality of first fragments and partition the last NAL unit of the picture into a plurality of last fragments;

an encapsulation module configured to cause the at least one processor to encapsulate the plurality of first fragments into a plurality of first fragment unit FU packets and encapsulate the plurality of last fragments into a plurality of last FU packets; and

a transmitting module for causing the at least one processor to transmit the plurality of first FU packets and the plurality of last FU packets,

wherein a last FU packet of the plurality of last FU packets comprises a last FU header comprising a last E bit and a last R bit, the last E bit being used to indicate that a fragment of the last FU packet is a last fragment of a corresponding NAL unit, and

the last R bit is set to 1.

10. The apparatus of claim 9, wherein the plurality of first FU packets and the plurality of last FU packets comprise real-time transport protocol RTP packets.

11. The apparatus of claim 9, wherein a first FU packet of the plurality of first FU packets comprises a first FU header, the first FU header comprising a first R-bit, and

wherein the first R bit is set to 0.

12. The apparatus of claim 9, wherein a first FU packet of the plurality of first FU packets comprises a first FU header, the first FU header comprising a first S-bit,

wherein the last FU header comprises a last S bit.

13. The apparatus of claim 9, wherein the plurality of NAL units includes an intermediate NAL unit between the first NAL unit and the last NAL unit,

14. A computer device comprising a processor and a memory; the memory stores a computer program which, when executed by the processor, causes the processor to perform the method of any one of claims 1 to 8.

15. A non-transitory computer-readable medium storing instructions, the instructions comprising: at least one instruction that, when executed by at least one processor of a device that encapsulates a plurality of network abstraction layer NAL units of a picture, causes the at least one processor to perform the method of any one of claims 1 to 8.