CN114448955B - Digital audio network transmission method, device, equipment and storage medium

Info

Publication number: CN114448955B
Application number: CN202111678518.7A
Authority: CN
Inventors: 吴健
Original assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Current assignee: Saiyinxin Micro Beijing Electronic Technology Co ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2024-02-02
Anticipated expiration: 2041-12-31
Also published as: CN114448955A

Abstract

The application relates to a digital audio network transmission method, apparatus, device, and storage medium. The method comprises: acquiring subframes of a data burst of an uncompressed Pulse Code Modulation (PCM) digital audio signal, and encapsulating the subframes in an RTP payload field according to the encapsulation format of the Real-time Transport Protocol (RTP); and communicating, through the Session Description Protocol (SDP), the technical metadata required to receive and parse the RTP stream, and transmitting the PCM digital audio signal through a preset serial digital audio interface. The RTP header fields conform to the specification of the RTP fixed header fields, and the RTP payload consists of an interleaved set of subframe sequences in a preset stream format. The media clock and the RTP clock meet the SMPTE requirements, and both run at the PCM digital audio sampling rate. The technical solution of the application enables real-time transmission of PCM digital audio streams over an IP network.

Description

Digital audio network transmission method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of data transmission technologies, and in particular, to a method, an apparatus, a device, and a storage medium for digital audio network transmission.

Background

As technology has developed, audio has become increasingly complex. In the move from early mono audio to stereo, the main concern was handling the left and right channels correctly. With the arrival of surround sound, processing became more complicated. The surround 5.1 speaker system imposes an ordering constraint on its channels, and the surround 6.1 and 7.1 speaker systems, among others, change the processing further: the correct signal must be delivered to the appropriate speaker so that the speakers work together. Thus, as sound becomes more immersive and interactive, the complexity of audio processing increases greatly.

An audio channel (or soundtrack) refers to mutually independent audio signals that are collected or played back at different spatial locations of sound when recorded or played. The number of channels is the number of sound sources when recording sound or the corresponding number of loudspeakers when playing back. For example, in a surround 5.1 speaker system, including 6 audio signals at different spatial locations, each individual audio signal is used to drive a speaker at a corresponding spatial location; in a surround 7.1 speaker system, 8 audio signals at different spatial locations are included, each individual audio signal being used to drive a speaker at a corresponding spatial location.

The transmission capabilities and capacities of IP network devices have steadily increased, enabling IP switching and routing techniques to transmit and switch video, audio and metadata within professional television broadcast equipment. Existing standards have been applied in this regard but still require element manipulation to distinguish between different properties.

Disclosure of Invention

The invention aims to provide a digital audio network transmission method, a device, equipment and a storage medium, so as to transmit PCM digital audio streams in real time through an IP network.

The first aspect of the present application provides a digital audio network transmission method, including:

acquiring subframes of a data burst of an uncompressed pulse code modulated (Pulse Code Modulation, PCM) digital audio signal and encapsulating the subframes in an RTP payload field according to an encapsulation format of a Real-time transport protocol (Real-time Transport Protocol, RTP);

communicating, via the Session Description Protocol (SDP), the technical metadata required for receiving and parsing the RTP stream, and transmitting the PCM digital audio signal via a preset serial digital audio interface;

wherein the RTP header fields conform to the specification of the RTP fixed header fields, and the RTP payload consists of an interleaved set of subframe sequences in a preset stream format; the media clock and the RTP clock meet the SMPTE requirements, and the rates of the media clock and the RTP clock are the same as the PCM digital audio sampling rate.

A second aspect of the present application provides a digital audio network transmission apparatus, including:

the acquisition module is used for acquiring subframes of data bursts of the uncompressed pulse code modulation PCM digital audio signal and encapsulating the subframes in an RTP payload field according to an encapsulation format of a real-time transmission protocol RTP;

the transmission module is used for communicating, through the session description protocol SDP, the technical metadata required for receiving and parsing the RTP stream, and for transmitting the PCM digital audio signal through a preset serial digital audio interface.

A third aspect of the present application provides an electronic device, comprising: a memory and one or more processors;

the memory is used for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the digital audio network transmission method as provided in any of the embodiments.

A fourth aspect of the present application provides a storage medium containing computer-executable instructions that, when executed by a computer processor, implement a digital audio network transmission method as provided in any of the embodiments.

From the above, the digital audio network transmission method of the present application performs real-time transmission of the PCM digital audio stream based on RTP through the IP network.

Drawings

FIG. 1 is a schematic diagram of a three-dimensional acoustic audio model according to an embodiment of the present application;

FIG. 2 is a flowchart of a digital audio network transmission method in an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a digital audio network transmission device according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Examples

As shown in FIG. 1, the three-dimensional acoustic audio model consists of a set of elements, each describing one stage of the audio, and comprises a content production portion and a format production portion.

The content production portion includes: audio program elements, audio content elements, audio object elements, and audio track unique identification elements. The format production portion includes: audio packet format elements, audio channel format elements, audio stream format elements, and audio track format elements.

The audio program element references at least one audio content element; the audio content element references at least one audio object element; the audio object element references the corresponding audio packet format element and the corresponding audio track unique identification element; the audio track unique identification element references the corresponding audio track format element and the corresponding audio packet format element.

The audio packet format element references at least one audio channel format element; the audio stream format element references the corresponding audio channel format element and the corresponding audio packet format element; the audio track format element and the corresponding audio stream format element reference each other. The reference relationships between elements are indicated by arrows in FIG. 1.
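The reference relationships listed above can be sketched as a small directed graph. The following Python fragment is only an illustration: the camelCase element names are assumed labels for the elements named in the text, and the `reachable` helper is hypothetical rather than part of any schema.

```python
from collections import deque

# Directed references between element types, transcribed from the description.
# The camelCase names are assumed labels, not taken from a schema file.
REFERENCES = {
    "audioProgramme": ["audioContent"],
    "audioContent": ["audioObject"],
    "audioObject": ["audioPackFormat", "audioTrackUID"],
    "audioTrackUID": ["audioTrackFormat", "audioPackFormat"],
    "audioPackFormat": ["audioChannelFormat"],
    "audioStreamFormat": ["audioChannelFormat", "audioPackFormat", "audioTrackFormat"],
    "audioTrackFormat": ["audioStreamFormat"],
    "audioChannelFormat": [],
}


def reachable(start: str) -> set[str]:
    """Return every element type reachable from `start` by following references."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in REFERENCES[node]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    print(sorted(reachable("audioProgramme")))
```

Starting from the audio program element, every other element type is reachable, which matches the description of the program as the top of the content portion.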

An audio program may include, but is not limited to, narration, sound effects, and background music. The audio program element describes a program that contains at least one piece of content, and each audio content element describes one component of that program. An audio program element may reference one or more audio content elements, which are combined to construct the complete audio program.

The audio content element describes the content of a component of the audio program (e.g., background music) and references one or more audio object elements to relate the content to its format.

The audio object element establishes the relationship between content, format, and assets, and determines the audio track unique identification of the actual audio track.

The format production portion includes: an audio packet format element, an audio channel format element, an audio stream format element, and an audio track format element.

The audio packet format element may be used to describe the format in which the audio object element and the original audio data are packed in accordance with a channel packet.

The audio channel format element may be used to represent a single sequence of audio samples and preset operations performed thereon, such as rendering movement of objects in a scene. The audio channel format element may contain at least one audio block format element. The audio block format element may be regarded as a sub-element of the audio channel format element, so that there is an inclusion relationship between the audio channel format element and the audio block format element.

An audio stream is a combination of audio tracks required to render channels, objects, higher order ambient sound components or packages. The audio stream format element is used for establishing a relation between the audio track format element set and the audio channel format element set or a relation between the audio track format set and the audio packet format.

The audio track format elements correspond to a set of samples or data in a single track, are used to describe the format of the original audio data, and the decoded signal of the renderer, and are also used to identify the track combination required to successfully decode the track data.

After the original audio data is produced through the three-dimensional audio model, synthesized audio data containing metadata is generated.

Metadata is information that describes the characteristics of data; the functions it supports include indicating storage locations, recording history, looking up resources, and recording files.

After the synthesized audio data is transmitted to the far end in a communication mode, the far end analyzes the synthesized audio data based on the metadata, and the original sound scene is restored or rendered into a new sound scene in real time.

The division between the content creation section, the format creation section and the BW64 (Broadcast Wave-64bit, 64-bit Broadcast Wave) file is shown in fig. 1. Both the content production portion and the format production portion constitute metadata in XML format, which is typically contained in a block ("axml" block) of the BW64 file. The BW64 file section at the bottom contains a "channel assignment (chna)" block, which is a lookup table used to link metadata to audio programs in the file.

The content production portion describes the technical content of the audio, for example whether it contains dialog or a specific language, together with loudness metadata. The format production portion describes the channel types of the audio tracks and how they are combined, e.g. the left and right channels of a stereo pair. The elements of the content production portion are typically unique to the audio and the program, while the elements of the format production portion can be reused.

The audio model is an open, compatible, general-purpose metadata model. However, audio model metadata is designed for local file storage and is not suitable for real-time production and streaming audio applications. When metadata must be transferred remotely in real time together with digital audio, a serial audio metadata schema is required so that existing audio and its associated audio model metadata files can be sliced into frames and streamed.

A frame of serial audio metadata contains a set of audio model metadata describing the audio associated with that frame over a certain period of time. Serial audio metadata has the same structure, attributes, and elements as audio model metadata, plus additional attributes that specify the frame format. Serial audio metadata frames do not overlap and are contiguous, each with a specified start time and duration. Metadata contained in a serial audio metadata frame may describe audio beyond the duration of that frame.

The parent element of the serial audio metadata is a frame (frame), which comprises a frame header (frameHeader) and an audio format extension (audioFormatExtended). The frame header includes two sub-elements: frame format (frameFormat) and transport track format (transportTrackFormat).

The audio format extension includes eight sub-elements: audio program (audioProgramme), audio content (audioContent), audio object (audioObject), audio track unique identifier (audioTrackUID), audio packet format (audioPackFormat), audio channel format (audioChannelFormat), audio stream format (audioStreamFormat), and audio track format (audioTrackFormat).
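As an illustration of the frame layout just described, the following Python sketch builds an empty frame skeleton with the standard-library `xml.etree.ElementTree` module. It is a minimal sketch under assumptions: the element spellings (for example `audioProgramme` and `transportTrackFormat`) are assumed, no attributes or namespaces are included, and `build_frame_skeleton` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Sub-elements named in the description; exact spellings are assumptions.
HEADER_CHILDREN = ["frameFormat", "transportTrackFormat"]
FORMAT_CHILDREN = [
    "audioProgramme", "audioContent", "audioObject", "audioTrackUID",
    "audioPackFormat", "audioChannelFormat", "audioStreamFormat",
    "audioTrackFormat",
]


def build_frame_skeleton() -> ET.Element:
    """Build an empty <frame> with its header and format-extension children."""
    frame = ET.Element("frame")
    header = ET.SubElement(frame, "frameHeader")
    for name in HEADER_CHILDREN:
        ET.SubElement(header, name)
    extension = ET.SubElement(frame, "audioFormatExtended")
    for name in FORMAT_CHILDREN:
        ET.SubElement(extension, name)
    return frame


if __name__ == "__main__":
    print(ET.tostring(build_frame_skeleton(), encoding="unicode"))
```

A real frame would carry attributes such as the start time and duration mentioned earlier; they are omitted here because the text does not give their names.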

The audio model metadata is composed of a content portion (e.g., audio program elements) and a format portion (e.g., audio channel format elements). Only three elements, audio program elements, audio object elements and audio block format elements, have time-dependent parameters stored. In the content part, the start time, end time and duration of an audio program element or audio object element are used to determine the start time, end time or duration of the element, these parameters typically being fixed. In the format part, all parameters in the audio block format element are time-varying parameters.

The audio model metadata can be divided into two groups: namely dynamic metadata (e.g., audio block format elements in an audio channel format element) and static metadata (e.g., audio program elements and audio content elements).

A serial audio metadata frame consists of one or more metadata blocks.

The application provides a digital audio network transmission method, as shown in fig. 2, which comprises the following steps:

S210, acquiring subframes of a data burst of an uncompressed PCM digital audio signal, and encapsulating the subframes in an RTP payload field according to the RTP encapsulation format;

S220, communicating, through SDP, the technical metadata required for receiving and parsing the RTP stream, and transmitting the PCM digital audio signal through a preset serial digital audio interface.

Optionally, the digital audio transmitter conforming to the preset serial digital audio interface supports a digital audio sampling rate of 48kHz, and does not support digital audio sampling rates of 44.1kHz and 96 kHz; the digital audio receiver corresponding to the digital audio transmitter supports digital audio sampling rates of 48kHz and does not support digital audio sampling rates of 44.1kHz and 96 kHz.

Optionally, the offset between the media clock and the RTP clock is zero.

Optionally, the digital audio stream of the digital audio signal conforms to AES67 protocol and session description protocol SDP;

the digital audio transmitter and the corresponding digital audio receiver comply with the specifications of the payload format and the sampling rate of AES 67;

the digital audio transmitter and the corresponding digital audio receiver comply with transmitter timing and receiver buffering timing specifications of AES 67;

the preset serial digital audio interface uses standard UDP datagram size restrictions.

Optionally, the communicating through session description protocol SDP, transmitting the PCM digital audio signal through a preset serial digital audio interface, includes:

if channel order is signaled in the session description protocol SDP, the channel-order parameter uses the syntax of IETF RFC 3190; the convention of the channel order conforms to SMPTE2110.

The preset serial digital audio interface in this embodiment specifies that the RTP-based PCM digital audio stream is transmitted in real time over an IP network by referring to AES 67. An SDP-based signaling method is defined for receiving and parsing metadata required for the stream. For convenience of description, the preset serial digital audio interface is hereinafter referred to as the present interface.
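To make the encapsulation step concrete, the following Python sketch packs interleaved PCM samples behind a 12-byte RTP fixed header as laid out in RFC 3550. It is a minimal sketch under assumptions: the 24-bit big-endian sample encoding, the payload type 101, and the function names are illustrative choices, not requirements stated in this description.

```python
import struct


def pack_sample_24(sample: int) -> bytes:
    """Encode one signed sample as three big-endian bytes (two's complement)."""
    if not -(1 << 23) <= sample < (1 << 23):
        raise ValueError("sample out of 24-bit range")
    return (sample & 0xFFFFFF).to_bytes(3, "big")


def build_rtp_packet(frames, seq, timestamp, ssrc, payload_type=101):
    """Build an RTP packet from `frames`.

    `frames` is a list of tuples, one tuple per sampling instant holding one
    sample per channel, so the payload is channel-interleaved.
    """
    header = struct.pack(
        "!BBHII",
        0x80,                    # version 2, no padding, no extension, CC = 0
        payload_type & 0x7F,     # marker bit 0, 7-bit payload type (assumed 101)
        seq & 0xFFFF,            # 16-bit sequence number
        timestamp & 0xFFFFFFFF,  # 32-bit RTP timestamp
        ssrc & 0xFFFFFFFF,       # 32-bit synchronization source identifier
    )
    payload = b"".join(pack_sample_24(s) for frame in frames for s in frame)
    return header + payload


if __name__ == "__main__":
    packet = build_rtp_packet([(1, -1), (2, -2)], seq=7, timestamp=48, ssrc=0x1234)
    print(len(packet), packet.hex())
```

The sketch omits CSRC lists, header extensions, and padding, which a complete implementation would need to consider.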

For the RTP format of uncompressed PCM digital audio, the media clock, RTP clock, and timestamp references are specified as follows.

The media clock and the RTP clock meet the SMPTE requirements. The rates of the media clock and the RTP clock should be the same as the digital audio sampling rate. All PCM digital audio transmitters and PCM digital audio receivers conforming to the present interface should support a digital audio sampling rate of 48 kHz, and not 44.1 kHz or 96 kHz. Other sampling rates are out of scope.

All senders and receivers that conform to a level above Level A, as specified by the conformance levels in Table 1, should support the sampling rate, packet time, and number of channels specified for their level in Table 1.

TABLE 1

The offset between the media clock and the RTP clock is zero. To maintain compatibility with other RTP implementations (e.g., AES 67), the practitioner should note the offset specification in these RTP standards, as well as the possibility that the RTP clock may be offset from the media clock.
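Because the RTP clock runs at the sampling rate with zero offset from the media clock, the RTP timestamp of a packet is simply the media clock sample count modulo 2^32, and it advances by the number of samples per packet. The following Python sketch illustrates this; the packet times used in the example (1 ms and 125 µs) are assumed illustrative values, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 48_000
RTP_TS_MODULUS = 1 << 32  # RTP timestamps are 32-bit and wrap around


def samples_per_packet(packet_time_us: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of sampling instants carried by one packet of the given duration."""
    total = sample_rate_hz * packet_time_us
    if total % 1_000_000:
        raise ValueError("packet time is not a whole number of samples")
    return total // 1_000_000


def rtp_timestamp(media_clock_samples: int, offset: int = 0) -> int:
    """RTP timestamp for a media clock value; the offset is zero per the text."""
    return (media_clock_samples + offset) % RTP_TS_MODULUS


def packet_timestamps(first_sample: int, packet_time_us: int, count: int) -> list[int]:
    """Timestamps of `count` consecutive packets starting at `first_sample`."""
    step = samples_per_packet(packet_time_us)
    return [rtp_timestamp(first_sample + i * step) for i in range(count)]


if __name__ == "__main__":
    print(samples_per_packet(1000), samples_per_packet(125))
    print(packet_timestamps(RTP_TS_MODULUS - 48, 1000, 3))
```

The `offset` parameter is kept so that the same helper can interoperate with RTP implementations that do use a non-zero offset.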

For interface constraints, the digital audio stream must conform to AES67, including SDP, subject to the definition constraints in this embodiment. Although AES67 is specified in relation to session initiation protocol (Session Initiation Protocol, SIP), the digital audio receiver need not support SIP or any other specific method for connection management mentioned in AES 67.

The digital audio transmitter and the digital audio receiver must comply with AES67 "payload format and sampling rate" specifications.

The digital audio transmitter and digital audio receiver must comply with AES67 "transmitter timing and receiver buffering" timing specifications.

Standard user datagram protocol (User Datagram Protocol, UDP) datagram size limitations must be used.
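A sender can check a stream configuration against the datagram size limit before transmitting. The following Python sketch is illustrative only: the limit is passed in as a parameter because this description does not state its value, the 1460-byte figure in the example is an assumption to be verified against the applicable standard, and whether the limit counts the UDP header is likewise assumed.

```python
RTP_HEADER_BYTES = 12  # RTP fixed header without CSRC entries or extensions
UDP_HEADER_BYTES = 8


def udp_datagram_bytes(channels: int, samples: int, bytes_per_sample: int = 3) -> int:
    """Size of one UDP datagram (UDP header + RTP header + PCM payload)."""
    return UDP_HEADER_BYTES + RTP_HEADER_BYTES + channels * samples * bytes_per_sample


def fits_in_datagram(channels: int, samples: int, limit_bytes: int,
                     bytes_per_sample: int = 3) -> bool:
    """True if the configuration stays within the given datagram size limit."""
    return udp_datagram_bytes(channels, samples, bytes_per_sample) <= limit_bytes


if __name__ == "__main__":
    ASSUMED_LIMIT = 1460  # assumption, not taken from this description
    print(fits_in_datagram(8, 48, ASSUMED_LIMIT))
    print(fits_in_datagram(16, 48, ASSUMED_LIMIT))
```

With 24-bit samples and 48 samples per packet, payload size grows by 144 bytes per channel, which is why high channel counts call for shorter packet times or multiple streams.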

Regarding the channel order convention: if channel order is signaled in SDP, the channel-order parameter uses the syntax of IETF RFC 3190. The <convention> of the channel order must conform to SMPTE2110. Within this convention, <order> shall be a list of channel grouping symbols enclosed in parentheses and separated by commas. The channel grouping symbols, and the number and order of audio channels in each channel grouping, should meet the specifications of Table 2.

For example:

a=fmtp:101 channel-order=SMPTE2110.(51,ST)

This example defines the first six channels as a 5.1 surround sound group and the next two channels as a stereo group.

As another example:

a=fmtp:101 channel-order=SMPTE2110.(M,M,M,M,ST,U02)

This example defines the first four channels as mono channels, the next two channels as a stereo group, and the last two channels as an undefined two-channel group.
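The two attribute lines above can be generated programmatically. The following Python sketch writes them with ASCII `=` characters and a space after the payload type, as SDP attribute syntax expects. The function names are hypothetical, and the `rtpmap` helper assumes an `L24` encoding at 48 kHz, which this description does not mandate.

```python
def channel_order_attribute(payload_type: int, symbols: list[str],
                            convention: str = "SMPTE2110") -> str:
    """Build the a=fmtp line carrying the channel-order parameter."""
    if not symbols:
        raise ValueError("at least one channel grouping symbol is required")
    order = ",".join(symbols)
    return f"a=fmtp:{payload_type} channel-order={convention}.({order})"


def rtpmap_attribute(payload_type: int, channels: int,
                     sample_rate_hz: int = 48_000, encoding: str = "L24") -> str:
    """Build the matching a=rtpmap line (encoding name is an assumption)."""
    return f"a=rtpmap:{payload_type} {encoding}/{sample_rate_hz}/{channels}"


if __name__ == "__main__":
    print(rtpmap_attribute(101, 8))
    print(channel_order_attribute(101, ["51", "ST"]))
    print(channel_order_attribute(101, ["M", "M", "M", "M", "ST", "U02"]))
```

Both examples describe eight channels in total, so the same `rtpmap` line applies to either.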

Any audio channels that do not match the number or order of the sound field groups defined in Table 2 should be identified as undefined and grouped accordingly. "Undefined" is a special designation for groups of channels whose mix is unknown or cannot be represented, or does not conform to the other groupings and orders defined in the convention. The undefined channel grouping symbol shall be "U" followed by a two-digit integer, with a leading zero if needed, which gives the number of channels carried in the undefined group. For example, "U05" denotes a group of five undefined channels.
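The symbol rule for undefined groups is a simple zero-padded format. A minimal Python sketch follows; the helper name is hypothetical, and the upper bound of 99 is only what two digits allow, so a governing specification may impose a lower cap.

```python
def undefined_symbol(channel_count: int) -> str:
    """Return the grouping symbol for `channel_count` undefined channels."""
    if not 1 <= channel_count <= 99:  # two digits allow at most 99 (assumed cap)
        raise ValueError("channel count must fit in two digits")
    return f"U{channel_count:02d}"


if __name__ == "__main__":
    print(undefined_symbol(5))
    print(undefined_symbol(12))
```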

If channel order parameters do not exist, the audio channel should be considered undefined. If the number of channels declared in the channel order parameter does not match the actual number of channels in the stream, audio channels not declared in the channel order parameter should be considered undefined.
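A receiver could apply these rules as follows. The Python sketch below parses a channel-order value and labels each channel of a stream, marking any channel not covered by the declaration as undefined. It is a sketch under assumptions: `KNOWN_GROUPS` lists only the symbols that appear in the examples of this description (the full Table 2 is not reproduced), and all function names are hypothetical.

```python
import re

# Channel counts for the symbols used in the examples above; the complete
# Table 2 defines more groupings than are listed here.
KNOWN_GROUPS = {"M": 1, "ST": 2, "51": 6}


def parse_channel_order(value: str):
    """Parse 'channel-order=<convention>.(<order>)' into (convention, groups)."""
    match = re.fullmatch(r"channel-order=([^.]+)\.\((.*)\)", value.strip())
    if not match:
        raise ValueError(f"not a channel-order parameter: {value!r}")
    convention, order = match.groups()
    groups = []
    for symbol in (part.strip() for part in order.split(",")):
        undefined = re.fullmatch(r"U(\d{2})", symbol)
        if undefined:
            count = int(undefined.group(1))
        elif symbol in KNOWN_GROUPS:
            count = KNOWN_GROUPS[symbol]
        else:
            raise ValueError(f"unknown channel grouping symbol: {symbol!r}")
        groups.append((symbol, count))
    return convention, groups


def label_channels(groups, actual_channels: int) -> list[str]:
    """Label each stream channel; undeclared channels become 'undefined'."""
    labels = [symbol for symbol, count in groups for _ in range(count)]
    labels = labels[:actual_channels]
    labels += ["undefined"] * (actual_channels - len(labels))
    return labels


if __name__ == "__main__":
    convention, groups = parse_channel_order("channel-order=SMPTE2110.(51,ST)")
    print(convention, groups)
    print(label_channels(groups, 10))
```

Passing an empty group list to `label_channels` covers the case where the parameter is absent: every channel is then labeled undefined.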

The channel grouping symbols should be as shown in Table 2. The audio channel names and symbols used in Table 2 are as defined in SMPTE ST 2067-8, except for the 22.2 channel grouping and the undefined channel grouping, which are defined in the present interface. The 22.2 channel grouping has the channel names and order given in Table 2. The undefined channel grouping in Table 2 is not a sound field group; it indicates that the referenced channels belong to no specific group.

TABLE 2

The channel order convention described above makes it possible to identify clearly, in SDP, the channels of a phase-coherent multi-channel audio group (or simply mono). It does not, however, provide any "higher-level" description of the purpose of an audio channel (e.g. "second language"). This embodiment handles only the commonly used fixed multi-channel audio groups and does not handle object-based audio.

Regarding conformance levels, a digital audio receiver conforming to the present interface must at least meet the Level A requirements in Table 3. A digital audio receiver according to this embodiment should state which of the conformance levels in Table 3 it supports when it claims conformance with the present interface.

TABLE 3

SDI allows at least 16 embedded audio channels to be carried. A sender wishing to remain compatible with Level A can do so by splitting its channels across multiple AES67 streams, ideally grouping the channels into logical channel groupings.
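One way to split channels across streams while keeping logical groupings intact is a simple greedy pass. The Python sketch below is illustrative: the per-stream channel limit is a parameter because Table 3 is not reproduced here, the value 8 in the example is an assumption, and `split_into_streams` is a hypothetical helper.

```python
def split_into_streams(groups, max_channels_per_stream: int):
    """Distribute (symbol, channel_count) groups over streams in order.

    Each group is kept whole, and a new stream is started whenever adding the
    next group would exceed the per-stream channel limit.
    """
    streams, current, used = [], [], 0
    for symbol, count in groups:
        if count > max_channels_per_stream:
            raise ValueError(f"group {symbol!r} does not fit in a single stream")
        if used + count > max_channels_per_stream:
            streams.append(current)
            current, used = [], 0
        current.append(symbol)
        used += count
    if current:
        streams.append(current)
    return streams


if __name__ == "__main__":
    sixteen_channels = [("51", 6), ("ST", 2), ("51", 6), ("ST", 2)]
    # Assumed limit of 8 channels per stream; check Table 3 for the real value.
    print(split_into_streams(sixteen_channels, max_channels_per_stream=8))
```

Each resulting stream would then carry its own channel-order parameter listing only the groupings it contains.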

Fig. 3 is a schematic diagram of a digital audio network transmission device according to an embodiment of the present application, including:

an acquisition module 310, configured to acquire a subframe of a data burst of an uncompressed pulse code modulated PCM digital audio signal, and encapsulate the subframe in an RTP payload field according to an encapsulation format of a real-time transport protocol RTP;

a transmission module 320, configured to communicate, through the session description protocol SDP, the technical metadata required for receiving and parsing the RTP stream, and to transmit the PCM digital audio signal through a preset serial digital audio interface.

Optionally, the offset between the media clock and the RTP clock is zero.

Optionally, the transmission module 320 is specifically configured to:

The digital audio network transmission device provided by the embodiment of the invention can execute the digital audio network transmission method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 4, the electronic device includes: a processor 410, a memory 420, an input device 430, and an output device 440. The number of processors 410 in the electronic device may be one or more; one processor 410 is shown as an example in FIG. 4. The number of memories 420 in the electronic device may be one or more; one memory 420 is shown as an example in FIG. 4. The processor 410, memory 420, input device 430, and output device 440 of the electronic device may be connected by a bus or by other means; a bus connection is shown as an example in FIG. 4. The electronic device may be a computer, a server, etc. In the embodiment of the application, the electronic device is described in detail as a server, which may be an independent server or a cluster server.

Memory 420 acts as a computer readable storage medium for storing software programs, computer executable programs, and modules, such as the program instructions/modules of the digital audio network transmission device described in any of the embodiments of the present application. Memory 420 may include primarily a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the device, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device; it may also include a camera for capturing images and a sound pickup device for capturing audio data. The output device 440 may include an audio device such as a speaker. It should be noted that the specific composition of the input device 430 and the output device 440 may be set according to the actual situation.

The processor 410 performs various functional applications of the device and data processing, i.e., implements a digital audio network transmission method, by running software programs, instructions, and modules stored in the memory 420.

Embodiments of the present application also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, implement the digital audio network transmission method provided by any of the embodiments.

Of course, the computer-executable instructions contained in the storage medium provided in the embodiments of the present application are not limited to the operations of the digital audio network transmission method described above, but may also perform the related operations in the digital audio network transmission method provided in any embodiment of the present application, with the corresponding functions and beneficial effects.

From the above description of embodiments, it will be clear to a person skilled in the art that the present application may be implemented by means of software and the necessary general-purpose hardware, or by means of hardware alone, although in many cases the former is the preferred embodiment. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a robot, a personal computer, a server, or a network device, etc.) to execute the digital audio network transmission method according to any embodiment of the present application.

It should be noted that, in the above electronic device, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present application.

It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.

In the description of the present specification, reference to the term "in one embodiment," "in another embodiment," "exemplary," or "in a particular embodiment," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While the application has been described in detail with respect to the general description, the detailed description and the experiments, it will be apparent to those skilled in the art that modifications or improvements can be made thereto based on the application. Accordingly, such modifications or improvements may be made without departing from the spirit of the application and are intended to be within the scope of the invention as claimed.

Claims

1. A digital audio network transmission method, comprising:

acquiring a subframe of a data burst of an uncompressed Pulse Code Modulation (PCM) digital audio signal, and encapsulating the subframe in an RTP payload field according to an encapsulation format of a real-time transport protocol (RTP);

technical metadata required by RTP stream is received and analyzed, communication is carried out through a session description protocol SDP, and the PCM digital audio signal is transmitted through a preset serial digital audio interface;

wherein, the field of RTP header accords with the specification of RTP fixed header field, RTP payload is composed of interleaving set of preset flow format subframe sequence; the media clock and the RTP clock meet the requirements of SMPTE, and the rates of the media clock and the RTP clock are the same as the PCM digital audio sampling rate;

wherein the communication through session description protocol SDP, the PCM digital audio signal transmission through a preset serial digital audio interface, comprises:

2. The method of claim 1, wherein a digital audio transmitter conforming to the preset serial digital audio interface supports digital audio sample rates of 48kHz and not 44.1kHz and 96 kHz; the digital audio receiver corresponding to the digital audio transmitter supports digital audio sampling rates of 48kHz and does not support digital audio sampling rates of 44.1kHz and 96 kHz.

3. The method of claim 1, wherein the offset between the media clock and the RTP clock is zero.

4. The method according to claim 2, wherein the digital audio stream of the digital audio signal conforms to AES67 protocol and session description protocol SDP;

5. A digital audio network transmission apparatus, comprising:

the transmission module is specifically configured to:

6. The apparatus of claim 5, wherein a digital audio transmitter conforming to the preset serial digital audio interface supports digital audio sample rates of 48kHz and not 44.1kHz and 96 kHz; the digital audio receiver corresponding to the digital audio transmitter supports digital audio sampling rates of 48kHz and does not support digital audio sampling rates of 44.1kHz and 96 kHz.

7. The apparatus of claim 5, wherein an offset between the media clock and the RTP clock is zero.

8. An electronic device, comprising: a memory and one or more processors;

the memory is used for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-4.

9. A storage medium containing computer executable instructions which, when executed by a computer processor, implement the method of any one of claims 1-4.