CN113114688A - Multimedia conference management method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN113114688A
CN113114688A
Authority
CN
China
Prior art keywords
audio
information
multimedia conference
multimedia
receiving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110404403.2A
Other languages
Chinese (zh)
Other versions
CN113114688B (en)
Inventor
崔承宗
阮良
陈功
陈洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Langhe Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Langhe Technology Co Ltd filed Critical Hangzhou Langhe Technology Co Ltd
Priority to CN202110404403.2A priority Critical patent/CN113114688B/en
Publication of CN113114688A publication Critical patent/CN113114688A/en
Application granted granted Critical
Publication of CN113114688B publication Critical patent/CN113114688B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/1066 Session management
    • H04L65/1101 Session protocols
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences
    • H04L65/60 Network streaming of media packets
    • H04L65/65 Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
    • H04L65/80 Responding to QoS

Abstract

The embodiments of the disclosure relate to the field of computer technology, and in particular to a multimedia conference management method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: joining the multimedia conference based on conference management information; in response to multimedia data transmission channel parameter information from a signaling server, generating initialization response information and sending it to the signaling server; and creating a target number of audio receiving channels based on the multimedia data transmission channel parameter information, so as to receive, through the target number of audio receiving channels, audio stream data forwarded by a media server in the multimedia conference. By creating a fixed number of audio stream receiving channels for each participant of the multimedia conference to receive the audio stream data of different participants, the method and apparatus avoid the waste of receiving-end resources caused by creating too many audio receiving channels, improve the performance of the receiving end, and shorten the time the receiving end waits before hearing audio.

Description

Multimedia conference management method and device, storage medium and electronic equipment
Technical Field
The embodiments of the disclosure relate to the field of computer technology, and in particular to a multimedia conference management method and apparatus, a storage medium, and an electronic device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims and the description herein is not admitted to be prior art by inclusion in this section.
With the rapid development of internet technology, users can hold multi-party audio or video conferences over the internet, realizing real-time multimedia conferences joined by multiple terminals. In some technologies, a receiver participating in a multimedia conference subscribes to the media streams of all other participants and creates a corresponding data stream receiving channel for each, so the multimedia conference requires a large amount of information interaction; the receiver therefore spends a great deal of time on parsing, which delays the reception and playback of the data streams.
Disclosure of Invention
In this context, embodiments of the present disclosure are intended to provide a multimedia conference management method and apparatus, a storage medium, and an electronic device.
According to an aspect of the present disclosure, there is provided a multimedia conference management method, including:
joining the multimedia conference based on the conference management information;
responding to multimedia data transmission channel parameter information from a signaling server, generating initialization response information and sending the initialization response information to the signaling server; and
creating a target number of audio receiving channels based on the multimedia data transmission channel parameter information, so as to receive, through the target number of audio receiving channels, audio stream data forwarded by a media server in the multimedia conference.
In an exemplary embodiment of the disclosure, the receiving, from the media server, of audio stream data in the multimedia conference through the target number of audio receiving channels includes:
receiving, through the target number of audio receiving channels, the same number of corresponding audio streams forwarded by the media server, selected according to the energy of each audio stream in the multimedia conference.
In an exemplary embodiment of the present disclosure, the target number of audio receiving channels is the same as the number of users in the active speaker list in the media server.
In an exemplary embodiment of the present disclosure, the method further comprises:
acquiring user information of each participant of the multimedia conference from the signaling server, and storing the user information in a first mapping relation table.
In an exemplary embodiment of the present disclosure, the method further comprises:
receiving a first protocol message through the audio receiving channel;
parsing the first protocol message to obtain the source identification information of the message sender and the channel identification information of the audio receiving channel;
constructing a mapping relation based on the source identification information and the channel identification information, and storing it in a second mapping relation table.
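As a hedged sketch of building this second mapping table: the sender's source identifier can be read from the fixed RTP header (bytes 8 through 11, per RFC 3550) and associated with the channel the packet arrived on. The packet bytes, channel id, and dictionary layout below are illustrative assumptions, not the patent's actual wire format.

```python
import struct

def parse_ssrc(rtp_packet: bytes) -> int:
    """The synchronization source (SSRC) occupies bytes 8-11 of the
    fixed RTP header, big-endian (RFC 3550)."""
    return struct.unpack("!I", rtp_packet[8:12])[0]

second_table = {}  # source identifier -> audio receiving channel id

def on_packet(channel_id: int, rtp_packet: bytes) -> None:
    """Record which receiving channel this sender's stream arrives on."""
    second_table[parse_ssrc(rtp_packet)] = channel_id

# A fabricated 12-byte RTP header: V=2, PT=111, seq=7, ts=1000, SSRC=0x1234ABCD.
header = struct.pack("!BBHII", 0x80, 111, 7, 1000, 0x1234ABCD)
on_packet(1, header)
```

Looking the SSRC up in this table later lets the receiver attribute audio on a given channel to a specific participant via the first mapping table.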
In an exemplary embodiment of the present disclosure, the method further comprises:
receiving a second protocol message through the audio receiving channel;
parsing the second protocol message to obtain the contributing source identification information (CSRC), the real-time transport protocol (RTP) timestamp, and the network time protocol (NTP) timestamp, and constructing and storing the corresponding mapping relation in a third mapping relation table.
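A minimal sketch of the third mapping table, assuming the second protocol message carries, per contributing source, the RTP-timestamp/NTP-timestamp pairing that an RTCP sender report provides (RFC 3550); the identifiers and timestamp values are made up for illustration.

```python
# csrc -> (rtp_timestamp, ntp_timestamp)
third_table = {}

def on_timing_report(csrc: int, rtp_ts: int, ntp_ts: float) -> None:
    """Record the RTP/NTP timestamp pairing for one contributing source,
    so RTP media time can later be projected onto the shared wall clock."""
    third_table[csrc] = (rtp_ts, ntp_ts)

on_timing_report(csrc=0xAA01, rtp_ts=160_000, ntp_ts=1_700_000_000.25)
```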
In an exemplary embodiment of the present disclosure, the method further comprises:
performing sound-and-picture (audio-video) synchronization on the received audio stream data and video stream data based on the first, second, and third mapping relation tables.
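The synchronization step can be sketched as follows: using the third table's RTP/NTP pairing, each stream's RTP timestamp is projected onto the shared NTP wall clock, and whichever stream is ahead is delayed. The clock rates (48 kHz audio, 90 kHz video) are the conventional RTP rates; the timestamp values are illustrative assumptions.

```python
def rtp_to_ntp(rtp_ts: int, ref_rtp_ts: int, ref_ntp_ts: float,
               clock_rate: int) -> float:
    """Project an RTP timestamp onto the NTP wall clock using a known
    (RTP, NTP) reference pair and the stream's clock rate."""
    return ref_ntp_ts + (rtp_ts - ref_rtp_ts) / clock_rate

# Audio frame at RTP ts 52_800 (48 kHz), video frame at RTP ts 99_000 (90 kHz),
# both streams share the reference wall-clock instant 100.0 s.
audio_ntp = rtp_to_ntp(52_800, ref_rtp_ts=48_000, ref_ntp_ts=100.0, clock_rate=48_000)
video_ntp = rtp_to_ntp(99_000, ref_rtp_ts=90_000, ref_ntp_ts=100.0, clock_rate=90_000)
skew = audio_ntp - video_ntp  # positive skew: audio is ahead, delay its playout
```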
In an exemplary embodiment of the present disclosure, the method further comprises:
parsing a received audio packet to obtain the corresponding stream-switch extension field;
and determining that the media server has performed a stream-switch operation when a change in the stream-switch extension field is identified.
In an exemplary embodiment of the present disclosure, when it is determined that the media server has performed a stream-switch operation, the method further includes:
performing a reset operation on the jitter buffer; and
clearing the packet-loss retransmission list.
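The steps above can be sketched as follows, under the assumption that the stream-switch extension field is a single comparable value carried in each audio packet (the field name and values are illustrative, not the patent's actual header layout):

```python
class AudioChannel:
    """Detect a media-server stream switch on one audio receiving channel
    and discard state that belonged to the previous stream."""

    def __init__(self):
        self.last_switch_field = None
        self.jitter_buffer = ["pkt-a", "pkt-b"]  # stale packets of the old stream
        self.retransmit_list = [101, 102]        # pending retransmission seq nums

    def on_audio_packet(self, packet: dict) -> bool:
        field = packet["switch_ext"]
        switched = (self.last_switch_field is not None
                    and field != self.last_switch_field)
        self.last_switch_field = field
        if switched:
            self.jitter_buffer.clear()    # old-stream timing is no longer valid
            self.retransmit_list.clear()  # old sequence numbers no longer exist
        return switched

ch = AudioChannel()
first = ch.on_audio_packet({"switch_ext": 7})   # baseline value, no switch
second = ch.on_audio_packet({"switch_ext": 8})  # value changed: stream switch
```

Resetting on a switch matters because the new stream's sequence numbers and timestamps are unrelated to the old stream's, so buffered packets and pending retransmission requests would otherwise corrupt playout.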
In an exemplary embodiment of the present disclosure, each of the audio receiving channels is configured with a data statistics channel; the method further comprises the following steps:
monitoring, through each data statistics channel, the audio stream data corresponding to each participant in the multimedia conference, so as to obtain audio characteristic data of each participant.
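A minimal sketch of one such statistics channel; the particular metrics (packet count, mean frame energy) are illustrative assumptions, since the patent does not enumerate the audio characteristics collected.

```python
class StatsChannel:
    """Accumulate simple audio characteristics for the participant
    currently carried on the paired audio receiving channel."""

    def __init__(self):
        self.packets = 0
        self.energy_sum = 0.0

    def observe(self, frame_energy: float) -> None:
        self.packets += 1
        self.energy_sum += frame_energy

    def features(self) -> dict:
        mean = self.energy_sum / self.packets if self.packets else 0.0
        return {"packets": self.packets, "mean_energy": mean}

stats = StatsChannel()
for energy in (0.2, 0.4, 0.6):
    stats.observe(energy)
```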
In an exemplary embodiment of the present disclosure, the method further comprises:
in response to user exit indication information from the signaling server, invoking a first clearing interface to modify the first mapping relation table and remove the user information of the exiting user.
In an exemplary embodiment of the present disclosure, the method further comprises:
in response to exit indication information of the local user, invoking a second clearing interface to clear the first mapping relation table.
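The two clearing interfaces can be sketched as operations on the first mapping table (user id to user information); the function and table names are illustrative assumptions.

```python
# First mapping table: user id -> participant information.
first_table = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}

def clear_remote_user(user_id: str) -> None:
    """'First clearing interface': drop one remote user when the
    signaling server reports that the user has exited."""
    first_table.pop(user_id, None)

def clear_all_users() -> None:
    """'Second clearing interface': empty the table when the local
    user leaves the conference."""
    first_table.clear()

clear_remote_user("u1")
remaining = set(first_table)
clear_all_users()
```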
According to an aspect of the present disclosure, there is provided a multimedia conference management apparatus including:
the conference response module is used for joining the multimedia conference based on the conference management information;
the response module is used for responding to the multimedia data transmission channel parameter information from the signaling server, generating initialization response information and sending the initialization response information to the signaling server;
a channel establishing module, configured to create a target number of audio receiving channels based on the multimedia data transmission channel parameter information, so as to receive, through the target number of audio receiving channels, the audio stream data forwarded by the media server in the multimedia conference.
In an exemplary embodiment of the present disclosure, the channel establishing module includes:
an audio stream data selection unit, configured to receive, through the target number of audio receiving channels, the same number of corresponding audio streams forwarded by the media server, selected according to the energy of each audio stream in the multimedia conference.
In an exemplary embodiment of the present disclosure, the target number of audio receiving channels is the same as the number of users in the active speaker list in the media server.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
and the first mapping relation processing module is used for acquiring the user information of each participant of the multimedia conference from the signaling server and storing the user information into a first mapping relation table.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the first protocol message processing module is used for receiving a first protocol message through the audio receiving channel; analyzing the first protocol message to acquire information source identification information of a message sender and channel identification information of the audio receiving channel;
and the second mapping relation processing module is used for constructing a mapping relation based on the information source identification information and the channel identification information and storing the mapping relation in a second mapping relation table.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the third mapping relation processing module is used for receiving a second protocol message through the audio receiving channel;
and to parse the second protocol message to obtain the contributing source identification information (CSRC), the real-time transport protocol (RTP) timestamp, and the network time protocol (NTP) timestamp, and to construct and store the corresponding mapping relation in a third mapping relation table.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
and the sound and picture synchronous processing module is used for carrying out sound and picture synchronous processing on the received audio stream data and video stream data based on the first mapping relation table, the second mapping relation table and the third mapping relation table.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
a stream-switch detection module, configured to parse a received audio packet to obtain the corresponding stream-switch extension field, and to determine that the media server has performed a stream-switch operation when a change in the stream-switch extension field is identified.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
a stream-switch management module, configured to perform a reset operation on the jitter buffer and to clear the packet-loss retransmission list when it is determined that the media server has performed a stream-switch operation.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the statistical channel configuration module is used for respectively configuring a data statistical channel for each audio receiving channel;
and the channel monitoring module is used for monitoring audio stream data corresponding to each participant in the multimedia conference through each data statistical channel so as to obtain audio characteristic data of each participant.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
a first clearing response module, configured to, in response to user exit indication information from the signaling server, invoke a first clearing interface to modify the first mapping relation table and remove the user information of the exiting user.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
and the second clearing response module is used for responding to the exit indication information of the local user and calling a second clearing interface to clear the first mapping relation table.
According to an aspect of the present disclosure, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, performs the above-described multimedia conference management method.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any one of the above multimedia conference management methods via execution of the executable instructions.
According to the multimedia conference management method and apparatus of this embodiment, when a user joins a multimedia conference, the terminal receives multimedia data transmission channel parameter information from the signaling server and creates a target number of audio receiving channels based on that information, so that audio stream data in the multimedia conference can be received through a fixed number of audio receiving channels. Because this fixed number of channels receives the audio stream data of different participants, each receiving end in the multimedia conference no longer needs to subscribe to every participant's audio stream; the waste of receiving-end resources caused by creating too many audio receiving channels is avoided, receiving-end performance improves, and the time the receiving end waits before hearing audio is shortened.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 schematically shows a schematic diagram of a multimedia conference management method according to an embodiment of the present disclosure;
fig. 2 schematically illustrates an exemplary system architecture diagram according to an embodiment of the present disclosure;
fig. 3 schematically shows a flowchart of a method of parsing a first protocol packet according to an embodiment of the present disclosure;
fig. 4 schematically shows a flowchart of a method of parsing a second protocol packet according to an embodiment of the present disclosure;
fig. 5 schematically illustrates a flow chart of a method of parsing a stream-switch extension field according to an embodiment of the present disclosure;
fig. 6 schematically shows a frame structure diagram of a user terminal according to an embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of a multimedia conference management apparatus according to an embodiment of the present disclosure;
fig. 8 schematically shows a block diagram of a multimedia conference management apparatus according to an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure; and
fig. 10 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the present disclosure, a multimedia conference management method, a multimedia conference management apparatus, a storage medium, and an electronic device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
The inventors found that, in some technologies, WebRTC (Web Real-Time Communication) is used as an open-source real-time audio/video framework on which developers can build audio and video call functions. WebRTC generally includes modules for audio/video capture and playback, encoding and decoding, pre- and post-processing, network transmission, and QoS functions such as packet-loss and jitter resistance, and is a relatively mature client-side real-time audio/video framework. Its audio jitter buffer module, NetEQ, dynamically adjusts buffer depth according to network conditions. NetEQ comprises three main modules (network delay prediction, audio time-scale acceleration/deceleration, and audio packet-loss concealment); it achieves very low delay when the network is stable and smooth audio when the network fluctuates, striking a good balance between low delay and playback smoothness. An audio/video conference built on the WebRTC framework must exchange the media capabilities of the sending and receiving ends when the conference is established. In the prior art, the SDP (Session Description Protocol) is generally used for this media capability exchange. As the number of participants in an audio conference grows, so does the amount of SDP information the local party must exchange and the number of audio receiving channels it must create. This puts tremendous strain on the performance and memory usage of the receiving end and hinders the realization of large multi-party audio conferences. Even so, SDP-based media capability negotiation is the preferred scheme of most video conference products, as it is general, flexible, and easy to maintain.
In actual media capability interaction, the SDP exchange is divided into an offer and an answer. The SDP offer is the set of media capabilities a client sends to the peer when initiating a call, carrying information such as the local party's audio/video capabilities and network ports. After receiving the offer, the peer forms an SDP answer according to the media capabilities and network port information it supports and returns it. Through this interaction, both parties learn each other's media capabilities and supported network ports, settle on mutually suitable ones, and begin transmitting media data.
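The offer/answer exchange can be sketched as a capability intersection. This is a deliberately simplified model, not real SDP (RFC 8866 defines the actual text grammar): the codec strings, dictionary shapes, and the rule of preserving the offerer's preference order are illustrative assumptions.

```python
# Capabilities of "our side" (the answerer), as codec/clock-rate strings.
LOCAL_CODECS = ["opus/48000/2", "PCMU/8000"]

def build_answer(offer_codecs: list, local_codecs: list = LOCAL_CODECS) -> dict:
    """Form an answer from the codecs both sides support,
    preserving the offerer's preference order."""
    agreed = [c for c in offer_codecs if c in local_codecs]
    if not agreed:
        raise ValueError("no common codec; negotiation fails")
    return {"type": "answer", "codecs": agreed}

offer = {"type": "offer", "codecs": ["opus/48000/2", "G722/8000", "PCMU/8000"]}
answer = build_answer(offer["codecs"])
```

The cost the patent targets is visible in this model: a full-mesh subscription would require one such exchange, and one receiving channel, per remote participant.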
However, in a multi-party audio conference scenario, each receiver subscribes to the audio streams of all other participants and creates a corresponding audio receiving channel for each. Yet because of the ASL (Active Speaker List), the forwarding server only forwards the N highest-energy audio streams to each receiver (N is typically 3). As a result, the receiver's SDP interaction fills WebRTC with invalid audio information and creates far more audio receiving channels than are ever used.
The problem is especially acute in large audio conferences (more than 100 participants), which require a large number of SDP exchanges. If the local party joins after the other participants, its SDP answer must include the audio information of all of them; parsing that answer takes a long time and delays audio reception and playback, so the local user waits too long to hear sound after joining.
In view of the above, the basic idea of the present disclosure is as follows: when each participant joins the multimedia conference, a fixed number of audio data receiving channels is established according to the multimedia data transmission channel parameter information sent by the signaling server, and audio data is received over these channels. This prevents the receiving end from creating excessive, unused audio receiving resources in a multi-party audio conference and improves the receiving end's performance when parsing the SDP protocol.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Exemplary method
A multimedia conference management method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 1.
Referring to fig. 2, a schematic diagram of an exemplary system architecture to which the technical aspects of the embodiments of the present disclosure may be applied is shown. As shown in fig. 2, the system architecture may include a terminal device (e.g., one or more of a smartphone 2041, a tablet 2042, and a computer 2043 shown in fig. 2), a network 203, and a signaling server 201, a media server 202. The network 203 serves as a medium for providing communication links between terminal devices and servers. Network 203 may include various connection types such as wired communication links, wireless communication links, and so forth.
It should be understood that the number of terminal devices, networks, and servers in fig. 2 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the media server 202 may be a server cluster composed of a plurality of servers, and the like.
In an exemplary embodiment of the present disclosure, a user may join a multimedia conference on a terminal device; the terminal joins by interacting with the signaling server 201 and obtains the forwarded audio and video data streams from the media server 202. It should be noted that the multimedia conference management method provided by the present disclosure may be executed on a user terminal and implemented through cooperation between the user terminal and the servers. A terminal device with sufficient computing power may also take over the server's role of forwarding streaming media data and control signaling.
Referring to fig. 1, the multimedia conference management method may include the steps of:
s1, adding the multimedia conference based on the conference management information;
s2, responding to the multimedia data transmission channel parameter information from the signaling server, generating initialization response information and sending the initialization response information to the signaling server; and
s3, creating a target number of audio receiving channels based on the multimedia data transmission channel parameter information, and receiving the audio stream data in the multimedia conference by the media server through the target number of audio receiving channels.
In the multimedia conference management method of this embodiment, the participant terminal receives the multimedia data transmission channel parameter information from the signaling server and creates a target number of audio receiving channels based on it, so that audio stream data in the multimedia conference can be received through that target number of channels. Because a fixed number of channels receives the audio stream data of different participants, the receiving end does not need to subscribe to the audio stream of every participant in the multimedia conference; the waste of receiving-end resources caused by creating too many audio receiving channels is avoided, receiving-end performance improves, and the time the receiving end waits before hearing audio is shortened.
Specifically, in an exemplary multimedia conference management method of the present disclosure:
in step S1, the multimedia conference is joined based on the conference management information.
In an exemplary embodiment of the present disclosure, the multimedia conference may be an audio conference or an audio-video conference. On the terminal side, the conference management information may be the invitation information for joining the multimedia conference sent by the conference initiator, such as a conference link address or a two-dimensional code. By clicking the link or scanning the two-dimensional code on the terminal, the receiving-end user can send the server side (e.g., the signaling server) response information indicating that the terminal will participate, and thereby join the multimedia conference. In addition, after the user terminal joins the multimedia conference, the signaling server can store the user's information as participant information.
In step S2, in response to the multimedia data transmission channel parameter information from the signaling server, an initialization response message is generated and sent to the signaling server.
In an exemplary embodiment of the present disclosure, for the signaling server, the signaling server may send the multimedia data transmission channel parameter information as SDP offer information to the terminals joining the multimedia conference. After receiving the multimedia data transmission channel parameter information, the user terminal can generate corresponding initialization response information as SDP answer information and feed back the SDP answer information to the signaling server.
In step S3, a target number of audio receiving channels is created based on the multimedia data transmission channel parameter information, so as to receive, through the target number of audio receiving channels, audio stream data forwarded by a media server in the multimedia conference.
In an exemplary embodiment of the present disclosure, the multimedia data transmission channel parameter information may specifically include the number of audio receiving channels to create, together with the sampling rate, bit rate, encoder, decoder, and similar information. The user terminal performs initialization with these parameters and creates the target number of audio receiving channels between itself and the media server; the number may be, for example, 3, 4, 5, or 7. Once the channels are created, the user terminal can receive the audio streams forwarded by the media server through them. Because the terminal configures a fixed number of audio receiving channels as instructed by the signaling server, it does not need to subscribe to the audio streams of all participants after joining the multimedia conference; this effectively relieves the pressure on local memory resources, and the terminal device does not need especially high performance.
In an exemplary embodiment of the present disclosure, a number of audio streams equal to the target number, selected and forwarded by the media server based on the energy of each audio stream in the multimedia conference, may be received through the target number of audio receiving channels.
Specifically, for the media server, during the multimedia conference an independent video data transmission channel can be created with each user terminal to serve as that terminal's video receiving channel. Meanwhile, during the multimedia conference, the media server can forward the n audio streams with the highest energy to each receiver, where n is a positive integer, typically 3. On this basis, n audio data transmission channels can be created between the media server and each user terminal to serve as that terminal's audio receiving channels. The number of audio receiving channels at the user terminal matches the number of audio streams forwarded by the media server, so the n audio receiving channels can receive the n highest-energy audio streams forwarded by the media server, with each audio receiving channel carrying exactly one forwarded audio stream. For the media server, forwarding only the n highest-energy audio streams ensures that the audio of the users currently speaking is forwarded accurately, while also reducing the data processing pressure on the media server.
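The forwarding rule above (keep only the n highest-energy streams) can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the function name, the UID-to-energy dictionary, and the energy representation are assumptions.

```python
def select_forwarded_streams(stream_energies, n=3):
    """Return the UIDs of the n highest-energy audio streams.

    stream_energies: dict mapping participant UID -> measured audio energy
    (a hypothetical scalar; the patent does not specify the energy metric).
    """
    ranked = sorted(stream_energies.items(), key=lambda kv: kv[1], reverse=True)
    return [uid for uid, _ in ranked[:n]]
```

With fewer than n active streams, all of them are forwarded, which matches the intent that every current speaker is heard.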
In an exemplary embodiment of the present disclosure, the media server may maintain an active speaker list, to which the speakers (i.e., participants) corresponding to the n highest-energy audio streams are added. The number of speakers in the active speaker list is thus kept consistent with the number of forwarded audio streams, and the target number of audio receiving channels is the same as the number of users in the active speaker list on the media server, so that each audio receiving channel carries the audio stream data of one active speaker.
Alternatively, in other exemplary embodiments of the present disclosure, the number of created audio receiving channels may be greater than the number of forwarded audio streams. For example, the media server forwards the 3 highest-energy audio streams, the user terminal creates 4 audio receiving channels between itself and the media server, and 3 active speakers are configured in the media server's active speaker list. In this case, 1 of the audio receiving channels can serve as a spare channel. When the transmission state of an audio receiving channel is unstable, or the network between the user terminal (participant) and the media server is unstable, the spare channel can be activated for audio data transmission. Likewise, when the media server modifies the active speaker list and increases the number of active speakers, the spare channel is activated to carry the additional audio data. By configuring a spare audio receiving channel, a failing audio receiving channel can be replaced in time, guaranteeing stable transmission of audio data.
Alternatively, the number of users in the active speaker list may be configured to differ from the number of audio streams forwarded by the media server. For example, the active speaker list is configured with 6 users while the media server forwards the 4 highest-energy audio streams; the user terminal then creates a fixed set of 4 audio receiving channels.
In some exemplary embodiments of the present disclosure, when the user terminal and the media server create the audio receiving channels according to the parameter information sent by the signaling server, a video transmission channel and an audio data uploading channel may be created at the same time; the audio data uploading channel uploads the local audio data collected by the user terminal to the media server. The video transmission channel may include a video uploading channel, which uploads local video stream data to the media server, and a video receiving channel, which receives forwarded video stream data from the media server.
In some exemplary embodiments of the present disclosure, the method described above may further include:
and acquiring user information of each participant of the multimedia conference from the signaling server, and storing the user information into a first mapping relation table.
Specifically, after joining a multimedia conference, a user terminal can request the user information of all participants of the multimedia conference from the signaling server; for example, the user information of the other participants can be obtained from the signaling server through a subAudioAslCallback instruction. After receiving the user information sent by the signaling server, the user terminal can store it in a local first mapping relation table. The user information may include the participant's SSRC (Synchronization Source Identifier), MSID (Media Stream Identifier), and UID (User Identifier). The SSRC identifies a synchronization source, and the MSID describes the correspondence between an audio stream and a video stream in the SDP protocol. The first mapping relation table identifies the correspondence between these items of information and may be, for example, ssrc_to_msid_map.
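Populating the first mapping relation table from the signaling server's participant list can be sketched as follows. This is an illustrative assumption about the table's shape (SSRC as key, MSID and UID as values), not the disclosed data layout.

```python
def store_participant_info(ssrc_to_msid_map, participants):
    """Store the SSRC/MSID/UID triples received from the signaling server.

    participants: list of dicts with "ssrc", "msid", "uid" keys
    (hypothetical field names for illustration).
    """
    for p in participants:
        ssrc_to_msid_map[p["ssrc"]] = {"msid": p["msid"], "uid": p["uid"]}
    return ssrc_to_msid_map
```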
In some exemplary embodiments of the present disclosure, as shown with reference to fig. 3, the method described above may further include:
step S41, receiving a first protocol message through the audio receiving channel;
step S42, analyzing the first protocol message to obtain information source identification information of a message sender and channel identification information of the audio receiving channel;
and step S43, constructing a mapping relation based on the information source identification information and the channel identification information and storing the mapping relation in a second mapping relation table.
Specifically, the first protocol packet may be an RTP (Real-time Transport Protocol) packet. The user terminal may receive the RTP packet through a created fixed audio receiving channel and parse it to extract the source identification information of the packet sender, together with the channel identification information of the fixed audio receiving channel corresponding to the RTP packet. After this information is extracted, a corresponding mapping relation can be created and stored in the second mapping relation table. The source identification of the sender can be carried in the CSRC; the channel source identification of the fixed channel can be carried in the SSRC. The second mapping relation table may be, for example, remote_ssrc_original_ssrc_map.
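Maintaining this second mapping relation table amounts to associating the original sender's CSRC with the fixed channel's SSRC, for example as below. The function name and the key direction (CSRC as key, channel SSRC as value) are assumptions for illustration.

```python
def record_source_mapping(remote_ssrc_original_ssrc_map, sender_csrc, channel_ssrc):
    """Associate the original sender (CSRC) with the fixed channel (SSRC)
    that currently carries its audio stream."""
    remote_ssrc_original_ssrc_map[sender_csrc] = channel_ssrc
    return remote_ssrc_original_ssrc_map
```

Because the media server may later switch a different speaker onto the same fixed channel, re-recording the mapping on each parsed packet keeps the table current.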
Furthermore, in other exemplary embodiments of the present disclosure, as shown with reference to fig. 4, the method described above may further include:
step S51, receiving a second protocol message through the audio receiving channel;
step S52, parse the second protocol packet to obtain the contributing source identifier CSRC, the real-time transport protocol RTP timestamp, and the network time protocol NTP timestamp, and construct a corresponding mapping relation and store it in a third mapping relation table.
Specifically, the second protocol packet may be an APP-type RTCP (Real-time Transport Control Protocol) packet rather than an SR-format RTCP packet. The user terminal may receive and parse the RTCP packet through the fixed audio receiving channel to obtain private field information such as the CSRC (Contributing Source Identifier), the NTP (Network Time Protocol) timestamp, and the RTP (Real-time Transport Protocol) timestamp. After each item of information is extracted, a corresponding relation is constructed and stored in a third mapping relation table. The third mapping relation table may be stored in the audio-video synchronization module (AVSync) of the user terminal.
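The APP-type RTCP payload is private to this system and its exact layout is not disclosed; for illustration only, assume three consecutive 32-bit big-endian fields (CSRC, NTP seconds, RTP timestamp) and parse them as below.

```python
import struct

def parse_app_timing_packet(payload: bytes):
    """Parse a hypothetical private APP payload: CSRC, NTP seconds, and
    RTP timestamp as three 32-bit big-endian words. The real format used
    by the patented system may differ."""
    csrc, ntp_sec, rtp_ts = struct.unpack("!III", payload[:12])
    return {"csrc": csrc, "ntp_sec": ntp_sec, "rtp_ts": rtp_ts}
```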
Based on the above, in some exemplary embodiments of the present disclosure, the method described above may further include:
and performing sound and picture synchronization processing on the received audio stream data and video stream data based on the first mapping relation table, the second mapping relation table and the third mapping relation table.
Specifically, in the WebRTC framework, video may be transmitted through the RTCPeerConnection API. For the terminal, AVSync can obtain the correspondence between the SSRC of each fixed channel and its NTP timestamp, RTP timestamp, and MSID from the participant identifier information, channel identifier information, and protocol information recorded in the first, second, and third mapping relation tables. Based on this information together with the identifier information of the video data, audio-video synchronization can be implemented: the audio stream data and video stream data received from the media server are synchronized and presented at the user terminal. Because the first, second, and third mapping relation tables are each constructed in advance from the received packets, the association between the audio stream data sent by the media server and a specific audio receiving channel and source can be determined from the relations among these tables, which guarantees data accuracy during audio-video synchronization.
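One concrete piece of the synchronization arithmetic is mapping an RTP timestamp to wall-clock (NTP) time using an (RTP, NTP) anchor pair such as the one kept in the third mapping relation table; audio and video packets can then be aligned on the common NTP axis. A minimal sketch, assuming a 48 kHz audio clock and millisecond NTP units:

```python
def rtp_to_ntp_ms(rtp_ts, anchor_rtp_ts, anchor_ntp_ms, sample_rate=48000):
    """Convert an RTP timestamp to NTP milliseconds using the latest
    (RTP timestamp, NTP time) anchor pair reported for this source.
    The 48 kHz clock rate is an illustrative assumption."""
    return anchor_ntp_ms + (rtp_ts - anchor_rtp_ts) * 1000.0 / sample_rate
```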
In some exemplary embodiments of the present disclosure, during the multimedia conference the user terminal may also parse the received packets and determine, from the content of the packet header, whether a stream switch has occurred. Specifically, referring to fig. 5, the method described above may include:
step S61, parsing the received audio packet to obtain the corresponding stream-switch extension field;
and step S62, when a change in the stream-switch extension field is identified, determining that a stream switch operation has occurred on the media server.
In particular, an RTP packet generally includes a header and a payload. The header may include: a version field (V), occupying 2 bits, identifying the RTP protocol version; a padding flag (P), occupying 1 bit; an extension flag (X), occupying 1 bit, where X = 1 indicates that an extension header follows the RTP header; a CSRC counter (CC), occupying 4 bits and indicating the number of CSRC identifiers; a marker (M), occupying 1 bit, whose meaning depends on the payload: for video it marks the end of a frame, and for audio it marks the beginning of a talkspurt; and the payload type (PT), occupying 7 bits, indicating the type of payload carried in the RTP packet. For example, a stream-switch extension field may be added in the X extension header. Of course, the stream-switch extension field may instead be added at other positions of the header, or carried in the packet body.
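The fixed header layout described above (RFC 3550) can be parsed as follows; this is a standard RTP header parse, not the patent's private extension handling, and the helper name is illustrative.

```python
import struct

def parse_rtp_fixed_header(packet: bytes):
    """Parse the 12-byte fixed RTP header plus the CSRC list (RFC 3550)."""
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F  # CSRC count
    csrcs = [struct.unpack_from("!I", packet, 12 + 4 * i)[0] for i in range(cc)]
    return {
        "version": b0 >> 6,          # V: 2 bits
        "padding": (b0 >> 5) & 1,    # P: 1 bit
        "extension": (b0 >> 4) & 1,  # X: 1 bit, 1 => extension header follows
        "csrc_count": cc,            # CC: 4 bits
        "marker": b1 >> 7,           # M: 1 bit
        "payload_type": b1 & 0x7F,   # PT: 7 bits
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": csrcs,
    }
```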
For any fixed audio receiving channel, when the currently transmitted audio stream is switched, that is, when a stream switch occurs, the stream-switch extension field in the header is incremented by 1 relative to the previous RTP packet. If this is the first stream switch, the stream-switch extension field is set to 1; if the current value of the field is m, it is changed to m + 1, where m is an integer. By carrying a stream-switch count in a stream switch over field added to the audio RTP packet, incremented by 1 on each stream switch of the media server, the user terminal can accurately determine whether a stream switch has occurred.
For the terminal device, a stream-switch event can be determined from a change in this value. For example, when an RTP packet is received for the first time, the specific value of the stream-switch extension field may be extracted and stored during packet parsing. When a subsequent RTP packet is received, the currently extracted value of the stream-switch extension field is compared with the stored value; if the field has changed, it is determined that a stream switch has occurred, that is, the currently received packet and the previously received packets do not belong to the same audio stream.
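The store-and-compare logic above can be sketched as a small detector; the class name is an illustrative assumption.

```python
class StreamSwitchDetector:
    """Track the stream-switch extension field of successive RTP packets
    and report when its value changes (i.e., a new source is on the channel)."""

    def __init__(self):
        self._last = None  # value seen in the previous packet, if any

    def observe(self, field_value):
        """Return True iff this packet's field differs from the stored one."""
        switched = self._last is not None and field_value != self._last
        self._last = field_value
        return switched
```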
Based on the above, in some exemplary embodiments of the present disclosure, referring to fig. 5, the method may further include:
step S63, performing a reset operation on the jitter buffer; and
in step S64, an emptying operation is performed on the packet loss retransmission list.
Specifically, after the user terminal determines that a stream switch has occurred on the media server and that the currently received packets come from a different source, it can reset the state of the jitter buffer and, at the same time, empty the packet loss retransmission list (NACK list), so that all data stored in the jitter buffer are packets received after the stream switch, and the packet loss retransmission list likewise applies only to packets of the post-switch source. This effectively prevents the receiving end from entering an abnormal state when the media server switches audio streams, and prevents the audio RTP packets received by the receiving end from being disordered. Furthermore, when the fixed number of audio receiving channels is used to receive the audio data of different participants, the received audio packets do not become disordered, and the audio of the different participants can be played normally.
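The two cleanup steps can be sketched together; representing the jitter buffer and NACK list as plain lists is an illustrative simplification of the actual WebRTC structures.

```python
def handle_stream_switch(jitter_buffer: list, nack_list: list):
    """On a detected stream switch, drop buffered packets from the old
    source and cancel pending retransmission requests, since both refer
    to the pre-switch audio stream."""
    jitter_buffer.clear()  # reset: only post-switch packets will be buffered
    nack_list.clear()      # stale NACKs would request packets of the old stream
```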
In addition, in some exemplary embodiments of the present disclosure, during the multimedia conference, it is also possible for the user terminal to manage the participants in real time. Specifically, the method may further include: and responding to the exit indication information of the local user, and calling a second clearing interface to clear the first mapping relation table.
For example, during a multimedia conference, participants can join or leave the multimedia conference at any time. For the user terminal, if the local user exits the multimedia conference, for example by clicking an "exit" control on the terminal device or by directly closing the application, the user terminal generates local user exit indication information in response to the user operation. The user terminal then responds to this indication information by calling a second clearing interface, which clears the first mapping relation table. In particular, the second clearing interface may be the clearAudioRecvStreamInfo interface, which, when executed, clears all existing participant information at once.
Based on the above, in some exemplary embodiments of the present disclosure, the method described above may further include: and responding to the user exit indication information from the signaling server, calling a first clearing interface for modifying the first mapping relation table, and clearing the user information of the exiting user.
For example, during a multimedia conference, if a participant exits the conference or closes the application, notification information about leaving the conference can be sent to the signaling server and the media server, respectively. After detecting that the participant has exited, the signaling server can send user exit indication information to each of the other participants; this indication information includes at least the UID, MSID, and SSRC of the user who left the multimedia conference. After receiving the user exit indication information, the user terminal of each participant can remove the relevant information of the exiting user from the first mapping relation table. In addition, based on the relations among the first, second, and third mapping relation tables, the information about the exiting user in the other tables can be cleared synchronously, and the information about the exiting user can also be deleted from the audio-video synchronization module. Specifically, the removal operation for the exiting user can be performed by calling the first clearing interface, which may be the setAudioRecvStreamInfo interface; executing it removes that participant's existing information locally at the user terminal, so that local resources can be released in time.
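The synchronized cleanup across tables can be sketched as follows, reusing the illustrative table shapes assumed earlier (SSRC-keyed dicts); the function name and return value are assumptions.

```python
def remove_exiting_user(uid, ssrc_to_msid_map, remote_ssrc_original_ssrc_map):
    """Remove every mapping entry that belongs to the user who left,
    keeping the first and second mapping relation tables consistent."""
    gone = [ssrc for ssrc, info in ssrc_to_msid_map.items() if info["uid"] == uid]
    for ssrc in gone:
        del ssrc_to_msid_map[ssrc]
        remote_ssrc_original_ssrc_map.pop(ssrc, None)  # may be absent
    return gone
```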
In some exemplary embodiments of the present disclosure, the method described above may further include: and monitoring audio stream data corresponding to each participant in the multimedia conference through each data statistical channel so as to obtain audio characteristic data of each participant.
Specifically, in order to effectively monitor data transmitted by the audio receiving channels, data statistics channels may be configured for the audio receiving channels, respectively. For example, when a fixed 3-channel audio receiving channel is created between the terminal and the media server, a 3-channel data statistics channel may be created synchronously, and an audio receiving channel may be monitored by using a data statistics channel. And the data statistics channels are used for carrying out statistics on the audio stream data of all participants in real time and reporting.
Each data statistics channel may be used to collect statistics on the audio media packets received at the local end, for example the receiving code rate, the packet loss rate, and the buffer length of the audio jitter buffer. The data statistics module plays an important role in monitoring audio data reception and guaranteeing the subjective audio experience.
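As one example of such a statistic, the packet loss rate can be estimated from the RTP sequence numbers actually received; this simple estimator (ignoring sequence-number wraparound) is an illustrative sketch, not the disclosed statistics module.

```python
def packet_loss_rate(received_seqs):
    """Estimate loss as the fraction of expected sequence numbers missing
    between the lowest and highest sequence numbers observed."""
    if not received_seqs:
        return 0.0
    expected = max(received_seqs) - min(received_seqs) + 1
    return 1.0 - len(set(received_seqs)) / expected
```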
In other exemplary embodiments of the present disclosure, two configuration policies may be configured for the signaling server: the first configuration policy configures a fixed number of audio receiving channels for each participating user terminal; the second configuration policy subscribes each participant's user terminal to the audio stream of every other participant individually. The signaling server may apply the first configuration policy by default. Before the data transmission channels between the user terminal and the media server are created, the network state, the server state, and the terminal state can be evaluated; for example, when the current network state or server state is evaluated as unsuitable for the first configuration policy, the signaling server falls back to the second configuration policy.
Alternatively, the second configuration policy may be applied by default. The signaling server can monitor the network state or the number of participants. For example, during the multimedia conference, when the number of participants exceeds a first, larger threshold, or when the number of participants is below the first threshold but above a second, smaller threshold while the current network state is unstable, the second configuration policy can be switched to the first configuration policy, and the signaling server issues configuration information to the user terminals of the participants so that the clients update their configuration parameters. In addition, when the configuration policy is switched, the current computing capability and storage space of the server and of the participants' terminal devices can serve as reference criteria; for example, when the storage space and computing capability of a participant's user terminal fall below preset lower thresholds, indication information can be sent to the signaling server so that the signaling server changes the configuration policy.
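The threshold-based policy selection above can be sketched as a pure decision function. The threshold values and policy names here are illustrative assumptions (the patent only describes a larger first threshold and a smaller second threshold); 100 is chosen only because the disclosure elsewhere mentions conferences exceeding 100 participants.

```python
def choose_policy(num_participants, network_stable, upper=100, lower=10):
    """Pick between the fixed-channel policy and the per-participant
    subscription policy; thresholds are illustrative, not disclosed values."""
    if num_participants > upper:
        return "fixed_channels"          # large conference: always fixed channels
    if num_participants > lower and not network_stable:
        return "fixed_channels"          # mid-sized conference on an unstable network
    return "subscribe_all"               # small/stable: subscribe to each participant
```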
By monitoring relevant parameters in real time before the multimedia conference starts or in the multimedia conference process and correspondingly adjusting the configuration strategy, the configuration strategy of the signaling server can better accord with the current scene and state, and timely rollback is realized.
In some exemplary embodiments of the present disclosure, the method described above may be implemented as an application or a system plug-in for a user terminal. Referring to FIG. 6, the system framework may include a cross-platform engine encapsulation layer at the upper level and a WebRTC engine layer 62 at the lower level; the cross-platform engine encapsulation layer may be a LAVA engine layer 61. In particular, the upper cross-platform engine encapsulation layer interacts with the signaling server and exchanges data with the lower WebRTC engine.
For example, when joining a multimedia conference, a setSDPAnswer interface can be called to answer the multimedia data transmission channel parameter information from the signaling server and generate the corresponding SDP answer information. The real user information of the other participants is obtained from the signaling server by calling the subAudioAslCallback interface; this information is passed to the Lava layer by calling the setAudioRecvStreamInfo interface, and the Lava layer in turn sets the participant user information into the WebRTC engine, where it is stored in the ssrc_to_msid_map mapping table. The WebRTC engine can also be configured with ASL lists by calling the setAslVersion interface. In addition, during the conference, the information of an existing participant can be removed by calling the setAudioRecvStreamInfo interface, or all participant information can be cleared at once by calling the clearAudioRecvStreamInfo interface.
The WebRTC engine 62 at the lower layer may include a peer connection interface module (Peer Connection Interface) 621, an audio media channel module (WebRTC Voice Media Channel) 622, an audio-video synchronization module (AVSync) 623, an audio stream receiving module (Audio Recv Stream) 624, and a cache module (NetEQ) 625. The RTP packets received by the user terminal from the media server are parsed by an RTP parsing module in the WebRTC engine, which examines the header to determine whether a stream switch has occurred and sends the result of the stream-switch analysis to the audio stream receiving module. The parsed mapping between CSRC and SSRC is stored in remote_ssrc_original_ssrc_map, and the SSRC, MSID, and UID of each participant obtained from the signaling server are stored in ssrc_to_msid_map; both tables can be kept in the audio media channel module. The RTCP packets received from the media server are parsed by an RTCP parsing module (Parse RTCP) 626 in the WebRTC engine, and the parsed CSRC, NTP, and RTP information is sent to the audio-video synchronization module. The audio media channel module can synchronize the mapping relation tables to the audio-video synchronization module, and can also clear data from the audio-video synchronization module by calling a clearSync interface. The audio-video synchronization module synchronizes audio data and video data through the user identification information and the channel identification information, and the resulting data can be stored in the NetEQ cache through the audio stream receiving module.
In summary, in the method provided by the present disclosure, the media server forwards the n highest-energy audio streams during the multimedia conference, and each participant's user terminal creates a fixed number n of audio receiving channels between itself and the media server under the direction of the signaling server, that is, a number of audio receiving channels equal to the size of the server's active speaker list (ASL). Each user terminal can thus receive the audio data of the other participants forwarded by the media server through the fixed set of audio receiving channels, without subscribing to every participant and creating as many audio receiving channels as there are participants, which greatly reduces the memory consumption of both the server and the user terminals. Meanwhile, only the received-audio information for the ASL-sized set of channels needs to be kept in the SDP answer; shortening the SDP answer reduces the performance cost of parsing it at the receiving end. In addition, transmitting audio data over fixed channels reduces the SDP-related protocol length, improves receiving-end performance, and shortens the time a user waits before hearing the other participants. On this basis, creating fixed channels to receive audio data improves receiving-end performance, makes large-scale audio conferences (with more than 100 participants) feasible, and greatly improves the user experience.
Exemplary devices
Having introduced the multimedia conference management method of the exemplary embodiments of the present disclosure, a multimedia conference management apparatus of the exemplary embodiments of the present disclosure is next described with reference to fig. 7.
Referring to fig. 7, the multimedia conference management apparatus 70 of the exemplary embodiment of the present disclosure may include: a conference response module 701, a response module 702, and a channel establishment module 703, wherein:
the conference response module 701 may be configured to join a multimedia conference based on conference management information.
The response module 702 may be configured to generate an initialization response message in response to the multimedia data transmission channel parameter information from the signaling server and send the initialization response message to the signaling server.
The channel establishing module 703 may be configured to create a target number of audio receiving channels based on the multimedia data transmission channel parameter information, so as to receive audio stream data in the multimedia conference by a media server through the target number of audio receiving channels.
According to an exemplary embodiment of the present disclosure, the channel establishing module 703 may include: an audio stream data selection unit.
The audio stream data selection unit may be configured to receive, through the target number of audio receiving channels, a number of audio streams equal to the target number, forwarded by the media server based on the energy of each audio stream in the multimedia conference.
According to an exemplary embodiment of the present disclosure, the target number of audio receiving channels is the same as the number of users in the active speaker list in the media server.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a first mapping relation processing module 704.
The first mapping relation processing module 704 may be configured to obtain user information of each participant of the multimedia conference from the signaling server, and store the user information in a first mapping relation table.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a first protocol message processing module 705 and a second mapping relation processing module 706; wherein:
the first protocol packet processing module 705 may be configured to receive a first protocol packet through the audio receiving channel; and analyzing the first protocol message to acquire information source identification information of a message sender and channel identification information of the audio receiving channel.
The second mapping relation processing module 706 may be configured to construct a mapping relation based on the source identification information and the channel identification information and store the mapping relation in a second mapping relation table.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a third mapping relation processing module 707.
The third mapping relation processing module 707 may be configured to receive a second protocol packet through the audio receiving channel; and analyzing the second protocol message to acquire special information source identification information CSRC, a real-time transmission protocol RTP and a network time protocol NTP, and constructing and storing a corresponding mapping relation in a third mapping relation table.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: and a sound and picture synchronization processing module 708.
The audio-video synchronization processing module 708 may be configured to perform audio-video synchronization processing on the received audio stream data and video stream data based on the first mapping relation table, the second mapping relation table, and the third mapping relation table.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a stream switch detection module 709.
The stream switch detection module 709 may be configured to parse the received audio packet to obtain the corresponding stream-switch extension field, and to determine that a stream switch operation has occurred on the media server when a change in the stream-switch extension field is identified.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a stream switch management module 710.
The stream switch management module 710 may be configured to reset the jitter buffer and to empty the packet loss retransmission list when it is determined that a stream switch operation has occurred on the media server.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a statistical channel configuration module 711 and a channel monitoring module 712; wherein:
the statistical channel configuration module 711 may be configured to configure a data statistical channel for each of the audio receiving channels.
The channel monitoring module 712 may be configured to monitor audio stream data corresponding to each participant in the multimedia conference through each data statistics channel, so as to obtain audio feature data of each participant.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a first clear response module 713.
The first clearing response module 713 may be configured to respond to the user exit indication information from the signaling server, to invoke a first clearing interface for modifying the first mapping relation table, and to clear the user information of the exiting user.
According to an exemplary embodiment of the present disclosure, as shown with reference to fig. 8, the apparatus 70 may further include: a second clear response module 714.
The second clear response module 714 may be configured to invoke a second clear interface to clear the first mapping relationship table in response to the local user exit indication information.
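The two clearing interfaces above, one removing a single exiting remote user and one dropping the whole table when the local user leaves, can be sketched as follows; the interface names and table layout are assumptions for illustration:

```python
# Hypothetical sketch of the two clearing interfaces acting on the
# first mapping relation table (user id -> user information).

class UserTable:
    def __init__(self):
        self.first_table = {"u1": "Alice", "u2": "Bob"}

    def clear_user(self, user_id):
        """First clearing interface: remove one exiting remote user."""
        self.first_table.pop(user_id, None)

    def clear_all(self):
        """Second clearing interface: the local user left, drop everything."""
        self.first_table.clear()

t = UserTable()
t.clear_user("u1")
print(sorted(t.first_table))  # ['u2']
t.clear_all()
print(t.first_table)          # {}
```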
Each functional module of the multimedia conference management apparatus 70 corresponds to a step of the multimedia conference management method described above. Since each functional module implements the same content as the corresponding method embodiment, the apparatus embodiments are not described in detail again here.
Exemplary storage Medium
Having described the multimedia conference management method and apparatus according to the exemplary embodiments of the present disclosure, a storage medium according to the exemplary embodiments of the present disclosure will be described with reference to fig. 9.
Referring to fig. 9, a program product 90 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user computing device, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
Exemplary electronic device
Having described the storage medium of the exemplary embodiment of the present disclosure, next, an electronic device of the exemplary embodiment of the present disclosure will be described with reference to fig. 10.
The electronic device 800 shown in fig. 10 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 800 is embodied in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, a bus 830 connecting different system components (including the memory unit 820 and the processing unit 810), and a display unit 840.
The storage unit stores program code executable by the processing unit 810, so that the processing unit 810 performs the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may perform the steps shown in fig. 1.
The memory unit 820 may include volatile memory units such as a random access memory unit (RAM)8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may include a data bus, an address bus, and a control bus.
The electronic device 800 may also communicate with one or more external devices 900 (e.g., a keyboard, a pointing device, a bluetooth device) through an input/output (I/O) interface 850. The electronic device 800 further comprises a display unit 840 connected to the input/output (I/O) interface 850 for display. The electronic device 800 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
It should be noted that although several modules or sub-modules of the multimedia conference management apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided among a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, and that the division into aspects is for convenience of description only; features in those aspects may be combined to advantage. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for multimedia conference management, comprising:
joining a multimedia conference based on conference management information;
in response to multimedia data transmission channel parameter information from a signaling server, generating initialization response information and sending the initialization response information to the signaling server; and
creating a target number of audio receiving channels based on the multimedia data transmission channel parameter information, so as to receive audio stream data in the multimedia conference by a media server through the target number of audio receiving channels.
2. The method for multimedia conference management according to claim 1, wherein said receiving audio stream data in the multimedia conference by a media server through a target number of said audio receiving channels comprises:
receiving, through the target number of audio receiving channels, the corresponding audio stream data forwarded by the media server based on the energy of each audio stream in the multimedia conference, the number of received audio streams being the same as the target number.
3. The multimedia conference management method according to claim 1 or 2, wherein
the target number of audio receiving channels is the same as the number of users in the active speaker list in the media server.
4. The method of multimedia conference management according to claim 1, further comprising:
acquiring user information of each participant of the multimedia conference from the signaling server, and storing the user information into a first mapping relation table.
5. The method of multimedia conference management according to claim 1 or 4, further comprising:
receiving a first protocol message through the audio receiving channel;
analyzing the first protocol message to acquire information source identification information of a message sender and channel identification information of the audio receiving channel;
constructing a mapping relation based on the information source identification information and the channel identification information, and storing the mapping relation in a second mapping relation table.
6. The method of multimedia conference management according to claim 5, further comprising:
receiving a second protocol message through the audio receiving channel;
parsing the second protocol message to acquire a contributing source identifier (CSRC), a real-time transport protocol (RTP) timestamp, and a network time protocol (NTP) timestamp, and constructing and storing a corresponding mapping relation in a third mapping relation table.
7. The method of multimedia conference management according to claim 6, further comprising:
performing audio-video synchronization processing on the received audio stream data and video stream data based on the first mapping relation table, the second mapping relation table and the third mapping relation table.
8. A multimedia conference management apparatus, comprising:
the conference response module is used for joining a multimedia conference based on conference management information;
the response module is used for responding to the multimedia data transmission channel parameter information from the signaling server, generating initialization response information and sending the initialization response information to the signaling server;
the channel establishing module is used for establishing a target number of audio receiving channels based on the multimedia data transmission channel parameter information, so as to receive the audio stream data in the multimedia conference by a media server through the target number of audio receiving channels.
9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the multimedia conference management method of any of claims 1-7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the multimedia conference management method of any of claims 1-7 via execution of the executable instructions.
CN202110404403.2A 2021-04-15 2021-04-15 Multimedia conference management method and device, storage medium and electronic equipment Active CN113114688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110404403.2A CN113114688B (en) 2021-04-15 2021-04-15 Multimedia conference management method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113114688A true CN113114688A (en) 2021-07-13
CN113114688B CN113114688B (en) 2023-03-24

Family

ID=76717105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110404403.2A Active CN113114688B (en) 2021-04-15 2021-04-15 Multimedia conference management method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113114688B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113542688A (en) * 2021-07-14 2021-10-22 杭州海康威视数字技术股份有限公司 Audio and video monitoring method, device, equipment, storage medium and system
CN113727061A (en) * 2021-11-01 2021-11-30 江苏怀业信息技术股份有限公司 Audio and video synchronization method for video conference system
CN114520687A (en) * 2022-02-17 2022-05-20 深圳震有科技股份有限公司 Audio data processing method, device and equipment applied to satellite system
CN114615458A (en) * 2022-05-10 2022-06-10 全时云商务服务股份有限公司 Method and device for real-time screen closing and rapid drawing in cloud conference

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7606181B1 (en) * 2003-06-30 2009-10-20 Nortel Networks Limited Apparatus, method, and computer program for processing audio information of a communication session
CN102664019A (en) * 2012-04-27 2012-09-12 深圳市邦彦信息技术有限公司 DSP sound mixing method and device for full-interactive conference
CN103580986A (en) * 2012-07-30 2014-02-12 华为终端有限公司 Real-time communication method, terminal equipment, real-time communication server and system
CN103875232A (en) * 2011-10-13 2014-06-18 瑞典爱立信有限公司 Apparatus and method for conferencing
US20190356701A1 (en) * 2018-05-17 2019-11-21 Vmware, Inc. Media optimization of browser-based real-time communications applications in a virtual desktop environment
CN112104836A (en) * 2020-11-23 2020-12-18 全时云商务服务股份有限公司 Audio mixing method, system, storage medium and equipment for audio server





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210929

Address after: 310000 Room 408, building 3, No. 399, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Netease Zhiqi Technology Co.,Ltd.

Address before: 310052 Room 301, Building No. 599, Changhe Street Network Business Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: HANGZHOU LANGHE TECHNOLOGY Ltd.

GR01 Patent grant