CN117459507A - Instant messaging method, system and related device

Info

Publication number: CN117459507A
Application number: CN202311496282.4A
Authority: CN
Inventors: 洪润琦; 周志伟
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2024-01-26

Abstract

The embodiment of the invention provides an instant messaging method, an instant messaging system and a related device, wherein the method comprises the following steps: the SFU server cluster receives audio streams sent by a plurality of clients in a current receiving time window; the audio stream carries the sound intensity corresponding to each RTP packet contained in the receiving time window; for each path of audio stream, determining the real-time sound intensity of the path of audio stream based on the sound intensity carried in each RTP packet contained in the path of audio stream, and selecting a first preset number of paths of audio streams from the paths of audio streams according to the sequence from high to low of the real-time sound intensity; and sending the selected audio streams to a plurality of clients so that the clients can mix the received audio streams to obtain target audio streams for playing. The resource utilization rate of the instant messaging system and the audio quality of the target audio stream can be simultaneously ensured.

Description

Instant messaging method, system and related device

Technical Field

The present invention relates to the field of communications technologies, and in particular, to an instant messaging method, system, and related devices.

Background

WebRTC (Web Real-Time Communication) has drastically changed real-time communication, enabling multi-user applications such as live streaming and video conferencing.

In current real-time communication technology, when many users participate in a session, they generate a large number of audio streams. These streams occupy excessive system bandwidth and consume excessive processing resources, making it difficult for the communication system to support large-scale instant messaging.

In addition, with a large number of users, the receiving end needs to process and play a large number of audio streams, which can make the played audio chaotic and poor in quality.

Thus, in such a scenario, it is a significant challenge to guarantee both audio quality and resource utilization.

Disclosure of Invention

The embodiment of the invention aims to provide an instant messaging method, an instant messaging system and a related device so as to ensure the resource utilization rate of an instant messaging system and the audio quality of a target audio stream at the same time. The specific technical scheme is as follows:

in a first aspect of the present invention, there is provided an instant messaging method, applied to an SFU server cluster in an instant messaging system, where the instant messaging system further includes a plurality of clients, and the plurality of clients are respectively communicatively connected to the SFU server cluster, where the method includes:

Receiving the audio streams sent by the plurality of clients in the current receiving time window; the audio stream carries the sound intensity corresponding to each RTP packet in the receiving time window;

for each path of audio stream, determining the real-time sound intensity of the path of audio stream based on the sound intensity carried in each RTP packet contained in the path of audio stream, and selecting a first preset number of paths of audio streams from the received audio streams according to the sequence of the real-time sound intensity from high to low;

and sending the selected audio streams to the plurality of clients so that the clients can mix the received audio streams to obtain target audio streams for playing.

Optionally, the SFU server nodes in the SFU server cluster are organized hierarchically, and the audio streams are transmitted level by level within the SFU server cluster. In the transmission order of the audio streams, the SFU server nodes in the SFU server cluster include at least: one or more levels of audio stream screening server nodes and one level of summary stream selection server node;

determining the real-time sound intensity of each path of audio stream based on the sound intensity carried in each RTP packet contained in the path of audio stream, and selecting the first preset number of paths of audio streams from the received audio streams according to the sequence of the real-time sound intensity from high to low, wherein the method comprises the following steps:

the audio stream screening server node receives the audio streams sent by the clients or by the audio stream screening server nodes at the previous level; for each audio stream, determines the real-time sound intensity of that audio stream based on the sound intensity carried in each RTP packet it contains; and, in descending order of real-time sound intensity, selects a first preset number of audio streams from the received audio streams and sends them to the SFU server node at the next level;

the summary stream selection server node receives the audio streams sent by the audio stream screening server nodes at the previous level; for each audio stream, determines the real-time sound intensity of that audio stream based on the sound intensity carried in each RTP packet it contains; and, in descending order of real-time sound intensity, selects a first preset number of audio streams from the received audio streams and sends them to the clients.
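The level-by-level selection described above can be illustrated with a minimal sketch. It assumes a single level of screening nodes, and it assumes that each audio stream has already been reduced to a pair of stream identifier and real-time sound intensity. The names `select_top` and `hierarchical_select` are hypothetical and do not appear in the embodiment.

```python
from typing import List, Tuple

# Hypothetical representation: (stream id, real-time sound intensity).
Stream = Tuple[str, float]


def select_top(streams: List[Stream], n: int) -> List[Stream]:
    """Return the n streams with the highest real-time sound intensity."""
    return sorted(streams, key=lambda s: s[1], reverse=True)[:n]


def hierarchical_select(groups: List[List[Stream]], n: int) -> List[Stream]:
    """One level of screening nodes followed by one summary node (assumed layout)."""
    forwarded: List[Stream] = []
    for group in groups:
        # Each screening node forwards only its own top-n streams.
        forwarded.extend(select_top(group, n))
    # The summary node picks the top-n among everything it received.
    return select_top(forwarded, n)


groups = [
    [("a", 10.0), ("b", 50.0), ("c", 30.0)],  # streams pulled by screening node 1
    [("d", 40.0), ("e", 20.0), ("f", 60.0)],  # streams pulled by screening node 2
]
result = hierarchical_select(groups, 2)
```

Because every stream in the overall top n is also in the top n of the node that received it, the summary node in this sketch ends up with the same streams as a single node selecting over all inputs would.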

Optionally, the RTP packet of the audio stream further carries a detection result mark for voice endpoint detection, and the detection result mark carried in each RTP packet is a voice mark or a non-voice mark;

before determining, for each audio stream, the real-time sound intensity of the audio stream based on the sound intensity carried in each RTP packet contained in the audio stream, the method further comprises:

And determining a non-voice audio stream from the received audio stream based on detection result marks carried in each RTP packet contained in the audio stream, and rejecting the non-voice audio stream.

Optionally, the method further comprises:

determining the maximum number of audio streams input to any stage of audio stream screening server node based on the number of clients in the communication system and the first preset number;

configuring the number of audio stream screening server nodes based on the maximum number and a preset pull stream upper limit; the preset pull stream upper limit characterizes the maximum number of audio streams that a single audio stream screening server node is allowed to receive.
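As a rough illustration of this sizing step, the sketch below assumes that the node count of a level is the smallest number of nodes for which no node exceeds the pull stream upper limit, and that each node of a level forwards at most the first preset number of streams to the next level. Both rules, and the function names, are assumptions made for illustration; the text above only states which quantities the configuration is based on.

```python
def screening_node_count(max_input_streams: int, pull_limit: int) -> int:
    """Smallest node count such that no node pulls more than pull_limit streams (assumed rule)."""
    if max_input_streams < 0 or pull_limit <= 0:
        raise ValueError("max_input_streams must be >= 0 and pull_limit must be > 0")
    # Integer ceiling division.
    return (max_input_streams + pull_limit - 1) // pull_limit


def next_level_max_input(node_count: int, first_preset_number: int) -> int:
    """Upper bound on streams forwarded to the next level (assumed rule)."""
    return node_count * first_preset_number


# Hypothetical figures: 1000 clients, pull stream upper limit 50, first preset number 3.
level1_nodes = screening_node_count(1000, 50)
level2_input = next_level_max_input(level1_nodes, 3)
level2_nodes = screening_node_count(level2_input, 50)
```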

Optionally, a plurality of candidate pull stream upper limits are preconfigured for the SFU server nodes, each candidate pull stream upper limit corresponds in advance to an expected number interval, and the method further includes:

determining, among the plurality of expected number intervals, the expected number interval in which the actual number of clients in the communication system falls, and setting the candidate pull stream upper limit corresponding to that expected number interval as the preset pull stream upper limit;

the expected number interval corresponding to any candidate pull stream upper limit is specified in advance in the following manner:

deriving, based on a load balancing condition, the mapping relation between the average pull stream number and the number of clients in the instant messaging system; the load balancing condition is that the average pull stream number of the SFU server nodes at every level is the same, where the average pull stream number is the ratio of the maximum number of audio streams input to the SFU server nodes of a level to the number of SFU server nodes of that level;

substituting the candidate pull stream upper limit, as the average pull stream number, into the mapping relation to obtain the expected number of clients in the instant messaging system corresponding to the candidate pull stream upper limit, and specifying, based on the expected number, the expected number interval of clients in the instant messaging system corresponding to the candidate pull stream upper limit.

In a second aspect of the present invention, there is also provided an instant messaging method applied to a client in an instant messaging system, where the instant messaging system includes a plurality of clients and an SFU server cluster, and the plurality of clients are respectively connected to the SFU server cluster in a communication manner, and the method includes:

transmitting an audio stream to the SFU server cluster, wherein each RTP packet of the audio stream carries the sound intensity corresponding to that RTP packet, so that the SFU server, for each audio stream received in the current receiving time window, determines the real-time sound intensity of the audio stream based on the sound intensity carried in each RTP packet of the audio stream contained in the current receiving time window, and, in descending order of real-time sound intensity, selects a first preset number of audio streams from the received audio streams and sends them to the plurality of clients;

And receiving the audio streams sent by the SFU server cluster, and performing audio mixing processing on the received audio streams to obtain a target audio stream for playing.

Optionally, the RTP packet of the audio stream also carries a unique identifier UID of the client sending the audio stream;

the audio mixing processing is performed on each received audio stream to obtain a target audio stream for playing, including:

determining whether the received audio streams include an audio stream sent by the client itself, that is, an audio stream whose RTP packets carry a UID consistent with the UID of the client itself;

if yes, removing that audio stream from the received audio streams, and carrying out audio mixing processing on the remaining audio streams to obtain the target audio stream;

if not, carrying out audio mixing processing on all received audio streams to obtain the target audio stream.
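The UID check above amounts to a simple filter applied before mixing. The following minimal sketch assumes the received audio streams are held in a dictionary keyed by the UID carried in their RTP packets; the function name `streams_to_mix` and the data layout are hypothetical.

```python
from typing import Dict, List


def streams_to_mix(received: Dict[str, List[int]], own_uid: str) -> Dict[str, List[int]]:
    """Drop the stream whose UID matches this client's own UID, if one is present."""
    return {uid: pcm for uid, pcm in received.items() if uid != own_uid}


# Hypothetical decoded frames keyed by sender UID.
received = {"u1": [100, 200], "u2": [300, 400], "u3": [500, 600]}
others = streams_to_mix(received, "u2")     # own stream was selected: it is removed
unchanged = streams_to_mix(received, "u9")  # own stream not selected: all streams are mixed
```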

In a third aspect of the present invention, there is also provided an instant messaging system, including an SFU server cluster and a plurality of clients, where the plurality of clients are respectively communicatively connected to the SFU server cluster;

the client is configured to send an audio stream to the SFU server cluster; each RTP packet of the audio stream carries sound intensity corresponding to the RTP packet;

The SFU server cluster is configured to receive audio streams sent by the plurality of clients in a current receiving time window, determine real-time sound intensity of the audio stream based on sound intensity carried in each RTP packet contained in the receiving time window of each audio stream, and select a first preset number of audio streams from the received audio streams according to the order of the real-time sound intensity from high to low and send the first preset number of audio streams to the plurality of clients;

the client is further configured to mix the received audio streams to obtain a target audio stream for playing.

In a fourth aspect of the present invention, there is also provided an instant messaging device applied to an SFU server cluster in an instant messaging system, where the instant messaging system further includes a plurality of clients, the plurality of clients are respectively communicatively connected to the SFU server cluster, and the device includes:

the receiving module is used for receiving the audio streams sent by the clients in the current receiving time window; the audio stream carries the sound intensity corresponding to each RTP packet in the receiving time window;

The selection module is used for determining the real-time sound intensity of each path of audio stream based on the sound intensity carried in each RTP packet contained in the path of audio stream, and selecting a first preset number of paths of audio streams from the received audio streams according to the sequence of the real-time sound intensity from high to low;

and the first sending module is used for sending the selected audio streams to the plurality of clients so that the clients can mix the received audio streams to obtain target audio streams for playing.

In a fifth aspect of the present invention, there is also provided an instant messaging device applied to a client in an instant messaging system, where the instant messaging system includes a plurality of clients and an SFU server cluster, and the plurality of clients are respectively communicatively connected to the SFU server cluster, where the device includes:

the second sending module is configured to send an audio stream to the SFU server cluster, where each RTP packet of the audio stream carries a sound intensity corresponding to the RTP packet, so that the SFU server determines, for each audio stream received in a current receiving time window, a real-time sound intensity of the audio stream based on the sound intensities carried by the audio stream in each RTP packet included in the receiving time window, and selects, according to the order of the real-time sound intensity from high to low, a first preset number of audio streams from the received audio streams, and sends the first preset number of audio streams to the plurality of clients;

And the stream mixing module is used for receiving the audio streams sent by the SFU server cluster, and carrying out audio mixing processing on the received audio streams to obtain a target audio stream for playing.

In a sixth aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing any one of the instant messaging methods when executing the program stored in the memory.

In yet another aspect of the implementation of the present invention, there is also provided a computer readable storage medium, in which a computer program is stored, the computer program implementing any one of the above-mentioned instant messaging methods when executed by a processor.

In yet another aspect of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the instant messaging methods described above.

In the instant messaging method provided by the embodiment of the invention, the SFU server cluster is used for receiving the audio stream sent by each client in the instant messaging system, and each RTP packet of the audio stream carries the sound intensity corresponding to the RTP packet. The SFU server cluster determines the real-time sound intensity of each path of audio stream based on the sound intensity carried in the RTP packet of each path of audio stream, selects the first preset number of paths of audio streams from the received audio streams according to the sequence from high to low of the real-time sound intensity, and sends the selected audio streams to each client in the instant messaging system, so that the client performs mixing processing on the received paths of audio streams to obtain a target audio stream for playing.

After receiving the audio streams sent by the client terminal, the SFU server cluster can determine the real-time sound intensity of each path of audio streams based on the sound intensity carried in the RTP packet of each path of audio streams, and selectively forward the audio streams with higher real-time sound intensity by screening the audio streams, so that the received audio streams are prevented from being fully forwarded by the SFU server, and the processing load and bandwidth consumption of the SFU server cluster can be reduced. On the basis of low processing burden of the SFU server cluster, the instant messaging system can also support a larger number of clients to participate in the instant messaging process at the same time, and the interactive experience in the instant messaging process is improved.

In a practical application scenario, the human ear can distinguish only a limited number of audio sources at once. Because the SFU server cluster selectively forwards the audio streams with higher sound intensity, the client mixes only audio streams of higher audio quality and thereby obtains a clearer and richer target audio stream for output.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a schematic diagram of an instant messaging system according to an embodiment of the present invention;

fig. 2 is a schematic flow chart of an instant messaging method according to an embodiment of the present invention;

fig. 3 is another schematic diagram of an instant messaging system according to an embodiment of the present invention;

fig. 4 is yet another schematic diagram of an instant messaging system through which an embodiment of the present invention may be implemented;

fig. 5 is another flow chart of the instant messaging method according to the embodiment of the present invention;

fig. 6 is a schematic diagram of an extension header of an RTP packet according to an embodiment of the invention;

fig. 7 is a schematic structural diagram of an instant messaging device according to an embodiment of the present invention;

fig. 8 is a schematic diagram of another structure of an instant messaging device according to an embodiment of the present invention;

fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.

In order to solve the problem that current instant messaging methods cannot ensure both audio quality and resource utilization in multi-user communication scenarios, the embodiment of the invention provides an instant messaging method, which is applied to an SFU (Selective Forwarding Unit) server cluster in an instant messaging system.

Fig. 1 is a schematic diagram of an instant messaging system provided in an embodiment of the present invention. Referring to fig. 1, the system includes an SFU server cluster, client 1, client 2, client 3, client 4, and client 5. The instant messaging system can be used for multi-person instant voice communication; as one example, the system may be used for multi-person instant voice conferences.

The client is a terminal device participating in instant messaging, and a communication connection is established between the client and the SFU server cluster, so that the client can perform bidirectional communication with the SFU server cluster.

The SFU server cluster specifically may include a plurality of SFU server nodes, and is specifically configured to forward the audio stream.

Fig. 2 is a schematic flow chart of an instant messaging method according to an embodiment of the present invention, referring to fig. 2, the method specifically includes the following steps:

step S201: receiving audio streams sent by a plurality of clients in a current receiving time window; each RTP (Real-time Transport Protocol, real-time transmission protocol) packet contained in the audio stream in the receiving time window carries the sound intensity corresponding to the RTP packet;

Step S202: for each path of audio stream, determining the real-time sound intensity of the path of audio stream based on the sound intensity carried in each RTP packet contained in the path of audio stream, and selecting a first preset number of paths of audio streams from the received audio streams according to the sequence from high to low of the real-time sound intensity;

step S203: and sending the selected audio streams to a plurality of clients so that the clients can mix the received audio streams to obtain target audio streams for playing.

The instant messaging method provided by the embodiment of the invention is specifically described below.

Specifically, in the communication process, each client in the instant communication system may send a path of audio stream to the SFU server cluster, and each path of audio stream is specifically sent to the SFU server cluster in the form of an RTP packet.

In the embodiment of the present invention, each RTP packet in the audio stream carries the sound intensity corresponding to that RTP packet, where the sound intensity is understood as the average acoustic energy passing per unit time through a unit area perpendicular to the direction of sound propagation, also called the average acoustic energy flux density. As one example, sound intensity may be characterized by sound pressure level.

Specifically, each RTP packet consists of two parts, a header (Header) and payload data (Payload). The payload data is the audio data itself, and the header holds other data related to the audio data; for the specific data filled into the header, reference may be made to the related art. The header of an RTP packet includes a fixed header and an extension header, and the sound intensity involved in the embodiment of the invention may be carried in the extension header.

Before sending an RTP packet to the SFU server cluster, the client may analyze the audio data corresponding to the RTP packet to obtain the sound intensity of the RTP packet, so that the RTP packet sent to the SFU server cluster carries the sound intensity. For this analysis, reference may be made to the related art; the embodiment of the present invention does not limit the specific process.
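One common way for a client to obtain a per-packet level is the RMS level of the PCM frame in decibels relative to full scale. It is sketched here as an assumption rather than as the analysis used by the embodiment. The sketch assumes 16-bit mono PCM samples; the function name `frame_level_db` and the floor value of -127 dB are hypothetical.

```python
import math
from typing import Sequence

FULL_SCALE = 32768.0  # magnitude of the most negative 16-bit sample


def frame_level_db(samples: Sequence[int], floor_db: float = -127.0) -> float:
    """RMS level of one PCM frame in dB relative to full scale (0 dB is the loudest)."""
    if not samples:
        return floor_db
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        # Digital silence: report the floor instead of taking log10(0).
        return floor_db
    level = 10.0 * math.log10(mean_square / (FULL_SCALE ** 2))
    return max(floor_db, level)


quiet = frame_level_db([1000] * 160)
loud = frame_level_db([16384] * 160)  # half of full scale, roughly -6 dB
```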

It should be appreciated that in the application scenario of the embodiments of the present invention, the transmission of an audio stream is a continuous process, and the volume of the audio stream may change in real time. That is, the client continuously sends RTP packets to the SFU server cluster, and the sound intensity carried in each RTP packet may differ. Therefore, in the embodiment of the invention, the length of the receiving time window can be set based on actual requirements, and the received audio streams are processed with the time window as the granularity. For an audio stream received by the SFU server cluster during the current receiving time window, the sound intensity of the audio stream within that window, referred to as the real-time sound intensity of the audio stream, may be determined based on the sound intensities carried in the RTP packets of the audio stream contained in the window.

The specific method for determining the real-time sound intensity of each audio stream may be selected based on actual requirements, which is not limited by the embodiment of the present invention. As an example, if the length of the receiving time window is T and the SFU server cluster receives x RTP packets of one audio stream within the current period T, the average of the sound intensities in the x RTP packets after smoothing filtering may be used as the current real-time sound intensity of the audio stream.
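The example above does not specify the smoothing filter. The minimal sketch below assumes exponential smoothing with a hypothetical coefficient `alpha`, followed by the mean over the window. The function name `realtime_intensity` is likewise an assumption.

```python
from typing import List, Sequence


def realtime_intensity(packet_levels: Sequence[float], alpha: float = 0.5) -> float:
    """Mean of the exponentially smoothed per-packet sound intensities of one window."""
    if not packet_levels:
        raise ValueError("window contains no RTP packets")
    smoothed: List[float] = []
    current = packet_levels[0]  # seed the filter with the first packet's value
    for level in packet_levels:
        # Exponential smoothing (assumed filter): blend new value with running state.
        current = alpha * level + (1.0 - alpha) * current
        smoothed.append(current)
    return sum(smoothed) / len(smoothed)


steady = realtime_intensity([10.0, 10.0, 10.0])
step = realtime_intensity([0.0, 8.0], alpha=0.5)  # smoothed values are 0.0 and 4.0
```

With `alpha` set to 1.0 the filter has no effect and the result is the plain average of the x values.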

After determining the real-time sound intensity of each path of audio stream, the SFU server cluster can select a first preset number of paths of audio streams from the received audio streams according to the sequence from high to low of the real-time sound intensity, wherein the first preset number can be selected according to actual requirements, and the embodiment of the invention does not limit the method. As an example, the first preset number may be 3.
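The selection step can be sketched as a top-N query over the per-stream real-time sound intensities. The sketch assumes these values are held in a dictionary keyed by a stream identifier; `select_loudest` and the sample values are hypothetical.

```python
import heapq
from typing import Dict, List


def select_loudest(intensity_by_stream: Dict[str, float], n: int = 3) -> List[str]:
    """Return the ids of the n streams with the highest real-time sound intensity."""
    # heapq.nlargest returns the keys ordered from highest to lowest intensity.
    return heapq.nlargest(n, intensity_by_stream, key=intensity_by_stream.get)


intensities = {"s1": 10.0, "s2": 20.0, "s3": 30.0, "s4": 40.0, "s5": 50.0}
selected = select_loudest(intensities, 3)
```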

After a first preset number of audio streams are selected from the ordered audio streams, the SFU server cluster sends the selected audio streams to clients in the system. Specifically, the SFU server cluster may send the selected audio stream to each client in the instant messaging system, or a publish-subscribe mode may also be adopted between the client and the SFU server cluster, where the SFU server cluster may specifically send the selected audio stream to each client subscribing to the audio stream, and the embodiment of the present invention is not limited thereto.

Specifically, each client may be configured with an audio processing module, so that after receiving an audio stream sent by the SFU server cluster, the client may perform audio mixing processing on each received audio stream to obtain a target audio stream for playing. The audio mixing specifically integrates audio streams from different sources into a single-channel target audio stream, so that in the playing process of the target audio stream, the client terminal can hear audio contents issued by a plurality of different clients.

Specifically, after receiving the audio streams sent by the SFU server cluster, the client may decode RTP packets of each audio stream, and mix specific audio data therein to obtain a target audio stream. In this process, specific mixing means may refer to the content in the related art, and the embodiment of the present invention is not limited thereto.
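Since the mixing means is left to the related art, the following is only one possible sketch: additive mixing with hard clipping. It assumes the streams have already been decoded to time-aligned 16-bit mono PCM frames at the same sample rate; `mix_frames` is a hypothetical name.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames: Sequence[Sequence[int]]) -> List[int]:
    """Sum aligned 16-bit PCM frames sample by sample and clip to the int16 range."""
    if not frames:
        return []
    # Only mix the part that every frame covers.
    length = min(len(f) for f in frames)
    mixed: List[int] = []
    for i in range(length):
        total = sum(f[i] for f in frames)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


target = mix_frames([[1000, -2000], [500, 500]])
clipped = mix_frames([[30000], [30000]])
```

Hard clipping is the simplest way to keep the sum in range; a production mixer would more likely apply gain control or a limiter.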

Taking fig. 1 as an example, client 1 sends audio stream 1 to the SFU server cluster, client 2 sends audio stream 2, client 3 sends audio stream 3, client 4 sends audio stream 4, and client 5 sends audio stream 5. Each RTP packet of the audio stream sent by each client carries the sound intensity corresponding to that RTP packet.

After receiving the audio streams 1 to 5, the SFU server cluster may determine the real-time sound intensity of each audio stream according to the sound intensities carried in its RTP packets. Suppose that, in descending order of sound intensity, the audio streams are: audio stream 5, audio stream 4, audio stream 3, audio stream 2, audio stream 1. If the first preset number is 3, the SFU server cluster may select the audio stream 5, the audio stream 4, and the audio stream 3, and send the selected audio streams to each of the clients 1 to 5.

After each of the clients 1 to 5 receives the audio stream 5, the audio stream 4, and the audio stream 3 sent by the SFU server cluster, it decodes and mixes these audio streams to obtain a mixed target audio stream for audio output. Taking the client 1 as an example, the client 1 may decode and mix the received audio stream 5, audio stream 4, and audio stream 3 to obtain the target audio stream and output it, so that the user of the client 1 hears the audio content sent by the client 5, the client 4, and the client 3.

In addition, the instant messaging method provided by the present application is based on an instant messaging system with an SFU (Selective Forwarding Unit) architecture and improves its communication process. Because mixing is performed on the client side, each client only needs to process a small number of audio streams. Compared with configuring an MCU (Multipoint Control Unit) server in the instant messaging system to perform the mixing, processing efficiency is higher, dependence on the stability of an MCU server is avoided, and the robustness of the instant messaging system is improved. The delay of the instant messaging system is also lower: the client can decode and mix in real time after receiving the audio streams, which ensures low latency and provides users with a seamless, contextually coherent audio experience.

In one embodiment of the present invention, the RTP packets of the audio stream further carry a detection result mark for voice endpoint detection, and the detection result mark carried in each RTP packet is a voice mark or a non-voice mark;

before determining the real-time sound intensity of each audio stream based on the sound intensity carried in each RTP packet included in the audio stream, the method further includes:

and determining a non-voice audio stream from the received audio stream based on detection result marks carried in each RTP packet contained in the audio stream, and eliminating the non-voice audio stream.

In a practical application scenario, the audio streams that some clients send to the SFU server cluster may contain only non-voice signals, such as noise. If the SFU server cluster ranks these noise audio streams by sound intensity and forwards them, the forwarding resources and bandwidth of the SFU server cluster are wasted. To solve this problem, in the embodiment of the present invention, before sending an RTP packet to the SFU server cluster, the client may perform voice endpoint detection (Voice Activity Detection, VAD) on the corresponding audio data, determine whether the audio data is a voice signal or a non-voice signal, and add a corresponding detection result mark to the RTP packet according to the detection result. The specific voice activity detection algorithm may be selected based on actual requirements, which is not limited by the embodiment of the present invention.

As an example, the detection result flag in each RTP packet may specifically be 0 or 1, where 0 is specifically a non-voice flag, which indicates that the audio data carried in the RTP packet is a non-voice signal, and 1 is a voice flag, which indicates that the audio data carried in the RTP packet is a voice signal. As an example, the detection result flag may be carried in particular within an extension header of the RTP packet.
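To make the idea of carrying both values in the extension header concrete, the sketch below packs the detection result mark and a sound intensity value into a single byte, with the mark in the top bit and the value in the lower 7 bits. This layout is loosely modelled on the one-bit flag plus 7-bit level layout of the RFC 6464 audio level extension and is an assumption; it is not the layout of fig. 6. The 0 to 127 intensity scale and the function names are also hypothetical.

```python
from typing import Tuple


def pack_audio_info(is_voice: bool, intensity: int) -> int:
    """Pack the detection result mark (top bit) and a 7-bit intensity into one byte."""
    if not 0 <= intensity <= 127:
        raise ValueError("intensity must fit in 7 bits")
    return (int(is_voice) << 7) | intensity


def unpack_audio_info(byte: int) -> Tuple[bool, int]:
    """Recover (is_voice, intensity) from a byte built by pack_audio_info."""
    return bool(byte & 0x80), byte & 0x7F


packed = pack_audio_info(True, 100)
flag, value = unpack_audio_info(packed)
```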

In this case, the SFU server cluster may identify, for the audio stream received in the current receiving time window, a non-voice audio stream from the received audio streams according to the detection result marks carried in the RTP packets included in the receiving time window by each audio stream, and reject the non-voice audio stream, and the remaining audio streams are voice audio streams. And then, the real-time sound intensity of each voice audio stream can be determined based on the sound intensity carried in the RTP packet contained in each voice audio stream, and the first preset number of audio streams are selected from the voice audio streams according to the sequence from high to low of the real-time sound intensity and sent to each client.

How to recognize the voice audio stream and the non-voice audio stream based on the detection result mark can be selected based on actual requirements, which is not limited by the embodiment of the present invention. As an example, if the detection result flag in each RTP packet included in the current receiving time window of a certain audio stream is a non-voice flag, the certain audio stream may be considered as a non-voice audio stream, otherwise, the certain audio stream is a voice audio stream.

For example, after receiving audio streams 1 to 5, if the detection result flags carried in the RTP packets of audio stream 5 are all 0, the SFU server cluster identifies audio stream 5 as a non-voice audio stream and audio streams 1 to 4 as voice audio streams. The SFU server cluster then determines the real-time sound intensity of each of audio streams 1 to 4 according to the sound intensity carried in its RTP packets, selects the first 3 audio streams in descending order of real-time sound intensity, namely audio streams 2 to 4, and sends audio streams 2 to 4 to the plurality of clients.
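
The filtering and selection described above can be illustrated with a minimal sketch. It assumes that the real-time sound intensity of a stream is the mean of the per-packet sound intensities in the window (the text does not fix this rule here), and the function names and all intensity values are hypothetical, chosen only so that stream 5 is loud but entirely non-voice.

```python
from statistics import mean

NON_VOICE, VOICE = 0, 1  # detection result flags as described in the text


def is_voice_stream(packets):
    """A stream is non-voice only if every packet in the window carries the non-voice flag."""
    return any(flag == VOICE for flag, _ in packets)


def real_time_intensity(packets):
    # Assumption: real-time sound intensity = mean of the per-packet sound intensities.
    return mean(intensity for _, intensity in packets)


def select_streams(window, first_preset_number):
    """window maps stream id -> list of (detection_flag, sound_intensity) for one time window."""
    voiced = {sid: pkts for sid, pkts in window.items() if is_voice_stream(pkts)}
    ranked = sorted(voiced, key=lambda sid: real_time_intensity(voiced[sid]), reverse=True)
    return ranked[:first_preset_number]


# Hypothetical window: stream 5 has the highest intensity but only non-voice packets.
window = {
    1: [(1, 20), (0, 18), (1, 22)],
    2: [(1, 60), (1, 62), (1, 58)],
    3: [(1, 70), (1, 72), (0, 65)],
    4: [(1, 50), (1, 55), (1, 51)],
    5: [(0, 90), (0, 88), (0, 91)],
}
selected = select_streams(window, 3)
print(sorted(selected))
```

With these made-up values, stream 5 is rejected before ranking, so it cannot displace a voice stream even though it is the loudest.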

In the embodiment of the invention, the RTP packets of an audio stream carry the detection result flag of voice activity detection, so that the SFU server cluster can reject non-voice audio streams based on the detection result flag before sorting and screening the received audio streams by sound intensity. This reduces the processing burden of the SFU server cluster, prevents non-voice audio streams from being forwarded to the clients by the SFU server cluster, and ensures the audio quality of the target audio stream.

In one embodiment of the present invention, the SFU server nodes in the SFU server cluster adopt a hierarchical structure, and audio streams are transmitted stage by stage within the SFU server cluster. In the transmission order of the audio streams, the SFU server nodes in the SFU server cluster at least include: one or more stages of audio stream screening server nodes and one stage of summary stream selection server node;

The foregoing determining, for each audio stream, the real-time sound intensity of the audio stream based on the sound intensity carried in each RTP packet included in the audio stream, and selecting, according to the order in which the real-time sound intensity is from high to low, a first preset number of audio streams from the received audio streams, including:

the audio stream screening server node receives the audio streams sent by clients and/or by the audio stream screening server nodes of the previous stage, determines, for each audio stream, the real-time sound intensity of that audio stream based on the sound intensity carried in each RTP packet it contains, and selects a first preset number of audio streams from the received audio streams in descending order of real-time sound intensity to send to the SFU server node of the next stage;

the summary stream selection server node receives the audio streams sent by each audio stream screening server node of the previous stage, determines the real-time sound intensity of each audio stream based on the sound intensity carried in each RTP packet it contains, and selects a first preset number of audio streams from the received audio streams in descending order of real-time sound intensity to send to the clients.

In the embodiment of the invention, the audio streams entering the SFU server cluster are screened stage by stage by SFU server nodes of several different levels in the SFU server cluster, where the SFU server nodes include one or more stages of audio stream screening server nodes and one stage of summary stream selection server node.

During forwarding, each audio stream screening server node in the SFU server cluster determines the real-time sound intensity of each audio stream it receives based on the sound intensity carried in the RTP packets of that audio stream, selects the first preset number of audio streams from the received audio streams in descending order of real-time sound intensity, and sends them to the SFU server node of the next stage. The summary stream selection server node receives the audio streams sent by each audio stream screening server node of the previous stage and performs summary stream selection on them, which ensures that the SFU server cluster as a whole finally selects the first preset number of audio streams with the highest sound intensity among the audio streams sent by all clients.

In the embodiment of the present invention, as for which SFU server node of the next stage a given SFU server node sends its selected audio streams to, reference may be made to the related art, and the embodiment of the present invention is not limited thereto. As one example, each SFU server node may maintain a forwarding table based on information such as the geographical location of the nodes and the link state to the next-hop nodes, and determine the next-hop node for an audio stream based on the forwarding table.

It should be noted that the SFU server nodes in the SFU server cluster according to the embodiments of the present invention are not limited to the above-described audio stream screening server nodes and summary stream selection server nodes. Specifically, an actual SFU server cluster may have more levels of SFU server nodes for forwarding the audio streams; the screening of the audio streams may be achieved by the first several levels of SFU server nodes, and the remaining SFU server nodes may directly forward the audio streams they receive.

As an example, the SFU server cluster may specifically use a Push-Relay-Play architecture, where, in order of propagation of the audio stream, the SFU server cluster sequentially includes: a first-level Push server node, a first-level or multi-level Relay server node, and a first-level Play server node. In this case, the screening of the audio stream may be specifically achieved through the first three stages of SFU server nodes, where the audio stream screening server may specifically include a Push server node located at the first stage and a Relay server node located at the second stage, and the summary stream selection server node is specifically a Relay/Play server node located at the third stage.

Fig. 3 is another schematic diagram of an instant messaging system provided in an embodiment of the present invention, and a specific process of audio stream screening is described below with reference to fig. 3. The system includes 1000 clients, from client 1 to client 1000, and the SFU server nodes that perform audio stream screening in the SFU server cluster comprise three levels: Push, Relay and Relay/Play. Taking a first preset number of 3 as an example, during communication client 1 to client 1000 send their respective audio streams to the Push server nodes, and each Push server node selects the first 3 audio streams from the audio streams it receives, based on their real-time sound intensity, and sends them to the Relay server nodes. The Push level shown in fig. 3 includes 34 SFU server nodes, from SFU1-1 to SFU1-34, so a total of 102 audio streams are selected and sent to the Relay server nodes.

After receiving the audio streams sent by the Push server nodes of the previous stage, each Relay server node likewise selects 3 audio streams from the audio streams it receives, based on their real-time sound intensity, and sends them to the Relay/Play server node. The Relay level shown in fig. 3 includes 4 SFU server nodes, from SFU2-1 to SFU2-4, so a total of 12 audio streams are selected from the 102 received audio streams and sent to SFU3-1 shown in the figure. SFU3-1 selects the first 3 audio streams from the 12 received audio streams in descending order of real-time sound intensity and sends the selected 3 audio streams to each of clients 1 to 1000.
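
The staged screening of fig. 3 can be simulated with a short sketch. It assumes, purely for illustration, that streams are spread round-robin over the nodes of a level (the text leaves the assignment to the forwarding tables) and that each stream is already reduced to a single real-time sound intensity value; the helper names are hypothetical and the intensities are random.

```python
import random


def top_n(streams, n):
    """Keep the n streams with the highest real-time sound intensity."""
    return sorted(streams, key=lambda s: s[1], reverse=True)[:n]


def split_evenly(streams, node_count):
    # Assumption: streams are distributed round-robin over the nodes of a level.
    return [streams[i::node_count] for i in range(node_count)]


def screen_level(streams, node_count, n):
    """Each node of a level screens its own share; the survivors go to the next level."""
    survivors = []
    for share in split_evenly(streams, node_count):
        survivors.extend(top_n(share, n))
    return survivors


random.seed(0)
# (client id, real-time sound intensity); distinct values keep the ranking unambiguous.
intensities = random.sample(range(100_000), 1000)
clients = [(cid, level) for cid, level in enumerate(intensities, start=1)]

after_push = screen_level(clients, 34, 3)     # SFU1-1 .. SFU1-34
after_relay = screen_level(after_push, 4, 3)  # SFU2-1 .. SFU2-4
final = top_n(after_relay, 3)                 # SFU3-1

print(len(after_push), len(after_relay), len(final))  # 102 12 3
print(final == top_n(clients, 3))                     # True
```

The last line checks the property the hierarchy relies on: a stream that ranks among the overall top 3 also ranks among the top 3 of whichever node it passes through, so stage-by-stage screening returns the same streams as a single global sort.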

In the embodiment of the invention, the SFU server nodes in the SFU server cluster specifically adopt a hierarchical structure, and each SFU server node can screen the audio stream received by the SFU server node and forward the screened audio stream to the SFU server node of the next stage. Under the condition that a first preset number of audio streams with highest real-time sound intensity are required to be selected, the audio streams input to the SFU server cluster are screened step by step based on the real-time sound intensity of the audio streams through SFU server nodes of different levels, so that the processing burden of the SFU server nodes can be further reduced, the bandwidth is saved, and an instant messaging system can support a larger number of clients to participate in the instant messaging process at the same time.

In one embodiment of the present invention, the instant messaging method further includes:

for the audio stream screening server nodes of any stage, determining the maximum number of audio streams input to that stage of audio stream screening server nodes based on the number of clients in the communication system and the first preset number;

Specifically, in order to ensure reasonable utilization of server resources in the SFU server cluster, a preset pull-stream upper limit of a single SFU server node may be determined in advance, and on this basis the number of SFU server nodes at each level in the SFU server cluster is configured in combination with the number of clients in the instant messaging system. The preset pull-stream upper limit may be selected based on actual requirements, which is not limited by the embodiment of the present invention. As one example, the preset pull-stream upper limit may be determined based on the actual computing capability of the SFU server node.

The preset pull-stream upper limit can be understood as a maximum number of audio streams received by a single audio stream screening server node, and on the basis, the total number of audio streams which are transmitted to the single SFU server node in parallel should not exceed the preset pull-stream upper limit in the actual application process of the instant messaging system.

Once the preset pull-stream upper limit is set, the number of audio stream screening server nodes of any stage can be configured based on the preset pull-stream upper limit and the maximum number of audio streams input to that stage. As an example, the ratio of the maximum number to the preset pull-stream upper limit may be calculated and rounded up, and the rounded result is the number of audio stream screening server nodes to be configured for that stage.

For the audio stream screening server nodes of the first stage, for example the Push server nodes in fig. 3, each client can send one audio stream, so the maximum number of audio streams input to the first stage is the number of clients in the instant messaging system. For the audio stream screening server nodes of any other stage, for example the Relay server nodes in fig. 3, each audio stream screening server node of the previous stage selects a first preset number of audio streams to forward, so the maximum number of audio streams input to that stage is the product of the number of audio stream screening server nodes of the previous stage and the first preset number.

When deploying the SFU server nodes in the SFU server cluster, the number of first-stage audio stream screening server nodes can be determined based on the number of clients in the instant messaging system and the preset pull-stream upper limit. Once the number of first-stage audio stream screening server nodes is determined, the maximum number of audio streams input to the second stage can be determined in combination with the first preset number, and the number of second-stage audio stream screening server nodes can then be determined in combination with the preset pull-stream upper limit. The number of audio stream screening server nodes of each subsequent stage is determined in the same way, and the SFU server cluster can then be deployed on this basis.
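
The sizing rule described above can be written compactly. The symbols are introduced here only for illustration and do not appear in the original text: N is the number of clients, K the first preset number, P the preset pull-stream upper limit, M_i the maximum number of audio streams input to stage i, and n_i the number of audio stream screening server nodes of stage i. Rounding up is assumed, as in the worked example of this embodiment.

$$M_1 = N, \qquad n_i = \left\lceil \frac{M_i}{P} \right\rceil, \qquad M_{i+1} = K \cdot n_i$$

For instance, with N = 1000, P = 30 and K = 3, this gives n_1 = 34, M_2 = 102 and n_2 = 4.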

As an example, in practical applications, a user typically needs to create a session room, such as a voice conference room, when conducting instant messaging. Therefore, in the embodiment of the invention, the number of clients in the instant messaging system can be determined based on the number of participants at the time the room is created or on the session room size specified by the room owner, and on this basis the number of audio stream screening server nodes of each level is determined in combination with the preset pull-stream upper limit, so that the SFU server cluster is deployed.

Continuing the previous example, in the instant messaging system shown in fig. 3 the number of clients is 1000. If the preset pull-stream upper limit is 30, the ratio of 1000 to 30 rounded up is 34, so 34 Push server nodes are deployed, namely SFU1-1 to SFU1-34 in the figure. When the first preset number is 3, the 34 Push server nodes select 102 audio streams to output to the Relay server nodes, and the ratio of 102 to 30 rounded up is 4, so 4 Relay server nodes are deployed, namely SFU2-1 to SFU2-4 in the figure. The 4 Relay server nodes select 12 audio streams to output to one Relay/Play server node, namely SFU3-1 in the figure, and SFU3-1 selects 3 audio streams to output to each client.
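
The node counts in this example can be reproduced with a small planning sketch. The function name is hypothetical, and the stopping rule is an assumption not stated in the text: screening levels are added until the remaining streams fit within the pull-stream upper limit of a single summary stream selection server node.

```python
def plan_screening_levels(num_clients, pull_limit, first_preset_number):
    """Return (node count per screening level, streams reaching the summary node)."""
    node_counts = []
    incoming = num_clients  # the first level receives one audio stream per client
    # Assumption: stop once a single summary node can pull all remaining streams.
    while incoming > pull_limit:
        nodes = -(-incoming // pull_limit)  # integer ceiling of incoming / pull_limit
        outgoing = nodes * first_preset_number
        if outgoing >= incoming:
            # Guard: with these parameters another level would not reduce the load.
            raise ValueError("pull_limit is too small relative to first_preset_number")
        node_counts.append(nodes)
        incoming = outgoing
    return node_counts, incoming


levels, summary_input = plan_screening_levels(1000, 30, 3)
print(levels, summary_input)  # [34, 4] 12
```

The integer ceiling avoids floating-point rounding for large client counts, and the guard prevents an endless loop when the first preset number is close to the pull-stream upper limit.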

In the embodiment of the invention, a preset pull-stream upper limit is set in advance for each SFU server node in the SFU server cluster. When the SFU server cluster is actually deployed, the number of audio stream screening server nodes of any level is configured based on the maximum number of audio streams input to that level and the preset pull-stream upper limit, so that the audio streams input to any level of audio stream screening server nodes can be distributed evenly across the SFU server nodes of that level, and the resource utilization rate of the SFU server nodes in the SFU server cluster is ensured.

In one embodiment of the present invention, a plurality of candidate pull-stream upper limits are preconfigured for the SFU server node, each candidate pull-stream upper limit corresponds in advance to an expected number interval, and the instant messaging method further includes:

determining an expected number interval in which the actual number of clients in the communication system is located in a plurality of expected number intervals, and setting a candidate pull stream upper limit corresponding to the expected number interval as a preset pull stream upper limit;

Based on the load balancing condition, a mapping relation between the average number of pull streams and the number of clients in the instant messaging system is derived. The load balancing condition is as follows: the average number of pull streams is consistent across the SFU server nodes of every stage, where the average number of pull streams of a stage is the ratio of the maximum number of audio streams input to that stage of SFU server nodes to the number of SFU server nodes of that stage.

In practical application, in order to ensure the resource utilization rate of each SFU server node in the SFU server cluster, load balancing among SFU server nodes of different levels should be ensured, that is, the average number of pull streams of the SFU server nodes of each level should be consistent; this is the load balancing condition in the embodiment of the invention. For the audio stream screening server nodes of any level, the average number of pull streams of that level is the ratio of the maximum number of audio streams input to that level to the number of SFU server nodes of that level. For the summary stream selection server node, the audio streams output by every audio stream screening server node of the previous stage must be summarized and sorted at that node, so its single-node number of pull streams is essentially the product of the number of audio stream screening server nodes of the previous stage and the first preset number.

It should be noted that, in the embodiment of the present invention, the summary stream selection server node is specifically configured to implement summary stream selection of the audio streams in the SFU server cluster. Therefore, if the preset pull-stream upper limit is specified arbitrarily, the bandwidth utilization of the summary stream selection server node may be low.

Specifically, for any stage of audio stream screening server nodes other than the first, and for the summary stream selection server node, the audio streams input to that stage are the audio streams screened by the audio stream screening server nodes of the previous stage. Thus, for the server nodes of any two adjacent levels in the SFU server cluster, an equation can be determined based on the load balancing condition described above. In addition, the load balancing condition requires that the average number of pull streams of the SFU server nodes at each stage be consistent, so the average number of pull streams of each stage in these equations can be unified into one variable.

Combining these equations and eliminating the numbers of SFU server nodes, which serve as intermediate parameters in the load balancing condition, gives the mapping relation between the average number of pull streams and the number of clients. That is, for the SFU server nodes of different levels in the instant messaging system according to the embodiment of the present invention to meet the load balancing requirement, the average number of pull streams and the number of clients need to satisfy this mapping relation.

In the embodiment of the invention, because the number of SFU server nodes is determined based on the preset pull-stream upper limit, the actual average number of pull streams of the SFU server nodes at each level is close to the preset pull-stream upper limit when the instant messaging system is actually in use. Therefore, in order to achieve load balancing among the SFU server nodes of all levels and ensure the resource utilization rate of the SFU server nodes in the SFU server cluster, the average number of pull streams that satisfies the load balancing requirement can be determined from the number of clients in the instant messaging system based on the mapping relation, and the preset pull-stream upper limit is configured accordingly for different numbers of clients.

In the embodiment of the present invention, a plurality of candidate pull stream upper limits may be preset, each candidate pull stream upper limit is substituted into the mapping relationship, an expected number of clients corresponding to each candidate pull stream upper limit is determined, and an expected number interval of clients corresponding to each candidate pull stream upper limit is determined based on the expected number.

After the expected number interval corresponding to each candidate pull-stream upper limit is determined, when the instant messaging method provided by the embodiment of the invention is actually applied, the candidate pull-stream upper limit corresponding to the expected number interval in which the actual number of clients in the instant messaging system falls, or in which the number of on-mic users in the session room falls, is used as the preset pull-stream upper limit of a single SFU server node in the instant messaging system, and the number of SFU server nodes at each level is determined as described in the previous embodiment of the invention.

The number of the candidate pull stream upper limits and the size of each candidate pull stream upper limit can be selected based on actual requirements, which is not limited by the embodiment of the present invention, but the maximum value should not exceed the operation capability of the SFU server node. As one example, 5 candidate pull stream upper limits may be determined, the first four being 20, 30, 40, and 50, respectively, and the largest candidate pull stream upper limit may then be set based on the computational capabilities of the SFU server node.

This process is specifically described below using the SFU server cluster with a three-level structure as an example.

Suppose the number of clients in the instant messaging system is N, the number of Push server nodes is A, and the number of Relay server nodes is B. The clients push a total of N audio streams to the first-stage Push server nodes, so the average load of a Push server node is N/A streams. On this basis, if the first preset number is 3, each Push server node selects 3 audio streams to output to the Relay server nodes, so a total of 3A audio streams are input to the Relay server nodes, and the average load of a Relay server node is 3A/B streams. Each Relay server node selects 3 audio streams to output to the single Relay/Play server node, so the load of the Relay/Play server node is 3B streams. If the average number of pull streams is denoted by m, the load balancing condition can be expressed as:

$$\frac{N}{A} = \frac{3A}{B} = 3B = m$$

Eliminating A and B from the above formula yields the mapping relation between m and N, specifically:

$$N = \frac{m^{3}}{9}$$
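
The elimination can be spelled out step by step. This is a sketch that assumes the three-level load balancing condition implied by the average loads given above, namely N/A = 3A/B = 3B = m, with a first preset number of 3:

$$3B = m \;\Rightarrow\; B = \frac{m}{3}$$

$$\frac{3A}{B} = m \;\Rightarrow\; A = \frac{mB}{3} = \frac{m^{2}}{9}$$

$$\frac{N}{A} = m \;\Rightarrow\; N = mA = \frac{m^{3}}{9}$$

The result agrees with the value quoted in this embodiment for m = 20, since 20^3 / 9 = 8000 / 9, roughly 888.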

Continuing the previous example, suppose the preset candidate pull-stream upper limits include 20, 30, 40, 50 and P_max, where P_max is determined based on the computing capability of the SFU server node. Substituting each specific value for m in the above formula gives the expected number N of clients corresponding to each candidate pull-stream upper limit, and the expected number interval corresponding to each candidate pull-stream upper limit is determined on the basis of N. Taking 20 as an example, the expected number N of corresponding clients is calculated to be about 888.8, so 0 to 800 may be used as the expected number interval corresponding to the candidate pull-stream upper limit 20.
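
The expected client numbers for the four fixed candidates can be tabulated with a few lines. This sketch assumes the three-level mapping N = m^3 / 9 that follows from the load balancing condition with a first preset number of 3; the function name and the generalisation to another first preset number k (N = m^3 / k^2) are illustrative additions, not part of the original text.

```python
def expected_clients(avg_pull, first_preset_number=3):
    """Expected client count for a three-level cluster: N = m**3 / k**2."""
    return avg_pull ** 3 / first_preset_number ** 2


for m in (20, 30, 40, 50):
    print(m, round(expected_clients(m), 1))
```

For m = 20 the exact value is 8000/9, which the text quotes as 888.8. The interval boundaries used in this embodiment (800, 3000, 7000 and 13300) lie at or below the values computed this way.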

On the basis, in an actual application scene, the preset pull stream upper limit of the SFU server node can be specifically configured by the following conditions:

Case 1: if the number of on-mic users < 800, the candidate pull-stream upper limit 20 is set as the preset pull-stream upper limit;

Case 2: if 800 < the number of on-mic users < 3000, the candidate pull-stream upper limit 30 is set as the preset pull-stream upper limit;

Case 3: if 3000 < the number of on-mic users < 7000, the candidate pull-stream upper limit 40 is set as the preset pull-stream upper limit;

Case 4: if 7000 < the number of on-mic users < 13300, the candidate pull-stream upper limit 50 is set as the preset pull-stream upper limit;

Case 5: if 13300 < the number of on-mic users, the preset pull-stream upper limit is set according to the computing capability of the SFU server node.
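
The five cases amount to a lookup from the number of on-mic users to a preset pull-stream upper limit, which can be sketched as below. The function name and the p_max parameter are hypothetical. The cases above use strict inequalities and leave the boundary values open, so the sketch assumes that a count exactly on a boundary falls into the lower interval.

```python
def preset_pull_limit(on_mic_count, p_max):
    """Map the number of on-mic users to a preset pull-stream upper limit."""
    # (upper bound of the expected number interval, candidate pull-stream upper limit)
    intervals = [(800, 20), (3000, 30), (7000, 40), (13300, 50)]
    for upper_bound, limit in intervals:
        # Assumption: a count equal to a boundary belongs to the lower interval.
        if on_mic_count <= upper_bound:
            return limit
    # Case 5: beyond the last interval, use the limit set by node capability.
    return p_max


print(preset_pull_limit(1000, p_max=60))   # 30
print(preset_pull_limit(10000, p_max=60))  # 50
```

The value 60 for p_max is a placeholder; in the embodiment it is determined by the computing capability of the SFU server node.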

The instant messaging system shown in fig. 3 has 1000 clients, which corresponds to case 2 above, and therefore the preset pull-stream upper limit is configured to be 30. On this basis, the deployment of the SFU server nodes in the SFU server cluster illustrated in fig. 3 can be determined in combination with the total numbers of input audio streams corresponding to the Push server nodes and the Relay server nodes.

Fig. 4 is a schematic diagram of an instant messaging system according to an embodiment of the present invention, which contains a total of 10000 clients, from client 1 to client 10000. This corresponds to case 4 above, and therefore the preset pull-stream upper limit is configured to be 50. The ratio of 10000 to 50 is 200, so 200 Push server nodes, from SFU1-1 to SFU1-200, are deployed; when the first preset number is 3, the 200 Push server nodes select 600 audio streams to output to the Relay server nodes. The ratio of 600 to 50 is 12, so 12 Relay server nodes, from SFU2-1 to SFU2-12, are deployed. The 12 Relay server nodes select 36 audio streams to output to the Relay/Play server node SFU3-1, and SFU3-1 completes the final stream selection on these 36 audio streams.

As an example, in order to further improve the adaptability of the instant messaging system, when the number of clients joining the instant messaging system increases, the preset pull-stream upper limit may also be adjusted adaptively and the instant messaging system may be expanded. Specifically, if the number of clients joining the instant messaging system increases so that the expected number interval in which it falls changes, the current preset pull-stream upper limit may be raised to the candidate pull-stream upper limit corresponding to the expected number interval in which the changed number of clients falls. Based on the content of the foregoing embodiment of the present invention, the number of SFU server nodes matching the raised preset single-node pull-stream upper limit is then determined in combination with the total number of input streams of the SFU server nodes at each stage. If the determined number is greater than the number of SFU server nodes currently deployed, additional SFU server nodes may be deployed for the instant messaging system based on the determined number.

In the embodiment of the invention, a plurality of candidate pull-stream upper limits are preset, the mapping relation between the candidate pull-stream upper limits and the number of clients in the instant messaging system is determined in advance based on the load balancing condition, and the expected number interval corresponding to each candidate pull-stream upper limit is determined. When the SFU server cluster in the instant messaging system is deployed, the corresponding candidate pull-stream upper limit can be set as the preset pull-stream upper limit based on the expected number interval in which the actual number of clients in the instant messaging system falls, so that load balancing among SFU server nodes of different levels is ensured, which in turn ensures the resource utilization rate of each SFU server node in the SFU server cluster.

Based on the same inventive concept, the embodiment of the present invention further provides an instant messaging method, which is applied to a client in an instant messaging system. The instant messaging system includes a plurality of clients and an SFU server cluster, and the plurality of clients are each communicatively connected to the SFU server cluster; for this instant messaging system, reference may be made to the description of fig. 1.

Fig. 5 is another flow chart of the instant messaging method provided by the embodiment of the present invention. Referring to fig. 5, the method specifically includes the following steps:

step S501: transmitting an audio stream to the SFU server cluster, wherein each RTP packet of the audio stream carries the sound intensity corresponding to that RTP packet, so that the SFU server cluster, for each audio stream received in the current receiving time window, determines the real-time sound intensity of that audio stream based on the sound intensity carried by each RTP packet that the audio stream contains in the receiving time window, and selects a first preset number of audio streams from the received audio streams in descending order of real-time sound intensity to transmit to the plurality of clients;

step S502: and receiving the audio streams sent by the SFU server cluster, and performing audio mixing processing on the received audio streams to obtain target audio streams for playing.

Specifically, the contents of steps S501-S502 correspond to steps S201-S203 in the foregoing, and thus reference may be made to the foregoing description, which is not repeated here.

In the embodiment of the invention, after receiving the audio streams sent by the clients, the SFU server cluster can determine the real-time sound intensity of each audio stream based on the sound intensity carried in its RTP packets and select the audio streams with higher real-time sound intensity for selective forwarding. This avoids forwarding all received audio streams and reduces the processing burden and bandwidth consumption of the SFU server cluster. Because the processing burden of the SFU server cluster is low, the instant messaging system can also support a larger number of clients participating in the instant messaging process at the same time, which improves the interactive experience.

In one embodiment of the present invention, the RTP packet of the audio stream also carries a unique identifier UID of the client that sends the audio stream;

the above-mentioned received audio streams are subjected to audio mixing processing to obtain a target audio stream for playing, which includes:

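determining, based on the UID carried in the RTP packets of each received audio stream, whether the received audio streams include the audio stream sent by the client itself;

if yes, removing that audio stream from the received audio streams, and performing audio mixing processing on the remaining audio streams to obtain the target audio stream;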

if not, carrying out audio mixing processing on all received audio streams to obtain a target audio stream.

In the embodiment of the present invention, when each client terminal sends its own audio stream to the SFU server cluster, its own UID (User Identification) may be carried in the RTP packet of the audio stream, so as to identify the source of the audio stream. As one example, the UID may be carried within an RTP packet extension header.

After the SFU server cluster sorts the received audio streams in descending order of real-time sound intensity and selects the first preset number of audio streams to send to each client, the client can determine, based on the UID carried in the RTP packets of each received audio stream, whether the audio streams selected by the SFU server cluster include the audio stream sent by the client itself. Specifically, if the UID carried in the RTP packets of an audio stream is consistent with the client's own UID, that audio stream is the one sent by the client itself.

After receiving the audio stream forwarded by the SFU server, the client can decode and mix all the received audio streams to obtain a target audio stream for playing if the received audio stream does not contain the audio stream sent by the client. If the received audio stream contains the audio stream sent by the client, the client deletes the audio stream before decoding the audio stream, and decodes and mixes the remaining audio streams to obtain the target audio stream without the self sound.

Continuing the previous example, in the instant messaging system shown in fig. 1, suppose that after receiving audio streams 1 to 5 the SFU server cluster determines the real-time sound intensities of audio streams 1 to 5 based on the sound intensities carried in the RTP packets of each audio stream and selects 3 audio streams accordingly: audio stream 3, audio stream 4 and audio stream 5. These 3 audio streams are sent to clients 1 to 5. The RTP packets of audio stream 3, audio stream 4 and audio stream 5 carry the UIDs of client 3, client 4 and client 5, respectively.

After receiving audio streams 3 to 5, client 1 determines that none of the UIDs carried in the RTP packets of the 3 audio streams is consistent with the UID of client 1, and therefore mixes audio streams 3 to 5. Client 5, after receiving audio streams 3 to 5, determines that the UID carried in the RTP packets of audio stream 5 is consistent with its own UID, so it deletes audio stream 5 before decoding and mixes audio stream 3 and audio stream 4.
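The client-side check in this example can be sketched as follows. This is a minimal illustration only: the `ForwardedStream` type and the `streams_to_mix` function are hypothetical names that do not appear in the embodiment, and the UID is assumed to have already been read from the RTP packets of each stream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ForwardedStream:
    # Hypothetical container for one forwarded audio stream.
    uid: int        # UID carried in the RTP packets of the stream
    payload: bytes  # still-encoded audio data (placeholder)


def streams_to_mix(own_uid: int, received: List[ForwardedStream]) -> List[ForwardedStream]:
    """Drop the stream whose UID matches the client's own UID, before decoding."""
    return [s for s in received if s.uid != own_uid]


# The SFU server cluster forwards audio streams 3, 4 and 5 to every client.
forwarded = [ForwardedStream(uid=u, payload=b"") for u in (3, 4, 5)]

client1 = [s.uid for s in streams_to_mix(1, forwarded)]
client5 = [s.uid for s in streams_to_mix(5, forwarded)]
print(client1)  # [3, 4, 5]
print(client5)  # [3, 4]
```

Client 1 keeps all three streams for mixing, while client 5 keeps only audio streams 3 and 4.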

In the embodiment of the invention, since the RTP packets of an audio stream carry the UID of the client that sends the audio stream, after the SFU server cluster sends the selected audio streams to each client, each client can judge, based on the UIDs carried in the RTP packets of the received audio streams, whether the received audio streams contain the audio stream sent by itself. If so, the client deletes that audio stream and mixes the remaining audio streams to obtain a personalized audio output that does not contain its own voice.

Because the client itself deletes its own audio stream from the audio streams sent by the SFU server cluster, the SFU server cluster does not need to pay attention to the source of the selected audio streams when forwarding them. The SFU server cluster only needs to forward the selected audio streams directly to each client, without arranging and forwarding differentiated audio streams for different clients, so that the processing load of the SFU server cluster can be reduced.

Fig. 6 is a schematic diagram of an extension header of an RTP packet according to an embodiment of the present invention. In one possible implementation of the present invention, for the instant messaging method according to any of the foregoing embodiments, the RTP packet may specifically use the extension header illustrated in fig. 6 to carry information such as the volume value of the audio stream, the voice activity identifier, and the UID of the client.

Referring to fig. 6, the extension header of the RTP packet specifically includes the following fields:

ID field: 8 bits long, filled with the UID of the client transmitting the RTP packet.

len field: 8 bits long, indicating the byte length of the fields that follow it. The V field following the len field is 1 bit long and the level field is 7 bits long, together occupying 1 byte, and therefore len=1 in the extension header.

V field: 1 bit long, namely the detection result flag mentioned above. If it is filled with 0, the payload of the RTP packet carries a non-voice signal; if it is filled with 1, the payload of the RTP packet carries a voice signal.

level field: 7 bits long, filled with the sound intensity of the signal carried within the payload of the RTP packet. The field may be filled with a value from 0 to 127, representing a volume value of -127 to 0 dBov (dB overload, a volume unit).

pad field: 8 bits long, filled with 0, used to ensure that the byte length of the RTP packet meets the minimum length requirement.
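The field layout listed above can be illustrated with a short packing and parsing sketch. It is not taken from the embodiment: the names `AudioExt`, `pack_ext` and `parse_ext` are hypothetical, and placing the V bit in the most significant bit of the third byte, ahead of the 7-bit level, is an assumption about the bit order in fig. 6.

```python
import struct
from typing import NamedTuple


class AudioExt(NamedTuple):
    uid: int     # ID field, 8 bits
    voice: bool  # V field, 1 bit (True = voice signal)
    level: int   # level field, 7 bits (0-127)


def pack_ext(ext: AudioExt) -> bytes:
    """Build the 4-byte extension: ID | len=1 | V+level | pad=0."""
    if not 0 <= ext.uid <= 0xFF:
        raise ValueError("uid must fit in 8 bits")
    if not 0 <= ext.level <= 0x7F:
        raise ValueError("level must fit in 7 bits")
    # Assumption: V is the top bit of the byte, level the lower 7 bits.
    v_level = (int(ext.voice) << 7) | ext.level
    return struct.pack("!BBBB", ext.uid, 1, v_level, 0)


def parse_ext(data: bytes) -> AudioExt:
    """Read ID, V and level back from the first 4 bytes of the extension."""
    if len(data) < 4:
        raise ValueError("extension header too short")
    uid, length, v_level, _pad = struct.unpack("!BBBB", data[:4])
    if length != 1:
        raise ValueError("unexpected len field")
    return AudioExt(uid=uid, voice=bool(v_level >> 7), level=v_level & 0x7F)


raw = pack_ext(AudioExt(uid=5, voice=True, level=30))
print(raw.hex())       # 05019e00
print(parse_ext(raw))  # AudioExt(uid=5, voice=True, level=30)
```

Note that with an 8-bit ID field, this sketch can only represent UIDs from 0 to 255.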

Therefore, when a client in the instant messaging system transmits an audio stream using RTP packets with the above extension header, the UID of the client, the sound intensity, and the detection result flag can be carried in the extension header. After receiving the audio streams sent by each client, the SFU server cluster may reject the non-voice audio streams with V=0 based on the V field in the RTP packets of each audio stream, determine the real-time sound intensity corresponding to each voice audio stream based on the sound intensity filled in the level field of the RTP packets of that audio stream, select the first preset number of audio streams in descending order of real-time sound intensity, and send the selected audio streams to each client.
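The selection step on the SFU server cluster side can be sketched as below. Several points are assumptions made only for illustration and are not stated in this passage: a stream is treated as non-voice when no packet in the window has V=1, the real-time sound intensity is taken as the mean level of the voiced packets in the window, and a larger level value is treated as a higher intensity. The name `select_streams` is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One (V flag, level) pair per RTP packet received in the current window.
Packet = Tuple[bool, int]


def select_streams(window: Dict[int, List[Packet]], top_n: int) -> List[int]:
    """Return the UIDs of the top_n voiced streams, loudest first."""
    intensities: Dict[int, float] = {}
    for uid, packets in window.items():
        voiced = [level for voice, level in packets if voice]
        if not voiced:
            continue  # assumed rule: no V=1 packet means a non-voice stream
        intensities[uid] = mean(voiced)  # assumed aggregation: mean level
    ranked = sorted(intensities, key=lambda u: intensities[u], reverse=True)
    return ranked[:top_n]


window = {
    1: [(False, 10), (False, 12)],  # non-voice, rejected
    2: [(True, 40), (True, 50)],
    3: [(True, 90), (True, 80)],
    4: [(True, 100), (False, 5)],
    5: [(True, 60), (True, 70)],
}
print(select_streams(window, 3))  # [4, 3, 5]
```

Other aggregation rules (for example a maximum or a smoothed average) would fit the same structure by replacing the `mean` call.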

After receiving the audio streams forwarded by the SFU server cluster, the client judges, based on the UID filled in the ID field of the RTP packets of each audio stream, whether the received audio streams contain the audio stream sent by the client itself. If so, the client deletes that audio stream and mixes the remaining audio streams; otherwise, it mixes all the received audio streams.
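The mixing step itself is not detailed in this passage. As a minimal sketch, assuming each remaining stream has already been decoded into a frame of signed 16-bit PCM samples, mixing can be approximated by summing the samples and clipping to the 16-bit range. The function name `mix_pcm` and the clipping strategy are illustrative assumptions, not the embodiment's method.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm(frames: Sequence[Sequence[int]]) -> List[int]:
    """Sum decoded PCM frames sample by sample and clip to the int16 range."""
    if not frames:
        return []
    # Only mix the samples that every frame has in common.
    length = min(len(f) for f in frames)
    mixed = []
    for i in range(length):
        total = sum(f[i] for f in frames)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


print(mix_pcm([[1000, -2000, 30000], [500, -500, 10000]]))  # [1500, -2500, 32767]
```

The third sample shows the clipping: 30000 + 10000 exceeds the 16-bit maximum and is limited to 32767.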

Based on the same inventive concept, the embodiment of the invention also provides an instant messaging system, which comprises an SFU server cluster and a plurality of clients, wherein the clients are respectively connected to the SFU server cluster in a communication way;

the client is configured to send an audio stream to the SFU server cluster; each RTP packet of the audio stream carries the sound intensity corresponding to the RTP packet;

the SFU server cluster is configured to receive audio streams sent by a plurality of clients in a current receiving time window, determine real-time sound intensity of the audio stream based on sound intensity carried in each RTP packet contained in the receiving time window of each audio stream, and select a first preset number of audio streams from the received audio streams according to the sequence from high to low of the real-time sound intensity and send the first preset number of audio streams to the plurality of clients;

Based on the same inventive concept, the embodiment of the present invention further provides an instant messaging device, which is applied to an SFU server cluster in an instant messaging system, the instant messaging system further includes a plurality of clients, and the plurality of clients are respectively connected to the SFU server cluster in a communication manner, referring to fig. 7, the device includes:

a receiving module 701, configured to receive audio streams sent by a plurality of clients in a current receiving time window; the audio stream carries the sound intensity corresponding to each RTP packet contained in the receiving time window;

the selection module 702 is configured to determine, for each audio stream, a real-time sound intensity of the audio stream based on a sound intensity carried in each RTP packet included in the audio stream, and select a first preset number of audio streams from the received audio streams according to a sequence of the real-time sound intensity from high to low;

the first sending module 703 is configured to send the selected audio stream to a plurality of clients, so that the clients perform mixing processing on each received audio stream to obtain a target audio stream for playing.

Based on the same inventive concept, the embodiment of the present invention further provides an instant messaging device, which is applied to a client in an instant messaging system, where the instant messaging system includes a plurality of clients and an SFU server cluster, and the plurality of clients are respectively connected to the SFU server cluster in a communication manner, and referring to fig. 8, the device includes:

a second sending module 801, configured to send an audio stream to the SFU server cluster, where each RTP packet of the audio stream carries the sound intensity corresponding to the RTP packet, so that the SFU server cluster determines, for each audio stream received in a current receiving time window, the real-time sound intensity of the audio stream based on the sound intensities carried in the RTP packets of the audio stream contained in the receiving time window, selects a first preset number of audio streams from the received audio streams in descending order of real-time sound intensity, and sends the selected audio streams to the plurality of clients;

a mixing module 802, configured to receive the audio streams sent by the SFU server cluster, and perform mixing processing on each received audio stream to obtain a target audio stream for playing.

The embodiment of the present invention also provides an electronic device, as shown in fig. 9, including a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 perform communication with each other through the communication bus 904,

a memory 903 for storing a computer program;

the processor 901 is configured to execute a program stored in the memory 903, and implement the following steps:

receiving audio streams sent by a plurality of clients in a current receiving time window; the audio stream carries the sound intensity corresponding to each RTP packet contained in the receiving time window;

for each path of audio stream, determining the real-time sound intensity of the path of audio stream based on the sound intensity carried in each RTP packet contained in the path of audio stream, and selecting a first preset number of paths of audio streams from the received audio streams according to the sequence from high to low of the real-time sound intensity;

and sending the selected audio streams to a plurality of clients so that the clients can mix the received audio streams to obtain target audio streams for playing.

The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, a computer readable storage medium is provided, where a computer program is stored, the computer program implementing the instant messaging method according to any of the above embodiments when executed by a processor.

In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform the instant messaging method of any of the above embodiments is also provided.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.

It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the instant messaging system, apparatus, electronic device, and computer-readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding description of the method embodiments.

The foregoing description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention is included in the protection scope of the present invention.

Claims

1. An instant messaging method, characterized in that the instant messaging method is applied to an SFU server cluster in an instant messaging system, the instant messaging system further comprises a plurality of clients, and the clients are respectively connected to the SFU server cluster in a communication manner, the method comprises:

2. The method of claim 1, wherein the SFU server nodes in the SFU server cluster are in a hierarchical structure, the audio streams are transmitted level by level within the SFU server cluster, and the SFU server nodes in the SFU server cluster at least comprise, in the transmission order of the audio streams: one or more levels of audio stream screening server nodes and one level of summary stream selection server node;

3. The method of claim 1, wherein RTP packets of the audio stream further carry a detection result flag for voice endpoint detection, and wherein the detection result flag carried in each RTP packet is a voice flag or a non-voice flag;

4. The method according to claim 2, wherein the method further comprises:

5. The method of claim 4, wherein a plurality of candidate pull stream upper limits are preconfigured for the SFU server node, and each of the candidate pull stream upper limits corresponds in advance to an expected number of intervals, the method further comprising:

6. An instant messaging method, applied to a client in an instant messaging system, the instant messaging system including a plurality of clients and an SFU server cluster, the plurality of clients being respectively communicatively connected to the SFU server cluster, the method comprising:

7. The instant messaging method of claim 6 wherein the RTP packets of the audio stream further carry a unique identifier UID of the client sending the audio stream;

8. An instant messaging system, comprising an SFU server cluster and a plurality of clients, wherein the clients are respectively connected to the SFU server cluster in a communication mode;

9. An instant messaging apparatus, applied to an SFU server cluster in an instant messaging system, the instant messaging system further comprising a plurality of clients, the plurality of clients being respectively communicatively connected to the SFU server cluster, the apparatus comprising:

10. An instant messaging apparatus, applied to a client in an instant messaging system, the instant messaging system including a plurality of clients and an SFU server cluster, the plurality of clients being respectively communicatively connected to the SFU server cluster, the apparatus comprising:

11. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for carrying out the method steps of any one of claims 1-5 or 6-7 when executing a program stored on a memory.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5 or 6-7.