CN112751849A - Conference control method, device, equipment and medium based on video networking - Google Patents


Info

Publication number
CN112751849A
CN112751849A
Authority
CN
China
Prior art keywords
audio stream
terminal
conference
participant
video
Prior art date
Legal status
Granted
Application number
CN202011590205.1A
Other languages
Chinese (zh)
Other versions
CN112751849B (en)
Inventor
乔金龙
杨春晖
王艳辉
牛永会
Current Assignee
Visionvera Information Technology Co Ltd
Original Assignee
Visionvera Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Visionvera Information Technology Co Ltd filed Critical Visionvera Information Technology Co Ltd
Priority to CN202011590205.1A priority Critical patent/CN112751849B/en
Publication of CN112751849A publication Critical patent/CN112751849A/en
Application granted granted Critical
Publication of CN112751849B publication Critical patent/CN112751849B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40: Support for services or applications
    • H04L65/403: Arrangements for multi-party communication, e.g. for conferences
    • H04L65/60: Network streaming of media packets
    • H04L65/75: Media network packet handling
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/08: Protocols for interworking; Protocol conversion


Abstract

The embodiments of the present invention provide a video networking-based conference control method, apparatus, electronic device, and storage medium, applied in the field of communication technology and aimed at improving the efficiency of conference control. The method is applied to a server and comprises the following steps: receiving an encoder control instruction obtained by performing video networking protocol conversion on a mute request sent by the conference control terminal; determining a first participant terminal and at least one second participant terminal corresponding to the encoder control instruction; searching the current multi-channel mixed audio streams for the target mixed audio stream sent to the first participant terminal; and updating the decoding marks of the audio streams belonging to the second participant terminals in the target mixed audio stream to a preset mark, so that the first participant terminal discards those audio streams according to the preset mark.

Description

Conference control method, device, equipment and medium based on video networking
Technical Field
The present invention relates to the field of network communication technologies, and in particular, to a method, an apparatus, a device, and a medium for controlling a conference based on a video network.
Background
The video network adopts the world's advanced Vision Vera real-time high-definition video switching technology to realize real-time transmission of high-definition video across the whole network, which the current Internet cannot achieve. It integrates dozens of services, such as high-definition video conferencing, video surveillance, remote training, intelligent monitoring analysis, emergency command, video telephony, live broadcast, television mail, and information distribution, into one system platform, and realizes real-time interconnection and intercommunication of high-definition video communication through various terminal devices.
In the related art, during a conference held across the video network and the Internet, there is often a requirement that some participant terminals must not hear the speech of certain important terminals, or that some participant terminals are not allowed to speak. To meet this requirement, the microphone of a terminal that is not allowed to speak is usually turned off manually, or the audio channel between the important terminal and the other terminals is cut off. However, this approach requires dedicated conference service personnel and is cumbersome to operate, which affects the normal running of the conference.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a method, an apparatus, an electronic device, and a storage medium for controlling a conference based on a video network, so as to overcome or at least partially solve the above problems.
The embodiment of the invention discloses a video network-based conference control method, applied to a system comprising a conference control terminal located in the Internet and a server connected to the conference control terminal; the method is applied to the server and comprises the following steps:
receiving an encoder control instruction obtained after performing video networking protocol conversion on a mute request sent by the conference control terminal;
determining a first participant terminal and at least one second participant terminal corresponding to the encoder control instruction;
searching the current multi-channel mixed audio streams for the target mixed audio stream sent to the first participant terminal; each mixed audio stream in the multiple paths is obtained by mixing the audio streams sent by the plurality of terminals currently speaking;
and updating the decoding mark of the audio stream belonging to the at least one second participant terminal in the target mixed audio stream to a preset mark, so that the first participant terminal discards the audio stream belonging to the at least one second participant terminal in the target mixed audio stream according to the preset mark.
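Under illustrative assumptions, these four server-side steps can be sketched in Python. This is a toy model only: `MixedStream`, `handle_mute`, and the mark values `DECODE_OK`/`DECODE_SKIP` are hypothetical names, not terms from the patent, and a real server would set per-stream marks in the audio headers rather than in an in-memory dict.

```python
DECODE_OK = 1    # decodable mark: the receiver decodes and plays this audio stream
DECODE_SKIP = 0  # preset mark: the receiver discards this audio stream

class MixedStream:
    """One mixed audio stream sent to a single participant terminal."""
    def __init__(self, receiver_id, decode_flags):
        self.receiver_id = receiver_id
        # speaker terminal id -> decoding mark carried with its audio in the mix
        self.decode_flags = decode_flags

def handle_mute(mixed_streams, first_terminal, second_terminals):
    """Apply an encoder control instruction: mute the second terminals
    relative to the first terminal only."""
    # Step 3: search for the target mixed stream sent to the first terminal.
    target = next(s for s in mixed_streams if s.receiver_id == first_terminal)
    # Step 4: update the second terminals' decoding marks to the preset mark;
    # marks in streams sent to other receivers are left untouched.
    for t in second_terminals:
        if t in target.decode_flags:
            target.decode_flags[t] = DECODE_SKIP
    return target

streams = [MixedStream("T1", {"T2": DECODE_OK, "T3": DECODE_OK}),
           MixedStream("T2", {"T1": DECODE_OK, "T3": DECODE_OK})]
target = handle_mute(streams, "T1", ["T3"])
print(target.decode_flags)  # {'T2': 1, 'T3': 0} -- T1 discards T3's audio
```

Note that only the stream addressed to T1 changes; T2 continues to hear T3, which matches the per-receiver nature of the mute described above.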
Optionally, updating the decoding mark of the audio stream belonging to the at least one second participant terminal in the target mixed audio stream to a preset mark includes:
determining a target audio stream belonging to the at least one second participant terminal in the target mixed audio stream;
and setting the decoding mark of each target audio stream as a preset mark, and keeping the decoding marks of other audio streams except the target audio stream in the target mixed audio stream unchanged.
Optionally, the mute request further carries a mute duration, and the encoder control instruction includes the mute duration; the method further comprises the following steps:
generating an audio recovery instruction when it is determined that the sending duration of the target mixed audio stream after the decoding mark is updated reaches the mute duration;
and restoring, based on the audio recovery instruction, the decoding marks of the audio streams belonging to the at least one second participant terminal in the target mixed audio stream to decodable marks, so that the first participant terminal decodes and plays the audio streams belonging to the at least one second participant terminal in the target mixed audio stream.
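A minimal sketch of this timed recovery, under the same kind of illustrative model (all names assumed): once the updated stream has been sent for the full mute duration, the decodable marks are restored.

```python
DECODABLE, PRESET = 1, 0  # illustrative mark values, not from the patent

def maybe_restore(decode_flags, muted_terminals, sent_seconds, mute_duration):
    # Once the updated target mixed stream has been sent for the full mute
    # duration, an audio recovery instruction restores the decodable marks.
    if sent_seconds >= mute_duration:
        for t in muted_terminals:
            decode_flags[t] = DECODABLE
    return decode_flags

flags = {"T2": DECODABLE, "T3": PRESET}
maybe_restore(flags, ["T3"], sent_seconds=30, mute_duration=30)
print(flags)  # {'T2': 1, 'T3': 1}
```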
Optionally, the method further comprises:
receiving a video networking speech control instruction obtained by performing video networking protocol conversion on the speech control instruction sent by the conference control terminal; the video networking speech control instruction carries the identifier and the speaking duration of a third participant terminal to be controlled;
responding to the video networking speech control instruction by recording, from the current moment, the sending duration of the audio stream belonging to the third participant terminal in the multi-channel mixed audio streams;
and when the sending duration of the third participant terminal's audio stream reaches the speaking duration, updating the decoding mark of the audio stream belonging to the third participant terminal in the multi-channel mixed audio streams to the preset mark, so that the other participant terminals in the video conference discard the audio stream belonging to the third participant terminal according to the preset mark.
Optionally, after updating the decoding mark of the audio stream belonging to the third participant terminal in the multi-channel mixed audio streams to the preset mark, the method further includes:
receiving a video networking speaking instruction obtained after video networking protocol conversion is carried out on the speaking request instruction sent by the conference control terminal, wherein the speaking request instruction is sent to the conference control terminal by the third participant terminal;
and responding to the video networking speaking instruction, and respectively restoring the decoding marks of the audio streams belonging to the third participant terminal in the multi-channel mixed audio stream into decodable marks so as to enable other participant terminals in the video conference to decode and play the audio streams belonging to the third participant terminal.
Optionally, the method further comprises:
receiving a video networking conference quitting instruction obtained by performing video networking protocol conversion on the conference quitting instruction sent by the conference control terminal, wherein the video networking conference quitting instruction carries an identifier of a fourth participant terminal quitting the video conference;
and responding to the video networking conference quit instruction by closing the audio channel between the server and the fourth participant terminal, so that each mixed audio stream no longer contains the audio stream of the fourth participant terminal.
In a second aspect of the embodiments of the present application, a video network-based conference control device is provided, in a system comprising a conference control terminal located in the Internet and a server connected to the conference control terminal; the device is located in the server and comprises:
the request receiving module is used for receiving an encoder control instruction obtained after performing video networking protocol conversion on the mute request sent by the conference control terminal;
the determining module is used for determining a first participant terminal and at least one second participant terminal corresponding to the encoder control instruction;
the instruction response module is used for searching the current multi-channel mixed audio streams for the target mixed audio stream sent to the first participant terminal; each mixed audio stream in the multiple paths is an audio stream obtained after the video networking server mixes the audio streams sent by the plurality of terminals currently speaking;
and the mark changing module is used for updating the decoding mark of the audio stream belonging to the at least one second participant terminal in the target mixed audio stream to a preset mark, so that the first participant terminal discards the audio stream belonging to the at least one second participant terminal in the mixed audio stream according to the preset mark.
The embodiment of the invention also discloses an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the video networking-based conference control method of the first aspect when executing the program.
The embodiment of the invention also discloses a computer-readable storage medium storing a computer program that causes a processor to execute the video networking-based conference control method of the first aspect of the embodiment of the invention.
The embodiment of the invention has the following advantages:
In this embodiment, the server may search the current multiple mixed audio streams, based on the encoder control instruction, for the target mixed audio stream sent to the first participant terminal, and update the decoding marks of the audio streams belonging to the second participant terminals in that stream to a preset mark, so that the first participant terminal discards those audio streams according to the preset mark. The first participant terminal therefore no longer plays the audio of any second participant terminal, meeting the requirement to mute the second participant terminals with respect to the first. Because whether an audio stream is muted for a given receiving end can be controlled simply by setting its decoding mark, the entire muting process is carried out automatically by the terminals, avoiding the high labor cost of manually closing audio channels and improving conference control efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a schematic networking diagram of a video network of the present invention;
FIG. 2 is a schematic diagram of a hardware architecture of a node server according to the present invention;
fig. 3 is a schematic diagram of a hardware structure of an access switch of the present invention;
fig. 4 is a schematic diagram of a hardware structure of an ethernet protocol conversion gateway according to the present invention;
FIG. 5 is a schematic structural diagram of a communication architecture of the present invention;
FIG. 6 is a schematic flow chart illustrating the steps of a video network-based conference control method according to the present invention;
fig. 7 is a schematic structural diagram of a conference control device based on a video network in the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
To better understand the embodiments of the present invention, the video network is described below:
the video networking is an important milestone for network development, is a real-time network, can realize high-definition video real-time transmission, and pushes a plurality of internet applications to high-definition video, and high-definition faces each other.
The video network adopts real-time high-definition video switching technology and can integrate dozens of required services, covering video, voice, pictures, text, communication, and data, on one network platform: high-definition video conferencing, video surveillance, intelligent monitoring analysis, emergency command, digital broadcast television, time-shifted television, network teaching, live broadcast, VOD on demand, television mail, Personal Video Recorder (PVR), intranet (self-office) channels, intelligent video broadcast control, information distribution, and so on, realizing high-definition-quality video playback through a television or a computer.
some of the technologies applied in the video networking are as follows:
network Technology (Network Technology)
Network technology innovation in the video network improves the traditional Ethernet to handle the potentially huge video traffic on the network. Unlike pure network packet switching or network circuit switching, the video network technology adopts packet switching to meet streaming media requirements. It offers the flexibility, simplicity, and low cost of packet switching together with the quality and security guarantees of circuit switching, realizing the seamless connection of whole-network switched virtual circuits and data formats.
Switching Technology (Switching Technology)
The video network adopts the two advantages of Ethernet, asynchrony and packet switching, and eliminates Ethernet's defects on the premise of full compatibility. It provides end-to-end seamless connection across the whole network, communicates directly with user terminals, and directly carries IP data packets. User data requires no format conversion anywhere across the network. The video network is a higher-level form of Ethernet: a real-time exchange platform that can realize the whole-network, large-scale, real-time transmission of high-definition video that the existing Internet cannot, pushing many network video applications toward high definition and unification.
Server Technology (Server Technology)
Server technology on the video network and the unified video platform differs from traditional servers: its streaming media transmission is built on a connection-oriented basis, its data processing capability is independent of traffic and communication time, and a single network layer can contain both signaling and data transmission. For voice and video services, streaming media processing on the video network and the unified video platform is much simpler than data processing, and efficiency is improved more than a hundredfold over a traditional server.
Storage Technology (Storage Technology)
To handle very large-capacity, very high-traffic media content, the ultra-high-speed storage technology of the unified video platform adopts an advanced real-time operating system. The program information in a server instruction is mapped to specific hard disk space, so media content no longer passes through the server but is sent directly and instantly to the user terminal, with a typical user waiting time of less than 0.2 seconds. Optimized sector distribution greatly reduces the mechanical seek motion of the hard disk head: resource consumption is only 20% of that of an IP Internet system of the same grade, yet concurrent traffic 3 times that of a traditional hard disk array is produced, and overall efficiency is improved by more than 10 times.
Network Security Technology (Network Security Technology)
The structural design of the video network eliminates, at the structural level, the network security problems that trouble the Internet, through mechanisms such as independent permission control for each service and complete isolation of equipment and user data. It generally requires no antivirus programs or firewalls, avoids hacker and virus attacks, and provides users with a structurally worry-free secure network.
Service Innovation Technology (Service Innovation Technology)
The unified video platform integrates services and transmission: whether for a single user, a private-network user, or a network aggregate, there is only one automatic connection. The user terminal, set-top box, or PC connects directly to the unified video platform to obtain a variety of multimedia video services. The unified video platform uses a menu-style configuration table instead of traditional complex application programming, so complex applications can be realized with very little code, enabling unlimited new service innovation.
Networking of the video network is as follows:
the video network is a centralized control network structure, and the network can be a tree network, a star network, a ring network and the like, but on the basis of the centralized control node, the whole network is controlled by the centralized control node in the network.
As shown in fig. 1, the video network is divided into an access network and a metropolitan network.
The devices of the access network part can be mainly classified into 3 types: node server, access switch, terminal (including various set-top boxes, coding boards, memories, etc.). The node server is connected to an access switch, which may be connected to a plurality of terminals and may be connected to an ethernet network.
The node server is a node which plays a centralized control function in the access network and can control the access switch and the terminal. The node server can be directly connected with the access switch or directly connected with the terminal.
Similarly, devices of the metropolitan network portion may also be classified into 3 types: a metropolitan area server, a node switch and a node server. The metro server is connected to a node switch, which may be connected to a plurality of node servers.
The node server here is the same node server as in the access network part; that is, the node server belongs to both the access network part and the metropolitan area network part.
The metropolitan area server is a node which plays a centralized control function in the metropolitan area network and can control a node switch and a node server. The metropolitan area server can be directly connected with the node switch or directly connected with the node server.
Therefore, the whole video network is a network structure with layered centralized control, and the network controlled by the node server and the metropolitan area server can be in various structures such as tree, star and ring.
The access network part can form a unified video platform (the part in the dotted circle), and a plurality of unified video platforms can form a video network; each unified video platform may be interconnected via metropolitan area and wide area video networking.
1. Video networking device classification
1.1 The devices in the video network of the embodiment of the present invention can be mainly classified into 3 types: servers, switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, coding boards, memories, etc.). The video network as a whole can be divided into a metropolitan area network (or national network, global network, etc.) and an access network.
1.2 The devices of the access network part can be mainly classified into 3 types: node servers, access switches (including Ethernet protocol conversion gateways), and terminals (including various set-top boxes, coding boards, memories, etc.).
The specific hardware structure of each access network device is as follows:
a node server:
as shown in fig. 2, the system mainly includes a network interface module 201, a switching engine module 202, a CPU module 203, and a disk array module 204;
the network interface module 201, the CPU module 203, and the disk array module 204 all enter the switching engine module 202; the switching engine module 202 performs an operation of looking up the address table 205 on the incoming packet, thereby obtaining the direction information of the packet; and stores the packet in a queue of the corresponding packet buffer 206 based on the packet's steering information; if the queue of the packet buffer 206 is nearly full, it is discarded; the switching engine module 202 polls all packet buffer queues for forwarding if the following conditions are met: 1) The port send buffer is not full; 2) the queue packet counter is greater than zero. The disk array module 204 mainly implements control over the hard disk, including initialization, read-write, and other operations on the hard disk; the CPU module 203 is mainly responsible for protocol processing with an access switch and a terminal (not shown in the figure), configuring an address table 205 (including a downlink protocol packet address table, an uplink protocol packet address table, and a data packet address table), and configuring the disk array module 204.
The access switch:
as shown in fig. 3, the network interface module mainly includes a network interface module (a downlink network interface module 301 and an uplink network interface module 302), a switching engine module 303 and a CPU module 304;
A packet (uplink data) arriving from the downlink network interface module 301 enters the packet detection module 305. The packet detection module 305 checks whether the destination address (DA), source address (SA), packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id) and passes the packet to the switching engine module 303, otherwise the packet is discarded. A packet (downlink data) arriving from the uplink network interface module 302 enters the switching engine module 303 directly, as do packets from the CPU module 304. The switching engine module 303 looks up the address table 306 for each incoming packet to obtain its direction information. If a packet entering the switching engine module 303 is going from a downlink network interface to an uplink network interface, it is stored in the queue of the corresponding packet buffer 307 in association with its stream-id; otherwise, it is stored in the queue of the corresponding packet buffer 307 according to its direction information. In either case, if the queue of the packet buffer 307 is nearly full, the packet is discarded.
The switching engine module 303 polls all packet buffer queues and may include two cases:
if the queue is from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queued packet counter is greater than zero; 3) obtaining a token generated by a code rate control module;
if the queue is not from the downlink network interface to the uplink network interface, the following conditions are met for forwarding: 1) the port send buffer is not full; 2) the queue packet counter is greater than zero.
The rate control module 308 is configured by the CPU module 304, and generates tokens for packet buffer queues from all downstream network interfaces to upstream network interfaces at programmable intervals to control the rate of upstream forwarding.
The CPU module 304 is mainly responsible for protocol processing with the node server, configuration of the address table 306, and configuration of the code rate control module 308.
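A hedged sketch of the token mechanism used for uplink queues (class and method names are illustrative, not from the patent): the CPU-configured interval drives token generation, and an uplink queue may forward only while a token is available, which is the third forwarding condition above.

```python
class RateControl:
    """Toy model of the rate control module: one token is generated per
    configured interval, bounding the uplink forwarding rate."""
    def __init__(self, interval_ms):
        self.interval_ms = interval_ms  # programmable interval set by the CPU
        self.tokens = 0
        self._elapsed = 0

    def tick(self, ms):
        # Generate one token each time a full interval elapses.
        self._elapsed += ms
        while self._elapsed >= self.interval_ms:
            self._elapsed -= self.interval_ms
            self.tokens += 1

    def try_forward(self):
        # Uplink forwarding condition 3): a token must be available.
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

rc = RateControl(interval_ms=10)
rc.tick(25)  # 25 ms elapsed -> 2 tokens generated
print(rc.try_forward(), rc.try_forward(), rc.try_forward())  # True True False
```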
Ethernet protocol conversion gateway
As shown in fig. 4, the apparatus mainly includes a network interface module (a downlink network interface module 401 and an uplink network interface module 402), a switching engine module 403, a CPU module 404, a packet detection module 405, a rate control module 408, an address table 406, a packet buffer 407, a MAC adding module 409, and a MAC deleting module 410.
A data packet arriving from the downlink network interface module 401 enters the packet detection module 405. The packet detection module 405 checks whether the Ethernet MAC DA, Ethernet MAC SA, Ethernet length or frame type, video network destination address DA, video network source address SA, video network packet type, and packet length of the packet meet the requirements; if so, it allocates a corresponding stream identifier (stream-id), and the MAC deletion module 410 strips the MAC DA, MAC SA, and length or frame type (2 bytes) before the packet enters the corresponding receiving buffer; otherwise, the packet is discarded.
the downlink network interface module 401 detects the sending buffer of the port, and if there is a packet, obtains the ethernet MAC DA of the corresponding terminal according to the destination address DA of the packet, adds the ethernet MAC DA of the terminal, the MAC SA of the ethernet protocol gateway, and the ethernet length or frame type, and sends the packet.
The other modules in the ethernet protocol gateway function similarly to the access switch.
A terminal:
the system mainly comprises a network interface module, a service processing module and a CPU module; for example, the set-top box mainly comprises a network interface module, a video and audio coding and decoding engine module and a CPU module; the coding board mainly comprises a network interface module, a video and audio coding engine module and a CPU module; the memory mainly comprises a network interface module, a CPU module and a disk array module.
1.3 The devices of the metropolitan area network part can be mainly classified into 3 types: node servers, node switches, and metropolitan area servers. The node switch mainly comprises a network interface module, a switching engine module, and a CPU module; the metropolitan area server mainly comprises a network interface module, a switching engine module, and a CPU module.
2. Video networking packet definition
2.1 Access network packet definition
The data packet of the access network mainly comprises the following parts, laid out in order as shown below: Destination Address (DA), Source Address (SA), reserved bytes, payload (PDU), and CRC.

DA | SA | Reserved | Payload | CRC
wherein:
the Destination Address (DA) is composed of 8 bytes: the first byte represents the type of the data packet (such as various protocol packets, multicast data packets, unicast data packets, etc.), allowing at most 256 types; the second through sixth bytes are the metropolitan area network address; and the seventh and eighth bytes are the access network address;
the Source Address (SA) is also composed of 8 bytes (byte), defined as the same as the Destination Address (DA);
the reserved byte consists of 2 bytes;
The length of the payload part varies with the type of datagram: 64 bytes for the various protocol packets, and 32 + 1024 = 1056 bytes for a unicast packet, although the length is of course not limited to these 2 cases.
The CRC consists of 4 bytes and is calculated according to the standard Ethernet CRC algorithm.
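The field layout above can be sketched in Python. This is an illustrative model of the described format only: the byte order of the CRC field and the reserved-byte values are assumptions, since the text does not specify them.

```python
import struct
import zlib


def build_access_packet(da: bytes, sa: bytes, payload: bytes) -> bytes:
    """Assemble an access-network packet: DA(8) SA(8) Reserved(2) Payload CRC(4)."""
    assert len(da) == 8 and len(sa) == 8
    body = da + sa + b"\x00\x00" + payload
    crc = zlib.crc32(body) & 0xFFFFFFFF  # standard Ethernet CRC-32
    return body + struct.pack(">I", crc)  # big-endian CRC placement is an assumption


def parse_access_packet(packet: bytes) -> dict:
    """Split a packet back into its fields and verify the CRC."""
    body, crc = packet[:-4], struct.unpack(">I", packet[-4:])[0]
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    da = body[:8]
    return {
        "type": da[0],           # first DA byte: packet type (protocol/multicast/unicast)
        "metro_addr": da[1:6],   # DA bytes two through six: metropolitan area network address
        "access_addr": da[6:8],  # DA bytes seven and eight: access network address
        "sa": body[8:16],
        "payload": body[18:],    # after the 2 reserved bytes
    }
```

For instance, a 64-byte protocol packet round-trips through `build_access_packet` and `parse_access_packet` with a total length of 8 + 8 + 2 + 64 + 4 = 86 bytes.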
2.2 metropolitan area network packet definition
The topology of a metropolitan area network is a graph, and there may be 2 or even more connections between two devices, that is, more than 2 connections between a node switch and a node server, or between two node switches. However, the metropolitan area network address of each metropolitan area network device is unique, so in order to accurately describe the connection relationship between devices, the embodiment of the present invention introduces a parameter: a label, to uniquely describe a metropolitan area network device.
In this specification, the definition of the label is similar to that of an MPLS (Multi-Protocol Label Switching) label: assuming there are two connections between device A and device B, a packet from device A to device B has 2 labels, and a packet from device B to device A likewise has 2 labels. Labels are classified into incoming labels and outgoing labels; assuming the label of a packet entering device A (the incoming label) is 0x0000, the label of the packet leaving device A (the outgoing label) may become 0x0001. The network access process of the metropolitan area network is performed under centralized control, that is, both address allocation and label allocation are dominated by the metropolitan area server, while the node switches and node servers execute passively. This differs from label allocation in MPLS, where labels are the result of mutual negotiation between the switch and the server.
As shown in the following table, the data packet of the metro network mainly includes the following parts:
| DA | SA | Reserved | Label | Payload | CRC |
Namely: Destination Address (DA), Source Address (SA), Reserved bytes, label, Payload (PDU), and CRC. The format of the label may be defined with reference to the following: the label is 32 bits, with the upper 16 bits reserved and only the lower 16 bits used; it is positioned between the reserved bytes and the payload of the packet.
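The centrally assigned label scheme can be sketched as follows. Device names and the table structure are hypothetical; only the 16 usable label bits and the incoming/outgoing label swap come from the text.

```python
class MetroDevice:
    """Model of a metropolitan area network device with a label-forwarding
    table. Unlike MPLS, the entries are installed centrally by the
    metropolitan area server rather than negotiated between devices."""

    def __init__(self, name: str):
        self.name = name
        self.label_map = {}  # incoming label -> (outgoing label, next hop)

    def install(self, in_label: int, out_label: int, next_hop: str) -> None:
        # Labels are 32 bits on the wire but only the lower 16 bits are used.
        self.label_map[in_label & 0xFFFF] = (out_label & 0xFFFF, next_hop)

    def forward(self, in_label: int, payload: bytes):
        out_label, next_hop = self.label_map[in_label & 0xFFFF]
        return out_label, next_hop, payload


# Two connections between device A and device B give device A two entries:
dev_a = MetroDevice("A")
dev_a.install(0x0000, 0x0001, "B")  # a packet entering A with label 0x0000 leaves with 0x0001
dev_a.install(0x0002, 0x0003, "B")
```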
Based on the characteristics of the above-mentioned video network, the applicant proposes the following core technical concept to solve the above technical problems: according to an encoder control instruction, the server updates the decoding flags of the audio streams belonging to the second participant terminals in the target mixed audio stream sent to the first participant terminal to a preset flag, so that the first participant terminal discards those audio streams, and the second participant terminals are thereby muted with respect to the first participant terminal.
First, an implementation environment of the embodiment of the present application is described, which specifically includes a conference control end located in the internet and a server connected to the conference control end. The server has a function of supporting the video networking conference controlled by the conference control terminal, specifically, the server may be a single server or a server cluster composed of a plurality of servers, and in the server cluster, different servers can execute different tasks in the video networking conference.
Referring to fig. 5, a schematic diagram of a communication architecture in the embodiment of the present invention is shown. As shown in fig. 5, the communication architecture includes a conference control end located in the internet, and the server connected to the conference control end is a server cluster. Specifically, the server cluster may include a protocol conversion server and a video networking server, where the protocol conversion server is configured to convert the various types of information sent by the conference control end into the video networking protocol and send the result to the video networking server.
The conference control end may be a terminal equipped with a conference control client, and can be understood as a multipoint control unit of the traditional internet protocols: it is used to create and control a conference organized by video conference terminals using the H.323 protocol, and can perform conference control operations such as muting and screen mixing during the conference. The conference control end may control a plurality of participant terminals participating in the conference; for example, as shown in fig. 5, the conference control end may control participant terminals 1 to 5.
Among the multiple participant terminals participating in the conference, some may be located in the internet and the rest in the video network; as shown in fig. 5, participant terminals 1 to 5 are located in the internet, and the remaining participant terminals are located in the video network. A participant terminal may be a set-top box, a smart phone, a network television, a camera, a microphone, or other equipment.
When the server includes the protocol conversion server, the protocol conversion server is responsible for communication between the video network and the internet, that is, for connecting devices in the video network with devices in the internet; in the communication process, it may perform streaming media protocol conversion and encapsulation of audio and video data packets.
When the server comprises the protocol conversion server, a participant terminal located in the internet may establish a communication link with the protocol conversion server, so as to send its audio to the video networking server through the protocol conversion server.
When the server comprises the video networking server, the video networking server can be used for processing the audio and video sent by each participating terminal in the conference and then transmitting the audio and video data among the participating terminals so as to maintain the normal operation of the conference. For example, the video network server may receive and mix the audio of the speaking terminal in the participating terminals, and then transmit the audio to the other participating terminals, or may receive and transmit the video of each participating terminal to the other participating terminals.
In the following, with reference to the communication architecture diagram shown in fig. 5, a conference control method based on the video network according to an embodiment of the present application is introduced from the server side. Referring to fig. 6, a flowchart of the steps of the conference control method based on the video network according to an embodiment of the present application is shown. The method is applied to a server and may include the following steps:
step S601: and receiving an encoder control instruction obtained after performing video networking protocol conversion on the mute request sent by the conference control terminal.
In an example, in the communication architecture shown in fig. 5, the protocol conversion server may receive the mute request sent by the conference control end, perform video networking protocol conversion on it to obtain an encoder control instruction, and send the encoder control instruction to the video networking server.
Of course, in an example, if the server is a separate server, the server may receive the mute request sent by the conference control end, and convert the mute request into the encoder control instruction of the video networking protocol.
In this embodiment, the mute request sent by the conference control end may be generated according to a mute operation performed by a user on the conference control end. In practice, a terminal list of the participant terminals currently in the conference may be displayed on the conference control end, and the user may select from it the first participant terminal and the second participant terminals to be muted, where the first participant terminal will be unable to play the audio streams of the second participant terminals, so that the second participant terminals are muted relative to the first participant terminal.
The mute request sent by the conference control end may be request signaling of the H.245 protocol. H.245 is a multimedia communication control protocol, the control signaling protocol in the H.323 multimedia communication system, and is mainly used for end-to-end exchange of H.245 messages between communicating H.323 endpoints or terminals.
In this embodiment, the mute request is sent by the conference control end in the internet and conforms to the internet protocols, specifically the H.245 protocol. In order to enable the server in the video network to respond to it, the protocol conversion server needs to convert the mute request into an encoder control instruction of the video networking protocol and send that instruction to the video networking server, so that the video networking server can respond to the mute request. The encoder control instruction may instruct the video networking server to change the codec flags of the corresponding audio streams in the mixed audio stream sent to the first participant terminal.
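The protocol-conversion step can be sketched as a simple rewrapping of the request. The dictionary keys and command name here are hypothetical, since the patent does not define the concrete message layout:

```python
def convert_mute_request(mute_request: dict) -> dict:
    """Rewrap an internet-side (H.245-style) mute request as a video
    networking encoder control instruction. Field names are illustrative."""
    return {
        "cmd": "ENCODER_CONTROL",
        # terminal that must no longer hear the muted terminals:
        "first_terminal": mute_request["first_terminal"],
        # terminals to be muted toward the first terminal:
        "second_terminals": list(mute_request["second_terminals"]),
    }


instruction = convert_mute_request(
    {"first_terminal": "T1", "second_terminals": ["T2", "T3"]}
)
```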
Step S602: and determining a first participating terminal and at least one second participating terminal corresponding to the encoder control instruction.
In this embodiment, the mute request may carry the identifier of the first participating terminal and the identifiers of the second participating terminals, and further, the encoder control instruction may also carry the identifier of the first participating terminal and the identifiers of the second participating terminals. In this embodiment, the video network server in the server cluster may determine the first participant terminal and the second participant terminal according to the identifier of the first participant terminal and the identifiers of the second participant terminals.
In the video network, the identifier of the participating terminal can be a MAC address or a video network number.
Step S603: and searching the target mixed audio stream sent to the first participant terminal from the current multi-channel mixed audio stream.
And each mixed audio stream in the multiple paths of mixed audio streams is audio obtained after audio streams sent by a plurality of current speaking terminals are mixed.
In this embodiment, the video networking server in the server cluster may determine, based on the encoder control instruction, that each second participant terminal in the current conference needs to be muted relative to the first participant terminal, that is, that the first participant terminal should not hear the sound of any second participant terminal. The video networking server can then search the current multiple mixed audio streams for the target mixed audio stream sent to the first participant terminal.
In the current conference, there are generally a plurality of speaking terminals, and the audio stream emitted by each speaking terminal needs to be heard by the other participant terminals. Thus, for each participant terminal there is a mixed audio stream sent to that terminal, and hence multiple mixed audio streams. For each participant terminal, the video networking server may mix the received audio streams of the speaking terminals, excluding the audio stream of that participant terminal (the transmission target) itself, so as to obtain the mixed audio stream for that participant terminal.
For example, there are 5 current speaking terminals, which are terminal 1, terminal 2, terminal 3, terminal 4 and terminal 5, respectively, where the terminal that needs to receive audio is terminal 3, and then the video network server mixes the audio streams of terminal 1, terminal 2, terminal 4 and terminal 5 and sends the mixed audio streams to terminal 3. By analogy, mixed audio streams for terminal 1, terminal 2, terminal 4 and terminal 5 can be obtained.
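The per-terminal "mix everyone but the listener" scheme described above can be sketched as follows, with mixing modelled as summing sample values in place of the real mixer:

```python
def build_mixes(speaking: dict) -> dict:
    """For every participant, mix the frames of all speaking terminals
    except the participant's own frame (the transmission target must not
    hear its own audio echoed back)."""
    mixes = {}
    for listener in speaking:
        others = [frame for tid, frame in speaking.items() if tid != listener]
        mixes[listener] = [sum(samples) for samples in zip(*others)]
    return mixes


# Terminals 1..5 are all speaking; the mix for terminal 3 excludes terminal 3.
frames = {1: [1, 1], 2: [2, 2], 3: [4, 4], 4: [8, 8], 5: [16, 16]}
mixes = build_mixes(frames)
```

Here the mix sent to terminal 3 sums the frames of terminals 1, 2, 4, and 5 only, matching the example in the text.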
In this way, in the server cluster, when the video network server receives the encoder control instruction, the target mixed audio stream sent to the first participant terminal can be searched, and then the encoding and decoding marks of the corresponding audio streams in the target mixed audio stream are processed.
Step S604: and updating the decoding mark of the audio stream belonging to the at least one second participating terminal in the target mixed audio stream into a preset mark, so that the first participating terminal discards the audio stream belonging to the at least one second participating terminal in the target mixed audio stream according to the preset mark.
In this embodiment, since the first participant terminal must be unable to play the audio stream of each second participant terminal, so as to mute the second participant terminals with respect to the first participant terminal, the decoding flag of each audio stream belonging to a second participant terminal in the target mixed audio stream may be updated to a preset flag. The decoding flag of an audio stream characterizes whether decoding of that stream is prohibited; if decoding is prohibited, the stream cannot be decoded and may be discarded.
In a specific implementation, each mixed audio stream may carry its own audio decoder instruction. The decoder instruction stores the decoding address and a decoding switch flag, i.e. the decoding flag, for each audio channel in the mixed audio stream. When a participant terminal is expected to play a certain audio channel, the decoder instruction includes the decoding address of that channel together with a flag that enables decoding, where the flag may be 1 for decoding enabled and 0 for decoding prohibited. The receiving end can thus decode and play the received mixed audio stream adaptively according to the audio decoder instruction.
In this embodiment, the video networking server in the server cluster may update the decoding flag of each audio stream belonging to a second participant terminal to the preset flag, where the preset flag indicates that decoding of that audio stream is prohibited. The first participant terminal can therefore discard, according to the preset flag, those audio streams of the target mixed audio stream whose decoding flag is the preset flag, so that the audio of the second participant terminals is not played at the first participant terminal, muting the second participant terminals relative to the first participant terminal.
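Both sides of the flag mechanism can be sketched together. The list-of-dicts structure of the decoder instruction is an assumption; only the address-plus-switch-flag semantics come from the text.

```python
def mute_channels(decoder_instruction: list, muted_addresses: set) -> list:
    """Server side: set the decode switch flag to 0 ("prohibited") for the
    channels of the terminals to be muted; all other flags stay untouched."""
    for channel in decoder_instruction:
        if channel["address"] in muted_addresses:
            channel["flag"] = 0
    return decoder_instruction


def playable_channels(decoder_instruction: list) -> list:
    """Receiver side: decode only channels whose flag is 1; flag-0 channels
    are discarded without being played."""
    return [ch["address"] for ch in decoder_instruction if ch["flag"] == 1]


# Mute terminals T2 and T3 toward the receiver of this mix:
instruction = [
    {"address": "T2", "flag": 1},
    {"address": "T3", "flag": 1},
    {"address": "T4", "flag": 1},
]
mute_channels(instruction, {"T2", "T3"})
```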
It is to be appreciated that the present application only modifies the decoding flags of the audio streams belonging to the second participant terminals in the target mixed audio stream sent to the first participant terminal; in the mixed audio streams sent to the other participant terminals, the decoding flags of the audio streams belonging to the second participant terminals still permit decoding, so normal listening to the second participant terminals' audio by the other participant terminals is not affected.
For example, as shown in fig. 5, there are 5 current speaking terminals, namely terminals 1 to 5, and terminal 1 must be unable to hear the audio of terminals 2 and 3. In the mixed audio stream sent to terminal 1, terminal 1 therefore cannot decode the audio of terminals 2 and 3 but can play the audio of terminals 4 and 5 normally, while terminals 4 and 5 can still hear the audio of terminals 2 and 3 normally.
In this embodiment, the decoding flag may be carried in the mixed audio stream itself, so that the effect of muting specified speaking terminals toward a designated participant terminal can be achieved simply by setting the decoding flags of the individual audio streams, while normal listening by the other, non-designated participant terminals is preserved. This avoids the high labor cost and low efficiency of manually closing the corresponding audio channels during the conference in order to mute some speaking terminals toward a participant terminal, thereby improving conference control efficiency.
In combination with the above embodiments, in one example, when the decoding flags of the audio streams belonging to the at least one second participant terminal in the target mixed audio stream are updated to the preset flag, the target audio streams belonging to the at least one second participant terminal may first be determined in the target mixed audio stream; the decoding flags of the other audio streams in the target mixed audio stream are then kept unchanged, and the decoding flag of each target audio stream is set to the preset flag. Here the other audio streams are the audio streams of the target mixed audio stream other than the target audio streams.
In this embodiment, the mixed audio stream generally includes the audio streams generated by all the speaking terminals in the current conference, and the at least one second participant terminal may be all of the speaking terminals or only some of them. Where the second participant terminals are only some of the speaking terminals, the first participant terminal still needs to hear the audio of the remaining speaking terminals normally, so the decoding flags of those terminals' audio streams are kept unchanged, that is, kept as flags permitting decoding. This ensures that, in the target mixed audio stream, the audio streams of the second participant terminals cannot be heard by the first participant terminal while the remaining audio streams can.
With reference to the foregoing embodiments, in an example, the mute duration of the second participant terminal to the first participant terminal may be set, so that the first participant terminal may be controlled not to decode and play the audio of each second participant terminal within the mute duration.
Accordingly, when the server cluster includes the video networking server and the protocol conversion server, the video networking server may transmit the target mixed audio stream with the updated decoding flags to the protocol conversion server, which then transmits it to the first participant terminal. In this case, based on the mute request, the protocol conversion server in the server cluster may record the duration for which the target mixed audio stream with the updated decoding flags has been sent to the first participant terminal, and generate an audio recovery instruction when this sending duration reaches the mute duration.
In this way, when receiving the audio recovery instruction, the video networking server in the server cluster may recover, based on the audio recovery instruction, the decoding flag of the audio stream belonging to the at least one second participating terminal in the target mixed audio stream to be a decodable flag, so that the first participating terminal decodes and plays the audio stream belonging to the at least one second participating terminal in the target mixed audio stream.
In this embodiment, the mute duration refers to how long the second participant terminals remain muted relative to the first participant terminal, that is, the duration during which the first participant terminal cannot hear their audio streams. In practice, when the user configures, through the conference control end, that the first participant terminal cannot hear the audio of the second participant terminals, the user can restrict this to a specified time period, which improves the flexibility of mute control.
Specifically, the protocol conversion server may record the duration for which the target mixed audio stream with the updated decoding flags has been sent to the first participant terminal. Since the target mixed audio stream with the changed decoding flags is the stream in which the second participant terminals are muted toward the first participant terminal, recording its transmission duration yields the elapsed mute time of the second participant terminals toward the first participant terminal.
When the transmission duration of the target mixed audio stream whose decoding flags were changed to the preset flag reaches the mute duration, the muting of the second participant terminals toward the first participant terminal can be cancelled, so that the first participant terminal can again hear their audio streams. To this end, the protocol conversion server may generate an audio recovery instruction of the video networking protocol and send it to the video networking server, so that the video networking server restores the decoding flag of each audio stream belonging to a second participant terminal in the target mixed audio stream currently sent to the first participant terminal to a decodable flag, for example restoring the flag from "0" to "1". After receiving the target mixed audio stream, the first participant terminal can then decode and play the audio streams of the second participant terminals.
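The duration bookkeeping can be sketched as a small timer; this is an illustration of the described behaviour, not the server's actual implementation, and the injectable clock exists only to make the sketch testable.

```python
import time


class MuteTimer:
    """Tracks how long the flag-modified mix has been sent to the first
    participant terminal and reports when the configured mute duration has
    elapsed, at which point an audio recovery instruction would be
    generated to restore the decoding flags."""

    def __init__(self, mute_duration_s: float, clock=time.monotonic):
        self.mute_duration_s = mute_duration_s
        self.clock = clock
        self.started = clock()  # sending of the modified mix begins now

    def should_restore(self) -> bool:
        return self.clock() - self.started >= self.mute_duration_s
```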
With reference to the foregoing embodiments, in an embodiment, in a conference, a speaking duration of each speaking terminal may be further defined, and when the speaking duration of the speaking terminal reaches a specified duration, the speaking terminal is controlled to be muted relative to other terminals, and accordingly, the following process may be further included:
first, a video networking speech control instruction obtained by performing video networking protocol conversion on a speech control instruction sent by the conference control end may be received.
And the video network speaking control instruction carries the identifier and speaking duration of a third participant terminal to be controlled.
In the embodiment of the present application, as shown in fig. 5, a protocol conversion server in a server cluster may perform video networking protocol conversion on a speech control instruction sent by a conference control end, so as to obtain a video networking speech control instruction, and then send the video networking speech control instruction to a video networking server.
Then, the video network server in the server cluster may record, starting from the current time, the sending duration of the audio stream belonging to the third participant terminal in each mixed audio stream in response to the video network talk control instruction.
Then, when the sending duration of the audio stream of the third participant terminal reaches the speaking duration, the video networking server in the server cluster may update the decoding flag of the audio stream belonging to the third participant terminal in the multiple paths of mixed audio streams to a preset flag, so that other participant terminals in the video conference discard the audio stream belonging to the third participant terminal according to the preset flag.
In practice, in an audio and video conference, a speaking duration often needs to be set for a speaking terminal, for example limiting a participant terminal's speech to 5 minutes. Usually this limit is enforced by the meeting host, which increases the host's workload; where the requirement on speaking duration is strict, manual control introduces large errors, and it cannot be prevented that a speaker continues speaking after the allotted time, affecting the progress of the subsequent conference.
For this reason, the conference control end of the present application may set the speaking duration of each participant terminal; of course, different participant terminals may have different or identical speaking durations, which is not limited herein. After the speaking duration is set, the conference control end can generate a speech control instruction and send it to the protocol conversion server in the server cluster, which converts the internet speech control instruction into a video networking speech control instruction. The speech control instruction carries the identifier of the third participant terminal whose speaking duration is to be controlled and that terminal's permitted speaking duration; it will be understood that the video networking speech control instruction likewise carries the identifier of the third participant terminal and the speaking duration to be enforced.
Then, when the video networking server in the server cluster receives the video networking talk control instruction, the sending duration of the audio stream belonging to the third participant terminal in each current mixed audio stream can be recorded.
When the sending duration of the third participant terminal's audio stream reaches the speaking duration, the decoding flag of the audio stream belonging to the third participant terminal in each mixed audio stream may be updated to the preset flag, for example changed from "1" to "0", so that the other participant terminals in the video conference discard that audio stream according to the preset flag. Thus, when the third participant terminal's speaking time reaches the specified length, other participants cannot hear its audio even if its speaker is still talking, ensuring that subsequent speakers can speak normally and the conference proceeds on schedule.
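The server-side bookkeeping for speaking-duration enforcement can be sketched as follows; the data shapes are illustrative assumptions, with only the "flip the flag in every mix once the limit is reached" behaviour taken from the text.

```python
def enforce_speaking_limits(send_durations: dict, limits: dict, mix_flags: dict) -> set:
    """Set a speaker's decode flag to 0 in every listener's mix once its
    audio has been forwarded for at least its permitted speaking duration.

    send_durations: speaker id -> seconds its audio has been sent
    limits:         speaker id -> permitted speaking duration in seconds
    mix_flags:      listener id -> {speaker id: decode flag}
    Returns the set of speakers that were muted.
    """
    expired = {tid for tid, sent in send_durations.items()
               if tid in limits and sent >= limits[tid]}
    for flags in mix_flags.values():
        for tid in expired & flags.keys():
            flags[tid] = 0  # other terminals will discard this stream
    return expired
```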
Of course, in an example, after the decoding flag of the audio stream belonging to the third participant terminal in each of the multiple mixed audio streams is updated to the preset flag, the video networking server in the server cluster may send muted notification information to the third participant terminal. The third participant terminal displays this information, so that its user knows that his or her speech can no longer be heard by the other terminals.
With reference to the foregoing embodiments, in an example, if a participant terminal having a speaking duration reaching a specified duration wants to speak subsequently, the participant terminal may send a speaking request to the conference control end to reapply for speaking.
Accordingly, after the decoding flag of the audio stream belonging to the third participant terminal in each mixed audio stream is updated to the preset flag, the speech of the third participant terminal can be recovered under the control of the conference control end, that is, the speech of the third participant terminal can be received by other participant terminals.
Specifically, a video networking speech request instruction may be received, obtained by performing video networking protocol conversion on the speech request instruction sent by the conference control end, where the speech request instruction was sent to the conference control end by the third participant terminal. In this embodiment, the protocol conversion server in the server cluster performs video networking protocol conversion on the speech request instruction sent by the conference control end and then sends the converted instruction to the video networking server.
Then, the video network server in the server cluster may respond to the video network speaking instruction, and restore the decoding mark of the audio stream belonging to the third participant terminal in each mixed audio stream of the multiple mixed audio streams to a decodable mark, so that other participant terminals in the video conference decode and play the audio stream belonging to the third participant terminal.
In this embodiment, after the third participant terminal has been muted, if it wishes to continue speaking it must apply to the conference control end: in a specific implementation, the third participant terminal sends a speech request instruction to the conference control end, which forwards it to the protocol conversion server.
In one scenario, the third participant terminal may be a participant terminal in the internet, and the protocol conversion server needs to convert the speech request instruction into a video networking speech instruction of the video networking protocol before sending it to the video networking server.
In another scenario, the third participant terminal may be a participant terminal in the video network, and the protocol conversion server may send the speech request instruction to the video networking server directly.
The video network server can respond to the video network speaking instruction, and restore the decoding mark of the audio stream belonging to the third participating terminal in each mixed audio stream to a decodable mark, so that other participating terminals in the video conference decode and play the audio stream belonging to the third participating terminal. For example, the decoding flag of the audio stream of the third participant terminal in the mixed audio stream is changed from "0" to "1", so that the other participant terminals can listen to the speech of the third participant terminal.
With reference to the foregoing embodiments, in an example, a participant terminal may need to join or quit the conference while it is in progress. To quit, a conference exit instruction may be sent to the server; specifically, a video networking conference exit instruction may be sent to the video networking server through the protocol conversion server.
Correspondingly, the server may receive a video networking conference quitting instruction obtained by performing video networking protocol conversion on the conference quitting instruction sent by the conference control terminal, where the video networking conference quitting instruction carries an identifier of a fourth participant terminal quitting the video conference.
In this embodiment of the application, the protocol conversion server may perform video networking protocol conversion on the conference exit instruction sent by the conference control end and send the resulting video networking conference exit instruction to the video networking server. The video networking server can respond to this instruction by closing the audio channel with the fourth participant terminal, i.e. the terminal that wishes to exit the video networking conference, so that no mixed audio stream contains its audio stream. With this embodiment, when a participant terminal needs to quit the video conference, the network resources allocated to it can be released in time and reallocated to participant terminals newly joining the conference, improving the utilization of video networking resources.
In this embodiment, the coordination server may, in response to instructions from the conference control end, pull participant terminals into the conference or withdraw them from it. After a participant terminal exits the video networking conference, no video networking resources need to be allocated to it, so the video networking server can close the audio channel with the fourth participant terminal. Once that channel is closed, the video networking server no longer receives audio sent by the fourth participant terminal, so no mixed audio stream contains the fourth participant terminal's audio stream.
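The exit handling above can be sketched as follows: closing the audio channel removes the terminal from the set of sources the mixer draws from, so no subsequent mix contains its audio. All names and structures are illustrative assumptions, not the patent's implementation.

```python
def close_audio_channel(open_channels, leaving_id):
    """Closing the channel means the server stops receiving this terminal's audio."""
    open_channels.discard(leaving_id)

def build_mix(open_channels, incoming_frames):
    # Only audio arriving over a still-open channel is included in the mix.
    return {tid: f for tid, f in incoming_frames.items() if tid in open_channels}

channels = {"T1", "T2", "T4"}
close_audio_channel(channels, "T4")  # T4 exits the video networking conference
mix = build_mix(channels, {"T1": b"a1", "T4": b"a4"})
print("T4" in mix)  # -> False
```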
In one scenario, the fourth participant terminal may actively send a conference exit request to the conference control end, which then forwards the request to the coordination server.
In another scenario, the conference control end may designate the fourth participant terminal to exit the conference: based on the operation of a user removing the fourth participant terminal from the conference, the conference control end generates a conference exit instruction and sends it to the coordination server.
With the technical solution of this embodiment, in an audio and video conference, while the audio channels of all participant terminals are kept open, the speech audio of some participant terminals can be set not to be heard by a given participant terminal, and the speaking duration of each participant terminal can be set. Speech in the conference can thus be controlled flexibly without manual intervention, improving the efficiency of the video conference.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 7, there is shown a video networking based conference control apparatus. The system includes a conference control terminal located in the Internet and a server connected to the conference control terminal; the apparatus is located in the server and may specifically include the following modules:
a request receiving module 701, configured to receive the encoder control instruction obtained after performing video networking protocol conversion on the mute request sent by the conference control end;
a determining module 702, configured to determine a first participating terminal and at least one second participating terminal corresponding to the encoder control instruction;
the instruction response module 703 is configured to search for a target mixed audio stream sent to the first participant terminal from the current multiple mixed audio streams; each mixed audio stream in the multiple mixed audio streams is an audio stream obtained after the video networking server mixes the audio streams sent by a plurality of currently speaking terminals;
a tag changing module 704, configured to update the decoding tag of the audio stream, in the target mixed audio stream, that belongs to the at least one second participant terminal to a preset tag, so that the first participant terminal discards the audio stream belonging to the at least one second participant terminal in the target mixed audio stream according to the preset tag.
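The receiver-side behavior implied by the tag changing module can be sketched as follows: the first participant terminal walks the sub-streams of its mixed audio frame and discards any sub-stream whose decoding tag equals the preset mark. The frame layout is an assumption for illustration only.

```python
# Preset mark meaning "discard without decoding", per the "0" example above.
PRESET_MARK = "0"

def playable_streams(mixed_frame):
    """Return only the sub-stream payloads the terminal should decode and play."""
    return [
        payload
        for source_id, (flag, payload) in mixed_frame.items()
        if flag != PRESET_MARK  # preset mark => drop this source's audio
    ]

frame = {"T2": ("1", b"audio-2"), "T3": ("0", b"audio-3")}
print(playable_streams(frame))  # only T2's payload survives
```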
Optionally, the mark changing module 704 may specifically include the following units:
a determining unit, configured to determine, in the target mixed audio stream, a target audio stream belonging to the at least one second participant terminal;
and the changing unit is used for setting the decoding mark of the target audio stream as a preset mark and keeping the decoding marks of other audio streams except the target audio stream in the target mixed audio stream unchanged.
Optionally, the mute request further carries a mute duration, and the encoder control instruction also includes the mute duration; the apparatus may further include the following modules:
the first instruction receiving module is used for generating an audio recovery instruction when the sending duration of the target mixed audio stream after the decoding mark is updated is determined to reach the mute duration;
and the mark recovery module is used for recovering the decoding marks of the audio streams belonging to the at least one second participating terminal in the target mixed audio stream into decodable marks based on the audio recovery instruction, so that the first participating terminal decodes and plays the audio streams belonging to the at least one second participating terminal in the target mixed audio stream.
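The mute-duration logic above can be sketched as follows: once the target mixed stream has been sent with the updated decoding mark for the whole mute duration, an audio recovery step restores the decodable mark. Timing is simplified to a per-frame counter; a real implementation would use wall-clock time, and all names here are hypothetical.

```python
DECODABLE, MUTED = "1", "0"

def tick(state, frame_period_ms):
    """Advance the mute timer by one sent frame; restore flags on expiry."""
    state["elapsed_ms"] += frame_period_ms
    if state["elapsed_ms"] >= state["mute_duration_ms"]:
        # equivalent of generating and applying the audio recovery instruction
        for tid in state["muted_terminals"]:
            state["flags"][tid] = DECODABLE
        state["muted_terminals"] = []
    return state

state = {
    "flags": {"T2": MUTED},
    "muted_terminals": ["T2"],
    "mute_duration_ms": 60,
    "elapsed_ms": 0,
}
for _ in range(3):           # three 20 ms frames reach the 60 ms mute duration
    tick(state, 20)
print(state["flags"]["T2"])  # -> 1
```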
Optionally, the apparatus may further include the following modules:
the second instruction receiving module is used for receiving a video networking speech control instruction obtained by performing video networking protocol conversion on the speech control instruction sent by the conference control end; the video network speech control instruction carries an identifier and speech duration of a third participant terminal to be controlled;
the instruction response module 703 is further configured to respond to the video networking speech control instruction, and record, starting from the current time, a sending duration of an audio stream belonging to the third participant terminal in the multiple paths of mixed audio streams;
the tag changing module 704 is further configured to, when the sending duration of the audio stream of the third participating terminal reaches the speaking duration, update a decoding tag of the audio stream belonging to the third participating terminal in each mixed audio stream of the multiple mixed audio streams to a preset tag, so that other participating terminals in the video conference discard the audio stream belonging to the third participating terminal according to the preset tag.
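The speaking-duration control above can be sketched as follows: from the moment the video networking speech control instruction arrives, the server accumulates how long the third terminal's audio has been sent; when the allowed speaking duration is reached, its decoding mark is set to the preset mark in every mixed stream. The structures are illustrative assumptions, not the patent's implementation.

```python
PRESET_MARK = "0"

def account_speech(mixes, speaker_id, sent_ms, limit_ms, elapsed_ms):
    """Accumulate sending duration; ban the speaker in all mixes on expiry."""
    elapsed_ms += sent_ms
    if elapsed_ms >= limit_ms:        # speaking duration used up
        for mix in mixes.values():
            if speaker_id in mix:
                mix[speaker_id] = PRESET_MARK
    return elapsed_ms

# T3 is granted 5 seconds of speech; after five 1000 ms frames it is banned
# in the mixes sent to T1 and T2.
mixes = {"T1": {"T3": "1"}, "T2": {"T3": "1"}}
elapsed = 0
for _ in range(5):
    elapsed = account_speech(mixes, "T3", 1000, 5000, elapsed)
print(mixes["T1"]["T3"], mixes["T2"]["T3"])  # -> 0 0
```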
Optionally, the apparatus may further include the following modules:
the third instruction receiving module is used for receiving a video networking speaking instruction obtained by performing video networking protocol conversion on the speaking request instruction sent by the conference control terminal, wherein the speaking request instruction is sent to the conference control terminal by the third participant terminal;
and the speaking instruction response module is used for responding to the video networking speaking instruction, and respectively restoring the decoding marks of the audio streams belonging to the third participant terminal in each mixed audio stream in the multiple mixed audio streams into decodable marks so as to enable other participant terminals in the video conference to decode and play the audio streams belonging to the third participant terminal.
Optionally, the apparatus may further include the following modules:
the fourth instruction receiving module is used for receiving a video networking conference quitting instruction obtained after video networking protocol conversion is carried out on the conference quitting instruction sent by the conference control terminal; the video networking conference exit instruction carries an identifier of a fourth participant terminal which exits the video conference;
and the conference instruction response module is used for responding to the video networking conference exit instruction and closing the audio channel with the fourth participant terminal, so that each mixed audio stream does not contain the audio stream of the fourth participant terminal.
It should be noted that the device embodiments are similar to the method embodiments, so that the description is simple, and reference may be made to the method embodiments for relevant points.
An embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video networking based conference control method according to the above embodiments of the invention.
Embodiments of the present invention further provide a computer-readable storage medium, which stores a computer program to enable a processor to execute the video network-based conference control method according to any of the above embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process, such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The conference control method, device, equipment, and storage medium based on video networking provided by the present invention have been introduced in detail above. Specific examples are used herein to explain the principle and implementation of the invention; the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (9)

1. A conference control method based on video networking, characterized in that a system to which the method is applied comprises a conference control terminal located in the Internet and a server connected to the conference control terminal; the method is applied to the server and comprises the following steps:
receiving an encoder control instruction obtained after performing video networking protocol conversion on a mute request sent by the conference control terminal;
determining a first participant terminal and at least one second participant terminal corresponding to the encoder control instruction;
searching a target mixed audio stream sent to the first participating terminal from the current multi-channel mixed audio stream; each mixed audio stream in the multiple paths of mixed audio streams is an audio stream obtained by mixing audio streams sent by a plurality of current speaking terminals;
and updating the decoding mark of the audio stream belonging to the at least one second participating terminal in the target mixed audio stream into a preset mark, so that the first participating terminal discards the audio stream belonging to the at least one second participating terminal in the target mixed audio stream according to the preset mark.
2. The method of claim 1, wherein updating the decoding flag of the audio stream belonging to the at least one second participant terminal in the target mixed audio stream to a preset flag comprises:
determining a target audio stream belonging to the at least one second participant terminal in the target mixed audio stream;
setting the decoding mark of the target audio stream as a preset mark, and keeping the decoding marks of other audio streams in the target mixed audio stream unchanged; wherein the other audio streams are audio streams of the target mixed audio stream except the target audio stream.
3. The method of claim 1, wherein the mute request further carries a mute duration, and the encoder control instruction includes the mute duration; the method further comprises the following steps:
generating an audio recovery instruction when the sending duration of the target mixed audio stream after the decoding mark is updated is determined to reach the mute duration;
and restoring the decoding marks of the audio streams belonging to the at least one second participating terminal in the target mixed audio stream into decodable marks based on the audio restoring instruction, so that the first participating terminal decodes and plays the audio streams belonging to the at least one second participating terminal in the target mixed audio stream.
4. The method of claim 1, further comprising:
receiving a video networking speech control instruction obtained by performing video networking protocol conversion on the speech control instruction sent by the conference control end; the video network speech control instruction carries an identifier and speech duration of a third participant terminal to be controlled;
responding to the video networking speaking control instruction, and recording the sending duration of the audio stream belonging to the third participant terminal in the multi-channel mixed audio stream from the current moment;
and when the sending time of the audio stream of the third participant terminal reaches the speaking time, updating the decoding mark of the audio stream belonging to the third participant terminal in each mixed audio stream in the multi-channel mixed audio stream to the preset mark, so that other participant terminals in the video conference discard the audio stream belonging to the third participant terminal according to the preset mark.
5. The method as claimed in claim 4, wherein after updating the decoding flag of the audio stream of the multi-channel mixed audio stream belonging to the third participant terminal to the preset flag, the method further comprises:
receiving a video networking speaking instruction obtained after video networking protocol conversion is carried out on the speaking request instruction sent by the conference control terminal, wherein the speaking request instruction is sent to the conference control terminal by the third participant terminal;
and responding to the video networking speaking instruction, and respectively restoring the decoding marks of the audio streams belonging to the third participant terminal in each mixed audio stream in the multiple mixed audio streams into decodable marks, so that other participant terminals in the video conference decode and play the audio streams belonging to the third participant terminal.
6. The method of claim 1, further comprising:
receiving a video networking conference quitting instruction obtained by performing video networking protocol conversion on the conference quitting instruction sent by the conference control terminal, wherein the video networking conference quitting instruction carries an identifier of a fourth participant terminal quitting the video conference;
and responding to the video networking conference quit instruction, closing an audio channel with the fourth participant terminal, so that each mixed audio stream does not contain the audio stream of the fourth participant terminal.
7. A conference control device based on video networking, characterized in that a system to which the device is applied comprises a conference control terminal located in the Internet and a server connected to the conference control terminal; the device is located at the server and comprises:
the request receiving module is used for receiving an encoder control instruction obtained after performing video networking protocol conversion on the mute request sent by the conference control terminal;
the determining module is used for determining a first participating terminal and at least one second participating terminal corresponding to the encoder control instruction;
the instruction response module is used for searching for a target mixed audio stream sent to the first participant terminal from the current multiple mixed audio streams; each mixed audio stream in the multiple mixed audio streams is an audio stream obtained after the video networking server mixes the audio streams sent by a plurality of currently speaking terminals;
and the tag changing module is used for updating the decoding tag of the audio stream, in the target mixed audio stream, that belongs to the at least one second participant terminal to a preset tag, so that the first participant terminal discards the audio stream belonging to the at least one second participant terminal in the target mixed audio stream according to the preset tag.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video networking based conference control method of any of claims 1-6.
9. A computer-readable storage medium storing a computer program for causing a processor to execute the video network-based conference control method according to any one of claims 1 to 6.
CN202011590205.1A 2020-12-28 2020-12-28 Conference control method, device, equipment and medium based on video networking Active CN112751849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011590205.1A CN112751849B (en) 2020-12-28 2020-12-28 Conference control method, device, equipment and medium based on video networking


Publications (2)

Publication Number Publication Date
CN112751849A true CN112751849A (en) 2021-05-04
CN112751849B CN112751849B (en) 2024-04-09

Family

ID=75646556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011590205.1A Active CN112751849B (en) 2020-12-28 2020-12-28 Conference control method, device, equipment and medium based on video networking

Country Status (1)

Country Link
CN (1) CN112751849B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113489736A (en) * 2021-07-14 2021-10-08 海南视联通信技术有限公司 Method, device, equipment and storage medium for realizing multimedia conference
WO2024016871A1 (en) * 2022-07-18 2024-01-25 中兴通讯股份有限公司 Method for lifting speaking ban at video conference, and device and computer-readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050062843A1 (en) * 2003-09-22 2005-03-24 Bowers Richard D. Client-side audio mixing for conferencing
CN102025972A (en) * 2010-12-16 2011-04-20 中兴通讯股份有限公司 Mute indication method and device applied for video conference
CN111182258A (en) * 2020-02-11 2020-05-19 视联动力信息技术股份有限公司 Data transmission method and device for network conference
CN111225177A (en) * 2020-02-20 2020-06-02 视联动力信息技术股份有限公司 Video conference processing method and device, electronic equipment and storage medium
CN112104836A (en) * 2020-11-23 2020-12-18 全时云商务服务股份有限公司 Audio mixing method, system, storage medium and equipment for audio server


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Long et al., "Research on a system and method for implementing large-capacity conference service based on a single server", Electronic Design Engineering, vol. 25, no. 6, 20 March 2017 (2017-03-20) *


Also Published As

Publication number Publication date
CN112751849B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN108574688B (en) Method and device for displaying participant information
CN109640028B (en) Method and device for carrying out conference combining on multiple video networking terminals and multiple Internet terminals
CN110049271B (en) Video networking conference information display method and device
CN109302576B (en) Conference processing method and device
CN109618120B (en) Video conference processing method and device
CN109120879B (en) Video conference processing method and system
CN109547728B (en) Recorded broadcast source conference entering and conference recorded broadcast method and system
CN110460804B (en) Conference data transmitting method, system, device and computer readable storage medium
CN108616487B (en) Audio mixing method and device based on video networking
CN110545395B (en) Video networking conference switching method and device
CN108809921B (en) Audio processing method, video networking server and video networking terminal
CN110049273B (en) Video networking-based conference recording method and transfer server
CN109862307B (en) Video conference initiating method and device
CN109788235B (en) Video networking-based conference recording information processing method and system
CN109194902B (en) Hierarchical conference scheduling method and system
CN111478884B (en) Method, device and storage medium for data transmission among different video networks
CN110149305B (en) Video network-based multi-party audio and video playing method and transfer server
CN112751849B (en) Conference control method, device, equipment and medium based on video networking
CN111327868A (en) Method, terminal, server, device and medium for setting conference speaking party role
CN110430385B (en) Video conference processing method, device and storage medium
CN109286775B (en) Multi-person conference control method and system
CN111641800A (en) Method and device for realizing conference
CN109005378B (en) Video conference processing method and system
CN110719432A (en) Data transmission method and device, electronic equipment and storage medium
CN110049275B (en) Information processing method and device in video conference and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant