CN108924465B - Method, device, equipment and storage medium for determining speaker terminal in video conference - Google Patents


Info

Publication number
CN108924465B
Authority
CN
China
Prior art keywords
audio
terminal
determining
audio packet
packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810669467.3A
Other languages
Chinese (zh)
Other versions
CN108924465A (en)
Inventor
王运璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shizhen Information Technology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Guangzhou Shizhen Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd, Guangzhou Shizhen Information Technology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810669467.3A priority Critical patent/CN108924465B/en
Publication of CN108924465A publication Critical patent/CN108924465A/en
Application granted granted Critical
Publication of CN108924465B publication Critical patent/CN108924465B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/1066Session management
    • H04L65/1083In-session procedures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40Support for services or applications
    • H04L65/403Arrangements for multi-party communication, e.g. for conferences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60Network streaming of media packets
    • H04L65/65Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/63Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643Communication protocols
    • H04N21/6437Real-time Transport Protocol [RTP]

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for determining a speaker terminal in a video conference. The method comprises the following steps: acquiring the audio level of an audio packet from a terminal and the timestamp information of the audio packet; extracting, in timestamp order, the audio packets contained in a set time window as target audio packets; setting a corresponding weight value for each target audio packet according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by its corresponding weight value; superposing the target audio levels and taking the superposition result as the audio level of the terminal; and determining the terminal corresponding to the maximum audio level at the current moment as the video conference speaker terminal. A time-window counting method is adopted to determine the speaker terminal of the video conference, and the size of the time window is adjusted automatically, so that the speaker terminal is switched more sensitively and more stably.

Description

Method, device, equipment and storage medium for determining speaker terminal in video conference
Technical Field
The present invention relates to communications technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a speaker terminal in a video conference.
Background
Video conferencing refers to a conference in which people at two or more locations have a face-to-face conversation via communication devices and a network. In the video conference, the participants can hear the sound of other meeting places, see the images, actions and expressions of participants in other meeting places, and can also share electronic demonstration content.
There are often more than two terminals in a video conference, and the client often faces the problem that the number of display windows is smaller than the number of terminals in the conference. In practical video conferencing systems, there is also a need to quickly draw the attention of the participants to the person currently speaking. Therefore, how to determine the speaker terminal in a video conference system is an urgent problem to be solved.
In the process of implementing the invention, the inventor found that the prior art has at least the following problems. In one approach, the terminal sends the speaking state of the participant to the server at a certain frequency and the server judges the current conference speaker terminal, but a delay of several seconds gives users a poor experience. In another, the number of times each participant's audio samples appear in a preset frequency band is counted to judge the current speaker terminal; this removes only noise in some specific frequency bands, places high requirements on the environment, and switches the speaker terminal relatively slowly. In a third, the terminal judges the current speaker terminal according to collected consecutive voice signals reaching a preset length threshold; when there are many users and the voice signals of several terminals reach the preset length at the same time, the speaker terminal is difficult to determine.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for determining a video conference speaker terminal, so that the video conference speaker terminal is switched more sensitively and more stably.
In a first aspect, an embodiment of the present invention provides a method for determining a speaker terminal in a video conference, where the method includes: acquiring the audio level of an audio packet from a terminal and the timestamp information of the audio packet; extracting, in timestamp order, the audio packets contained in a set time window as target audio packets; setting a corresponding weight value for each target audio packet according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by its corresponding weight value; superposing the target audio levels and taking the superposition result as the audio level of the terminal; and determining the terminal corresponding to the maximum audio level at the current moment as the video conference speaker terminal.
In a second aspect, an embodiment of the present invention further provides a device for determining a speaker terminal in a video conference, where the device includes: an acquisition module, configured to acquire the audio level of an audio packet from a terminal and the timestamp information of the audio packet; a target audio packet extraction module, configured to extract, in timestamp order, the audio packets contained in a set time window as target audio packets; a target audio level determining module, configured to set a corresponding weight value for each target audio packet according to a set rule, and to determine the target audio level obtained by multiplying the audio level of each target audio packet by its corresponding weight value; a superposition module, configured to superpose the target audio levels and take the superposition result as the audio level of the terminal; and a speaker terminal determining module, configured to determine the terminal corresponding to the maximum audio level at the current moment as the video conference speaker terminal.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for determining a video conference speaker terminal according to any one of the embodiments of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for determining a video conference speaker terminal according to any one of the embodiments of the present invention.
In the embodiment of the invention, the audio level of an audio packet from a terminal and the timestamp information of the audio packet are obtained; the audio packets contained in a set time window are extracted, in timestamp order, as target audio packets; a corresponding weight value is set for each target audio packet according to a set rule, and the target audio level obtained by multiplying the audio level of each target audio packet by its corresponding weight value is determined; the target audio levels are superposed, and the superposition result is taken as the audio level of the terminal; and the terminal corresponding to the maximum audio level at the current moment is determined to be the video conference speaker terminal. A time-window counting method is adopted to determine the speaker terminal of the video conference, and the size of the time window is adjusted automatically, so that the speaker terminal is switched more sensitively and more stably.
Drawings
Fig. 1a is a flowchart of a method for determining a speaker terminal in a video conference according to a first embodiment of the present invention;
FIG. 1b is a diagram illustrating an audio packet transmission process according to a first embodiment of the present invention;
fig. 1c is a schematic diagram of audio levels of audio packets and time stamp information of the audio packets at different time instants, which is applicable in one embodiment of the present invention;
fig. 2 is a flowchart of a method for determining a speaker terminal in a video conference according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a determining apparatus of a video conference speaker terminal in a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
In the embodiment of the invention, a conference in which people in multiple places hold face-to-face conversations through communication equipment and a network is called a video conference, where the communication equipment includes smart conference tablets, smartphones, smart televisions and the like. Different communication devices have different screen display sizes. When the pictures of several (for example, 4) participants need to be displayed on the screen of a communication device, display window 1 in the upper left corner may show Beijing, display window 2 in the upper right corner Shanghai, display window 3 in the lower left corner Guangzhou, and display window 4 in the lower right corner Shenzhen. When the screen of the communication device is too small (e.g., a smartphone), displaying all participants at once results in a display window that is too small for each participant. In view of this problem, the embodiment of the invention determines the current video conference speaker terminal, after which subsequent operations can be performed, such as highlighting or enlarging the display picture of that speaker terminal.
Example one
Fig. 1a is a flowchart of a method for determining a speaker terminal in a video conference according to an embodiment of the present invention. This embodiment is applicable to situations where the attention of conference participants needs to be quickly transferred to the speaker terminal in a conference. The method may be executed by the device for determining a speaker terminal in a video conference provided by the embodiment of the present invention, and the device may be implemented in software and/or hardware. Referring to fig. 1a, the method may specifically include the following steps:
and S110, acquiring the audio level of the audio packet from the terminal and the time stamp information of the audio packet.
Specifically, for example, an audio transmission scheme based on WebRTC (Web Real-Time Communication) may be used. WebRTC is a technology that supports real-time voice or video conversation in a Web browser and thus enables a Web-based video conference. In the process of collecting and sending the audio of the participants, the terminal attaches the audio level of the current audio packet to each RTP (Real-time Transport Protocol) packet. The audio level is denoted AudioLevel.
Fig. 1b shows a schematic diagram of the audio packet transmission process: a terminal sends audio packets to a server at certain time intervals, and the server parses the audio packets sent by each terminal to obtain the audio level of the audio packets from that terminal. Taking an intelligent conference tablet as an example of a terminal, the tablet sends an audio packet carrying an audio level to the server, and the server acquires the audio level of the audio packet from the tablet. A timestamp is a complete, verifiable piece of data, usually a sequence of characters, that uniquely identifies a particular moment in time. The timestamp information in the embodiments of the present invention uniquely identifies the transmission time of the corresponding audio packet.
In a specific example, referring to fig. 1b, the audio levels of the audio packets are denoted L1, L2, L3, L4, L5, L6, and L7, and the corresponding timestamp information is denoted T1, T2, T3, T4, T5, T6, and T7, in that order. The smaller the number, the earlier the audio packet was transmitted; the 7 audio packets are used only for illustration and do not limit the number of audio packets. Optionally, the frequency at which each terminal sends audio packets to the server may be fixed or variable; to improve the accuracy of determining the speaker terminal in the video conference, the sending frequencies of different terminals need to be kept consistent.
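The extraction of the audio level attached to each RTP packet can be sketched as below. The patent text only states that the level is attached to each packet; the one-byte field layout assumed here follows the common client-to-mixer audio-level RTP header extension (RFC 6464), and the function name is illustrative:

```python
def parse_audio_level(ext_payload: bytes) -> int:
    """Return an audio level in [0, 127], larger meaning louder.

    Assumed layout (RFC 6464 one-byte extension): the top bit of the
    first byte is a voice-activity flag, and the lower 7 bits carry the
    level in -dBov (0 = loudest, 127 = silence).
    """
    level_dbov = ext_payload[0] & 0x7F
    # Invert so that a larger value means a louder packet, which lets the
    # server compare and superpose levels directly.
    return 127 - level_dbov
```

A usage sketch: a payload byte of `0x80 | 30` (voice active, 30 dB below overload) yields a level of 97.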
And S120, extracting the audio packets contained in the set time window as target audio packets according to the time stamp information sequence.
The server may specify the size of the set time window, for example, 1 second, and modify it by editing a configuration file. During the video conference, the server records the audio level of each audio packet contained in the set time window, and extracts, in timestamp order, the audio packets contained in the window as target audio packets.
In a specific example, the time window is set to 1 second and the audio packets are sent every 200 milliseconds. When an audio packet is received, its audio level is extracted and stored; then, using the timestamp information, the audio levels of packets older than 1 second are deleted. Fig. 1c shows a schematic diagram of the audio levels of the audio packets and their timestamp information at different moments: 160 denotes that the target audio packets at time T5 are the packets corresponding to audio levels L1, L2, L3, L4 and L5; 170 denotes that the target audio packets at time T6 are the packets corresponding to audio levels L2, L3, L4, L5 and L6; and 180 denotes that the target audio packets at time T7 are the packets corresponding to audio levels L3, L4, L5, L6 and L7.
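The timestamp-based pruning described in S120 can be sketched as a small sliding-window container. The class name and the deque-based design are illustrative, not taken from the patent:

```python
from collections import deque

class LevelWindow:
    """Keeps (timestamp_ms, audio_level) pairs inside a set time window."""

    def __init__(self, window_ms: int = 1000):
        self.window_ms = window_ms
        self.packets = deque()  # pairs ordered by timestamp

    def add(self, timestamp_ms: int, level: float) -> None:
        self.packets.append((timestamp_ms, level))
        # Drop packets whose timestamps fall outside the set time window.
        while self.packets and timestamp_ms - self.packets[0][0] >= self.window_ms:
            self.packets.popleft()

    def levels(self) -> list:
        """Audio levels of the current target audio packets, oldest first."""
        return [lvl for _, lvl in self.packets]
```

With a 1-second window and packets every 200 ms, adding the sixth packet evicts the first, matching the way the window slides from 160 to 170 in fig. 1c.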
S130, setting corresponding weight values for the target audio packets according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value.
After the target audio packets and their corresponding audio levels are determined, corresponding weight values are set for the target audio packets according to a set rule. The weight values of the target audio packets may be the same or different; they may be constant or time-varying; they may also be parameters related to the audio level of the target audio packet, and so on. The audio level of each target audio packet is multiplied by its corresponding weight value to obtain the target audio level.
Illustratively, three packets arrive in chronological order within the time window, with audio levels L1, L2, and L3, respectively. L3, as the latest arriving packet, corresponds to the largest weight value; L1, as the earliest arriving packet, corresponds to the smallest. If the weight values are 2, 3, and 4, respectively, the target audio levels are 2 × L1, 3 × L2, and 4 × L3.
And S140, superposing the target audio level, and taking the superposed result as the audio level of the terminal.
Specifically, the target audio levels obtained by multiplying by the corresponding weight values are superimposed, and the superposition result is used as the audio level of the terminal. The audio level of the terminal refers to the audio level of a given terminal at a given moment; it is related to the number of audio packets received at the current and historical moments and to the audio level of each packet. Illustratively, when the time window contains three packets with audio levels L1, L2, and L3 and weight values 2, 3, and 4, respectively, the audio level of the terminal is L = 2 × L1 + 3 × L2 + 4 × L3. As shown in fig. 1c, assuming the weight values of the audio packets (oldest to newest) are 1, 2, 3, 4, and 5, respectively, the audio level of the terminal at time T5 is 1 × L1 + 2 × L2 + 3 × L3 + 4 × L4 + 5 × L5; at time T6 it is 1 × L2 + 2 × L3 + 3 × L4 + 4 × L5 + 5 × L6; and at time T7 it is 1 × L3 + 2 × L4 + 3 × L5 + 4 × L6 + 5 × L7. The server updates the recorded audio level of the terminal with the calculation result at each moment.
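The weighted superposition of S130 and S140 can be sketched as a single function. The weights passed in the usage note below (1 through 5, oldest to newest) are an assumption consistent with the rule that newer packets receive larger weights; the patent does not fix particular values:

```python
def terminal_audio_level(levels, weights):
    """Superpose the weighted target audio levels of one terminal.

    `levels` and `weights` are both ordered oldest-first; each target
    audio level is the packet's audio level times its weight value.
    """
    assert len(levels) == len(weights), "one weight per target audio packet"
    return sum(w * lvl for w, lvl in zip(weights, levels))
```

For example, with levels 10, 20, 30, 40, 50 and weights 1, 2, 3, 4, 5, the terminal's audio level is 1×10 + 2×20 + 3×30 + 4×40 + 5×50 = 550.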
And S150, determining that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal.
For each terminal, the method provided by the embodiment of the invention is adopted to determine that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal. In a specific example, the terminal takes an intelligent conference tablet as an example, if two intelligent conference tablets are present in a conference scene and are respectively recorded as an intelligent conference tablet a and an intelligent conference tablet B, at a certain moment, the audio level of the intelligent conference tablet a is 100, and the audio level of the intelligent conference tablet B is 150, it may be determined that the intelligent conference tablet B is the speaker terminal of the video conference at the moment. In addition, the video conference speaker terminal can be highlighted so as to transfer the attention of the participants to the current conference speaker.
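Step S150 then reduces to taking the terminal with the maximum audio level at the current moment. A sketch, with terminal identifiers as illustrative placeholders:

```python
def pick_speaker(levels_by_terminal: dict) -> str:
    """Return the terminal id whose current audio level is the maximum."""
    if not levels_by_terminal:
        raise ValueError("no terminals in the conference")
    return max(levels_by_terminal, key=levels_by_terminal.get)
```

With the example above (tablet A at level 100, tablet B at 150), tablet B is chosen as the speaker terminal.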
In the embodiment of the invention, the audio level of an audio packet from a terminal and the time stamp information of the audio packet are obtained; extracting audio packets contained in a set time window as target audio packets according to the time stamp information sequence; setting corresponding weight values for the target audio packets according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value; superposing the target audio level, and taking a superposition result as the audio level of the terminal; and determining that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal. A time window counting method is adopted to determine the speaker terminal of the video conference, and the size of the time window is automatically adjusted, so that the speaker terminal is switched more sensitively and more stably.
Optionally, obtaining the audio level of the audio packet and the timestamp information of the audio packet from the terminal may be implemented as follows: determining the client that sent the audio packet according to the client identifier carried in the audio packet; determining the terminal corresponding to that client according to the correspondence between clients and terminals; and determining the audio level of the audio packet from the terminal and the timestamp information of the audio packet.
Each audio packet carries corresponding identification data, including the client identifier of the client from which the audio packet originates. The client identifier may be the SSRC (Synchronization Source) identifier, a 32-bit value in the RTP header, so the client identifier is independent of the network address; usually, a change of microphone, audio interface, camera, or video interface causes the SSRC to change. Thus, upon receipt of an audio packet, the client from which it originated can be determined; the client may be an XXX video conferencing system, or the like.
In a specific example, again taking intelligent conference tablets as terminals: an XXX video conference system is configured in intelligent conference tablet A and a YYYY video conference system in intelligent conference tablet B, and the terminal corresponding to the client from which an audio packet originates is determined according to the correspondence between clients and terminals. When the same type of client is configured in several intelligent conference tablets, the corresponding tablet can be determined according to the factory identifier of the client. The audio level of the audio packet from the terminal and the timestamp information of the audio packet are then determined.
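The SSRC-to-client-to-terminal resolution described above can be sketched with two lookup tables. All names and table contents here are illustrative placeholders, not values from the patent:

```python
def terminal_for_packet(ssrc: int,
                        client_by_ssrc: dict,
                        terminal_by_client: dict) -> str:
    """Resolve the terminal that sent an audio packet via its SSRC.

    client_by_ssrc: SSRC identifier -> client (e.g. a conference system).
    terminal_by_client: client -> terminal that runs it.
    """
    client = client_by_ssrc[ssrc]        # client determined from the SSRC
    return terminal_by_client[client]    # terminal via the client mapping
```

For instance, an SSRC mapped to a client that is in turn mapped to intelligent conference tablet A resolves to tablet A.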
On the basis of the above technical solution, the technical solution provided by the embodiment of the present invention further includes: adjusting the size of the set time window when it is detected that the switching frequency of the video conference speaker terminal is greater than a first switching-frequency threshold; and fixing the set time window when it is detected that the switching frequency of the video conference speaker terminal is less than or equal to a second switching-frequency threshold.
The first switching-frequency threshold may be, for example, one switch every 4 seconds, and the second one switch every 15 seconds. If the server detects that the video conference speaker terminal has changed every 5 seconds on average over the last 2 minutes, it can judge that its set time window is too small, increase the window by two audio-packet time intervals, and continue to record the switching frequency. The set time window is fixed when the speaker switching frequency falls below once every 10 seconds.
Considering the effects of non-uniform environmental noise and artificial noise, even a terminal that is not speaking may, at some instant, have a larger audio level than the terminal that is speaking. If the latest audio level were used directly as the audio level of the terminal and the video conference speaker terminal determined from it, the speaker terminal would switch frequently, which would disturb users; smoothing processing is therefore required.
It should be noted that the set time window used by the server when accumulating audio packet levels may be variable. When the conference starts, the set time window takes a default value, and the switching frequency of the video conference speaker terminal is measured. When the switching frequency is too high, the default window is considered too small for the current scene and can be increased automatically. Counting the audio level over a set time window makes switching of the terminal with the maximum audio level smoother and, to a certain extent, eliminates the influence of uneven environmental noise and artificial noise.
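The automatic window adjustment can be sketched as below. The thresholds (4 s and 15 s) and the two-packet-interval increment follow the example values in the text; the function shape and names are assumptions:

```python
def adjust_window(window_ms: int,
                  switch_interval_s: float,
                  packet_interval_ms: int = 200,
                  fast_threshold_s: float = 4.0,
                  slow_threshold_s: float = 15.0) -> int:
    """Return the new set-time-window size in milliseconds.

    switch_interval_s is the average time between speaker-terminal
    switches; a small interval means a high switching frequency.
    """
    if switch_interval_s < fast_threshold_s:
        # Switching faster than the first threshold: the window is too
        # small, so enlarge it by two audio-packet time intervals.
        return window_ms + 2 * packet_interval_ms
    if switch_interval_s > slow_threshold_s:
        # Switching slower than the second threshold: fix the window.
        return window_ms
    return window_ms  # in between: leave the window unchanged
```

For example, switches every 3 seconds grow a 1000 ms window to 1400 ms, while switches every 20 seconds leave it fixed.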
Example two
Fig. 2 is a flowchart of a method for determining a speaker terminal in a video conference according to a second embodiment of the present invention. In this embodiment, on the basis of the first embodiment, the step of "extracting, in timestamp order, the audio packets contained in a set time window as target audio packets" is further optimized. Referring to fig. 2, the method may specifically include the following steps:
s210, acquiring the audio level of the audio packet from the terminal and the time stamp information of the audio packet.
And S220, determining frequency information according to the timestamp information, where the frequency information comprises the frequency at which the audio level of the audio packet and the timestamp information of the audio packet are acquired.
Specifically, the frequency information refers to the frequency at which the audio level and timestamp information of the audio packets are acquired. For example, if the terminal sends the audio level and timestamp of an audio packet every 20 milliseconds, the server acquires them every 20 milliseconds. Determining the frequency information from the timestamp information means that the difference between every two adjacent timestamps (e.g., T2 − T1) gives the sending interval. For example, if the audio packets are transmitted at regular intervals of 200 milliseconds, the difference between every two adjacent timestamps is 200 milliseconds, which corresponds to the sending frequency.
And S230, determining the audio packet contained in the set time window as a target audio packet according to the frequency information and the timestamp information.
Specifically, if the set time window is 1 second, the audio packets contained in that 1-second window are determined according to the frequency information and the timestamp information and taken as the target audio packets. In a specific example, when the target audio packets satisfying the set time window are determined starting from time T5, the determined target audio packets are those with timestamps T5, T6, T7, T8, and T9.
S240, setting corresponding weight values for the target audio packets according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value.
Optionally, the set rule includes: the weight value corresponding to the target audio packet increases as the timestamp information of the target audio packet increases.
To give the most recent audio packets larger weights, the weight value corresponding to a target audio packet is set to increase as the timestamp information of the target audio packet increases. For example, the weight values of L1, L2, L3, L4, L5, L6, and L7 are 0.5, 1, 1.8, 2, 2.9, 3.6, and 5, in that order.
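The set rule — weight growing with timestamp — can be sketched as a rank-based scheme. The linear mapping below is an assumption for illustration; the patent requires only that the weight increase monotonically with the timestamp:

```python
def weights_for(timestamps):
    """Assign weight values that increase with timestamp.

    The oldest packet gets weight 1 and the newest gets weight n, so a
    more recent target audio packet always receives a larger weight.
    """
    # Rank the packet indices by timestamp, oldest first.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    weights = [0.0] * len(timestamps)
    for rank, i in enumerate(order):
        weights[i] = rank + 1  # oldest -> 1, newest -> len(timestamps)
    return weights
```

For timestamps 100, 300, 200 the weights come out as 1, 3, 2: the packet at 300 is newest and weighted most.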
And S250, superposing the target audio level, and taking the superposed result as the audio level of the terminal.
And S260, determining that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal.
In the embodiment of the present invention, frequency information is determined according to the timestamp information, where the frequency information comprises the frequency at which the audio level and timestamp information of the audio packets are acquired, and the audio packets contained in the set time window are determined as target audio packets according to the frequency information and the timestamp information. Introducing the timestamp information makes the determination of the video conference speaker terminal more accurate.
In the technical scheme provided by this embodiment of the invention, a window counting method is adopted to count the audio level of each terminal, and the terminal with the largest audio level is determined to be the video conference speaker terminal. The advantage is that switching of the video conference speaker terminal is smoother. The length of the time window can be set by the server according to the application scene, so that the switching frequency of the video conference speaker terminal can be automatically adjusted to one that is comfortable for users. For example, when a terminal's environment is quiet, the speaker terminal switches infrequently, and a smaller time window is automatically selected so that switching is more responsive; when a terminal is in a noisy environment, the speaker terminal would otherwise switch frequently, so a larger time window is automatically selected to make switching more stable.
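The adaptive rule described above (and formalized in claim 4 with two frequency switching thresholds) can be sketched as follows; the concrete threshold values, the growth step, and the window cap are assumptions, since the patent leaves them to the application scene:

```python
def adjust_window(window, switch_freq, hi_thresh, lo_thresh,
                  step=0.5, max_window=5.0):
    """Adjust the set time window from the observed switching frequency
    of the speaker terminal. Frequent switching (noisy room) grows the
    window for stability; at or below the lower threshold the current
    window is kept fixed. step/max_window are illustrative assumptions."""
    if switch_freq > hi_thresh:
        # Switching too often: enlarge the window, up to a cap,
        # so the speaker terminal switches more smoothly.
        return min(window + step, max_window)
    if switch_freq <= lo_thresh:
        # Switching rarely enough: keep the set time window fixed.
        return window
    return window
```

For example, with `hi_thresh=5` and `lo_thresh=2` switches per minute, a terminal observed switching 10 times per minute would have its 1-second window grown to 1.5 seconds, while one switching once per minute keeps its window unchanged.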
EXAMPLE III
Fig. 3 is a schematic structural diagram of a device for determining a video conference speaker terminal according to the third embodiment of the present invention; the device is adapted to execute the method for determining a video conference speaker terminal provided by the foregoing embodiments. As shown in Fig. 3, the apparatus may specifically include:
an obtaining module 310, configured to obtain an audio level of an audio packet from a terminal and timestamp information of the audio packet;
a target audio packet extraction module 320, configured to extract the audio packets contained in a set time window as target audio packets, in order of the timestamp information;
the target audio level determining module 330 is configured to set a corresponding weight value for each target audio packet according to a set rule, and determine a target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value;
the superposition module 340 is configured to superpose the target audio level, and use a superposition result as the audio level of the terminal;
a speaker terminal determining module 350, configured to determine that the terminal corresponding to the largest audio level at the current moment is the video conference speaker terminal.
Further, the obtaining module 310 is specifically configured to:
determining a client side for sending the audio packet according to a client side identifier carried in the audio packet;
determining a terminal corresponding to the client side for sending the audio packet according to the corresponding relation between the client side and the terminal;
determining the audio level of the audio packet from the terminal and the timestamp information of the audio packet.
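The three determinations performed by the obtaining module can be sketched as a lookup; the dictionary-based packet layout, field names, and identifiers are assumptions, since the patent does not specify a packet format:

```python
def resolve_terminal(audio_packet, client_to_terminal):
    """Map the client identifier carried in the audio packet to its
    terminal via the client-terminal correspondence, then return that
    terminal together with the packet's audio level and timestamp."""
    client_id = audio_packet["client_id"]          # identifier carried in the packet
    terminal_id = client_to_terminal[client_id]    # client-to-terminal correspondence
    return terminal_id, audio_packet["level"], audio_packet["timestamp"]

# Hypothetical correspondence table and packet.
mapping = {"client-7": "room-A"}
pkt = {"client_id": "client-7", "level": 42.0, "timestamp": 3.1}
terminal, level, ts = resolve_terminal(pkt, mapping)
```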
Further, the target audio packet extraction module 320 is specifically configured to:
determining frequency information according to the timestamp information, wherein the frequency information comprises the frequency at which the audio level of the audio packet and the timestamp information of the audio packet are obtained;
and determining the audio packet contained in the set time window as a target audio packet according to the frequency information and the timestamp information.
Further, the apparatus also comprises:
a first time window adjusting module, configured to adjust the size of the set time window when detecting that the switching frequency of the video conference speaker terminal is greater than a first frequency switching threshold;
and a second time window adjusting module, configured to fix the set time window when detecting that the switching frequency of the video conference speaker terminal is less than or equal to a second frequency switching threshold.
Further, the set rules include: the weight value corresponding to the target audio packet increases as the timestamp information of the target audio packet increases.
The device for determining the video conference speaker terminal provided by this embodiment of the invention can execute the method for determining the video conference speaker terminal provided by any embodiment of the invention, and has the functional modules corresponding to the executed method and its beneficial effects.
Example four
Fig. 4 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. FIG. 4 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 4 is only one example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 4, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by executing programs stored in the system memory 28, for example, to implement the method for determining a speaker terminal in a video conference provided by the embodiment of the present invention:
that is, the processing unit implements, when executing the program: acquiring the audio level of an audio packet from a terminal and timestamp information of the audio packet; extracting audio packets contained in a set time window as target audio packets according to the time stamp information sequence; setting corresponding weight values for the target audio packets according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value; superposing the target audio level, and taking a superposition result as the audio level of the terminal; and determining that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal.
EXAMPLE five
The fifth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the method for determining a speaker terminal in a video conference as provided in the embodiments of this application:
that is, the program when executed by the processor implements: acquiring the audio level of an audio packet from a terminal and timestamp information of the audio packet; extracting audio packets contained in a set time window as target audio packets according to the time stamp information sequence; setting corresponding weight values for the target audio packets according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value; superposing the target audio level, and taking a superposition result as the audio level of the terminal; and determining that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for determining a speaker terminal in a video conference, comprising:
acquiring the audio level of an audio packet from a terminal and timestamp information of the audio packet;
extracting audio packets contained in a set time window as target audio packets according to the time stamp information sequence; the set time window is set according to an application scene;
setting corresponding weight values for the target audio packets according to a set rule, and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value;
superposing the target audio level, and taking a superposition result as the audio level of the terminal;
and determining that the terminal corresponding to the maximum audio level at the current moment is the video conference speaker terminal.
2. The method of claim 1, wherein obtaining the audio level of the audio packets and the time stamp information of the audio packets from the terminal comprises:
determining a client side for sending the audio packet according to a client side identifier carried in the audio packet;
determining a terminal corresponding to the client side for sending the audio packet according to the corresponding relation between the client side and the terminal;
an audio level of an audio packet from the terminal and time stamp information of the audio packet are determined.
3. The method according to claim 1, wherein extracting the audio packets contained in a set time window as target audio packets according to the timestamp information sequence comprises:
determining frequency information according to the timestamp information, wherein the frequency information comprises the frequency at which the audio level of the audio packet and the timestamp information of the audio packet are obtained;
and determining the audio packet contained in the set time window as a target audio packet according to the frequency information and the timestamp information.
4. The method of claim 1, further comprising:
adjusting the size of the set time window when detecting that the switching frequency of the video conference speaker terminal is greater than a first frequency switching threshold;
and fixing the set time window when detecting that the switching frequency of the video conference speaker terminal is less than or equal to a second frequency switching threshold.
5. The method according to any of claims 1-4, wherein the set rules comprise: the weight value corresponding to the target audio packet increases as the timestamp information of the target audio packet increases.
6. An apparatus for determining a speaker terminal in a video conference, comprising:
the acquisition module is used for acquiring the audio level of the audio packet from the terminal and the time stamp information of the audio packet;
the target audio packet extraction module is used for extracting the audio packets contained in the set time window as target audio packets according to the time stamp information sequence; the set time window is set according to an application scene;
the target audio level determining module is used for setting corresponding weight values for the target audio packets according to a set rule and determining the target audio level obtained by multiplying the audio level of each target audio packet by the corresponding weight value;
the superposition module is used for superposing the target audio level and taking a superposition result as the audio level of the terminal;
and the speaker terminal determining module is used for determining the terminal corresponding to the largest audio level at the current moment as the video conference speaker terminal.
7. The apparatus of claim 6, wherein the obtaining module is specifically configured to:
determining a client side for sending the audio packet according to a client side identifier carried in the audio packet;
determining a terminal corresponding to the client side for sending the audio packet according to the corresponding relation between the client side and the terminal;
an audio level of an audio packet from the terminal and time stamp information of the audio packet are determined.
8. The apparatus of claim 6, wherein the target audio packet extraction module is specifically configured to:
determining frequency information according to the timestamp information, wherein the frequency information comprises the frequency at which the audio level of the audio packet and the timestamp information of the audio packet are obtained;
and determining the audio packet contained in the set time window as a target audio packet according to the frequency information and the timestamp information.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-5 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN201810669467.3A 2018-06-26 2018-06-26 Method, device, equipment and storage medium for determining speaker terminal in video conference Active CN108924465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810669467.3A CN108924465B (en) 2018-06-26 2018-06-26 Method, device, equipment and storage medium for determining speaker terminal in video conference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810669467.3A CN108924465B (en) 2018-06-26 2018-06-26 Method, device, equipment and storage medium for determining speaker terminal in video conference

Publications (2)

Publication Number Publication Date
CN108924465A CN108924465A (en) 2018-11-30
CN108924465B true CN108924465B (en) 2021-02-09

Family

ID=64422555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810669467.3A Active CN108924465B (en) 2018-06-26 2018-06-26 Method, device, equipment and storage medium for determining speaker terminal in video conference

Country Status (1)

Country Link
CN (1) CN108924465B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109473117B (en) * 2018-12-18 2022-07-05 广州市百果园信息技术有限公司 Audio special effect superposition method and device and terminal thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007195136A (en) * 2005-12-20 2007-08-02 Nippon Telegr & Teleph Corp <Ntt> Video conference system, terminal used for the video conference system, processing method of terminal, and program thereof
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN103297743A (en) * 2012-03-05 2013-09-11 联想(北京)有限公司 Video conference display window adjusting method and video conference service equipment
CN104699447A (en) * 2015-03-12 2015-06-10 浙江万朋网络技术有限公司 Voice volume automatic adjustment method based on energy statistics
CN105657329A (en) * 2016-02-26 2016-06-08 苏州科达科技股份有限公司 Video conference system, processing device and video conference method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014059650A1 (en) * 2012-10-18 2014-04-24 华为终端有限公司 Method and apparatus for managing audio file
US9478233B2 (en) * 2013-03-14 2016-10-25 Polycom, Inc. Speech fragment detection for management of interaction in a remote conference
US20150256587A1 (en) * 2014-03-10 2015-09-10 JamKazam, Inc. Network Connection Servers And Related Methods For Interactive Music Systems
CN105592268A (en) * 2016-03-03 2016-05-18 苏州科达科技股份有限公司 Video conferencing system, processing device and video conferencing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Real-Time Multimedia Transmission in Heterogeneous Networks; Zhang Lizhuo; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2012-01-15; Section 6.1, pp. 87-93 *

Also Published As

Publication number Publication date
CN108924465A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
EP2901669B1 (en) Near-end indication that the end of speech is received by the far end in an audio or video conference
CN107396171A (en) Live network broadcast method, device and storage medium
CN104639777A (en) Conference control method, conference control device and conference system
CN113286184B (en) Lip synchronization method for respectively playing audio and video on different devices
US20190051147A1 (en) Remote control method, apparatus, terminal device, and computer readable storage medium
US10574713B2 (en) Self-adaptive sample period for content sharing in communication sessions
CN110225291B (en) Data transmission method and device and computer equipment
US9060033B2 (en) Generation and caching of content in anticipation of presenting content in web conferences
CN112218034A (en) Video processing method, system, terminal and storage medium
CN111385349B (en) Communication processing method, communication processing device, terminal, server and storage medium
CN108833825B (en) Method, device, equipment and storage medium for determining speaker terminal in video conference
US9912617B2 (en) Method and apparatus for voice communication based on voice activity detection
CN108924465B (en) Method, device, equipment and storage medium for determining speaker terminal in video conference
CN110113298A (en) Data transmission method, device, signal server and computer-readable medium
CN112839192A (en) Audio and video communication system and method based on browser
CN113572898B (en) Method and corresponding device for detecting silent abnormality in voice call
CN114242067A (en) Speech recognition method, apparatus, device and storage medium
CN111355919B (en) Communication session control method and device
US9485458B2 (en) Data processing method and device
CN110798700B (en) Video processing method, video processing device, storage medium and electronic equipment
US11557296B2 (en) Communication transfer between devices
US8782271B1 (en) Video mixing using video speech detection
CN111182256A (en) Information processing method and server
US10237402B1 (en) Management of communications between devices
CN114448957B (en) Audio data transmission method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant