CN112104836A - Audio mixing method, system, storage medium and equipment for audio server - Google Patents

Publication number
CN112104836A
Authority
CN
China
Prior art keywords
audio
mixing
audio data
data
state group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011316513.5A
Other languages
Chinese (zh)
Inventor
梁启东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
G Net Cloud Service Co Ltd
Original Assignee
G Net Cloud Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by G Net Cloud Service Co Ltd
Priority to CN202011316513.5A
Publication of CN112104836A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40: Support for services or applications
    • H04L65/403: Arrangements for multi-party communication, e.g. for conferences
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60: Network streaming of media packets
    • H04L65/75: Media network packet handling
    • H04L65/765: Media network packet handling intermediate
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/80: Responding to QoS

Abstract

The invention provides an audio mixing method, system, storage medium and equipment for an audio server, applied to a centralized audio mixing server. The audio mixing method comprises the following steps: receiving voice packets of participants from a plurality of audio channels, and decoding the voice packets into audio data; detecting the sound activity state of the audio data, and dividing the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group; mixing the audio data of the active state group to form a mixed audio data packet; and encoding the mixed audio data packet separately for each audio channel and sending it to the participants. The method, system, storage medium and equipment address the high bandwidth demand on the audio server and the variety and complexity of decoding formats, improve the voice effect, and reduce the computational load on the processor.

Description

Audio mixing method, system, storage medium and equipment for audio server
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a method, a system, a storage medium, and an apparatus for audio mixing in an audio server.
Background
With the development of technology and globalization, traditional face-to-face meetings can no longer meet practical needs. More and more enterprises and organizations adopt audio and video conferencing so that a meeting can be held without the participants being physically present, which greatly reduces costs for enterprises and individuals and improves the efficiency of real-time communication and, to a certain degree, of work. However, participants in a real-time audio and video conference usually join from different places; some of them speak, while others only listen and never speak. Each speaker needs to receive the sound of every other speaker, and each listening participant needs to receive the sound of all speakers. Consider a 6-person real-time audio and video conference in which 5 persons are speaking and 1 person is listening without speaking, as shown in fig. 2, where each arrow represents one audio stream. Each speaker needs to receive the sound of the other 4 speakers, and the listening participant needs to receive the sound of all 5 speakers. The server therefore handles 5 uplink audio streams and 25 downlink audio streams (5 × 4 for the speakers plus 5 for the listener), which places great pressure on the server.
Meanwhile, an existing audio server needs high bandwidth to forward multiple audio streams, which is costly in equipment and resources. In addition, the conference terminals of an existing audio server must support multiple decoding formats to handle the multiple audio streams, which adds considerable cost to terminal selection and hinders the adoption of audio and video conference systems.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide an audio server mixing method, system, storage medium and device, which are used to solve the problems of high bandwidth and complicated decoding formats of the audio server in the prior art.
To achieve the above object and other related objects, the present invention discloses an audio mixing method for an audio server, comprising:
receiving voice packets of participants from a plurality of audio channels, and decoding the voice packets into audio data;
detecting the sound activity state of the audio data, and dividing the audio data into an active state group and an inactive state group;
discarding the audio data of the inactive state group;
mixing the audio data of the active state group to form a mixed audio data packet;
and encoding the mixed audio data packet separately for each audio channel and sending it to the participants.
In this solution, receiving the voice packets of the participants includes: placing the received voice packets into a receive buffer, comparing timestamps, and discarding voice packets that arrive later than a preset time.
In this solution, mixing the audio data of the active state group includes:
when the active state group contains n or fewer channels of audio data, mixing all the audio data of the active state group;
when the active state group contains more than n channels of audio data, selecting the n channels of audio data with the highest sound energy for mixing;
wherein n is a preset natural number greater than 1.
In this solution, mixing the audio data of the active state group further includes:
dividing the audio data of the active state group into historical active voice and current active voice according to whether the corresponding audio channel participated in the previous mixing, wherein the historical active voice is the audio data of an audio channel that participated in the previous mixing, and the current active voice is the audio data of an audio channel that did not participate in the previous mixing;
when the active state group contains n or fewer channels of audio data, mixing the current active voice with fade-in processing;
when the active state group contains more than n channels of audio data, mixing the current active voice among the n channels with the highest sound energy with fade-in processing; mixing the historical active voice among the remaining audio data with fade-out processing; and discarding the current active voice among the remaining audio data without mixing it.
In this solution, fade-in processing means that the volume is gradually increased from an initial value to the normal value; fade-out processing means that the volume is gradually reduced from the normal value to zero.
In this solution, the audio server is a centralized mixing server, and the mixed audio data packet comprises voice frames with the same frame size and frequency.
In this solution, when the mixed audio data packet is encoded separately for each audio channel, the encoding formats of different audio channels are different.
To achieve the above and other related objects, the present invention also provides an audio server system, comprising:
a receiving module, configured to receive voice packets of participants from a plurality of audio channels and decode the voice packets into audio data;
a detection module, configured to detect the sound activity state of the audio data, divide the audio data into an active state group and an inactive state group, and discard the audio data of the inactive state group;
a mixing module, configured to mix the audio data of the active state group to form a mixed audio data packet;
and a sending module, configured to encode the mixed audio data packet separately for each audio channel and send it to the participants.
The mixed audio data packet comprises voice frames with the same frame size and frequency.
To achieve the above and other related objects, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the above-described audio server mixing method.
To achieve the above and other related objects, the present invention provides an apparatus comprising a memory and a processor, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory so that the apparatus performs the audio server mixing method.
As described above, the audio mixing method, system, storage medium and equipment for an audio server of the present invention select the audio data with high voice energy for mixing, so that the CPU load on the server does not grow with the number of speakers and the mixed sound does not become unclear when several speakers talk at once; the CPU load is reduced while the voice effect is improved. In addition, the encoding format can be specified when the audio data is sent to each participant, so a conference terminal does not need to support every decoding format, only the corresponding decoder. A terminal also does not need to receive the audio streams of all speakers, only the single audio stream mixed by the audio server, which correspondingly reduces the bandwidth requirement of the real-time audio and video conference.
Drawings
Fig. 1 is a flowchart illustrating the steps of an audio server mixing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram illustrating a connection between a terminal and a mixing server according to an embodiment of an audio server mixing method of the present invention;
Fig. 3 is a schematic diagram of a terminal interconnection structure of an audio server mixing method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an audio server mixing system according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a mixing flow of an audio server mixing method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, in an embodiment of the present invention, an audio server mixing method includes:
step S11, receiving the voice packet of the conferee from the multi-channel audio channel, and decoding the voice packet into audio data.
Specifically, during a conference the participants speak in turn or over one another, each raising their own points. Their voices are captured by sound pickup devices such as microphones. Because a voice signal is analog while transmission over the link requires a digital signal, the captured voice is digitized and sent over the link as the participants' voice packets. The audio server receives the voice packets and decodes them with a decoder into audio data, such as PCM data. PCM stands for Pulse Code Modulation, one of the coding schemes of digital communication. Its main process is to sample an analog signal such as voice or image at regular intervals to discretize it, round and quantize the sample values into discrete levels, and represent the amplitude of each sample pulse with a group of binary codes. In other words, PCM samples, quantizes and encodes the input analog signal so that its amplitude is represented by binary-coded numbers, and the receiving end restores the codes to the original analog signal. The A/D conversion of digital audio thus comprises three processes: sampling, quantizing and encoding.
Optionally, in this step, the received voice packets are placed in a receive buffer, their timestamps are compared, voice packets that arrive later than the preset time are discarded, and the voice packets remaining in the buffer are then decoded into audio data. This avoids garbled sound content caused by transmission delay.
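The three PCM stages named above (sampling, quantizing, encoding) can be illustrated with a minimal sketch. The function names, the 8 kHz sample rate and the 16-bit depth are assumptions chosen for illustration; the patent does not specify them.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, count):
    """Sampling: read a unit-amplitude sine wave at regular intervals."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(count)]

def pcm_encode(samples, bits=16):
    """Quantizing and encoding: map floats in [-1.0, 1.0] to signed integers."""
    max_level = 2 ** (bits - 1) - 1          # 32767 for 16-bit audio
    encoded = []
    for x in samples:
        x = max(-1.0, min(1.0, x))           # clip values outside the valid range
        encoded.append(int(round(x * max_level)))
    return encoded

# Hypothetical input: a 1 kHz tone sampled at 8 kHz.
analog = sample_sine(freq_hz=1000, sample_rate_hz=8000, count=8)
pcm = pcm_encode(analog)
```

Each integer in `pcm` is one quantized sample; a receiver would divide by the same maximum level to approximate the original amplitude.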
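A minimal sketch of the receive buffer described above, assuming that each packet carries a sender timestamp in milliseconds and that the server clock is comparable to it. The class names, field names and the `max_delay_ms` parameter are hypothetical; the patent only states that packets arriving later than a preset time are discarded.

```python
from dataclasses import dataclass, field

@dataclass
class VoicePacket:
    channel_id: int
    timestamp_ms: int      # capture time stamped by the sender (assumed field)
    payload: bytes

@dataclass
class ReceiveBuffer:
    max_delay_ms: int                          # the "preset time"
    packets: list = field(default_factory=list)

    def push(self, packet, now_ms):
        """Buffer the packet unless it arrived later than max_delay_ms."""
        if now_ms - packet.timestamp_ms > self.max_delay_ms:
            return False                       # too late: discard
        self.packets.append(packet)
        return True

    def drain(self):
        """Return buffered packets in timestamp order and empty the buffer."""
        ordered = sorted(self.packets, key=lambda p: p.timestamp_ms)
        self.packets = []
        return ordered
```

The packets returned by `drain()` would then be passed to the decoder for each channel.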
Step S12, detecting the sound active state in the audio data, and dividing the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group.
In a real-time audio and video conference, talk periods and silent periods alternate. Statistically, a speaker's talking time generally does not exceed half of the total conversation time. When a participant is not speaking, no voice processing is performed and only a packet marking silence is sent, which saves a large amount of network bandwidth and CPU computing resources. In this step, the audio data is divided according to its audio state, and noise and the audio content of some secondary participants are removed so that the audio content of the current main participants is transmitted effectively. Specifically, the multiple channels of audio data are grouped according to the result of sound activity detection into an active state group and an inactive state group. Audio data containing the speech of a current participant is placed in the active state group; audio data from audio channels in which no sound activity is detected, or whose sound quality is poor (for example noise, too low a volume, or too low a clarity), is placed in the inactive state group. The active state group is passed on to the subsequent mixing. The audio data of the inactive state group does not need to be transmitted, so it is discarded after grouping and does not take part in the subsequent mixing, which effectively improves transmission quality and sound accuracy.
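The grouping step can be sketched as follows. The patent does not say how sound activity is detected, so this example assumes a simple energy threshold on each decoded PCM frame; `frame_energy`, `split_by_activity` and the threshold value are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one PCM frame (a list of integer samples)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)

def split_by_activity(frames, threshold):
    """Group channels by activity.

    frames: dict mapping channel id -> PCM frame.
    Returns (active, inactive) dicts; the inactive group is meant to be
    discarded before mixing.
    """
    active, inactive = {}, {}
    for channel_id, frame in frames.items():
        if frame_energy(frame) >= threshold:
            active[channel_id] = frame
        else:
            inactive[channel_id] = frame
    return active, inactive

# Hypothetical frames: channel 1 is speaking, 2 is silent, 3 is faint noise.
frames = {1: [1000, -1000, 1000, -1000], 2: [0, 0, 0, 0], 3: [10, -10, 10, -10]}
active, inactive = split_by_activity(frames, threshold=10000.0)
```

A production detector would usually look at more than raw energy (for example spectral features or hangover time), which is outside what the text describes.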
Step S13, mixing the audio data of the active state group to form a mixed sound data packet;
Although several participants do not usually speak at the same time in a real-time audio and video conference, it can still happen because of network delay, the behavior of individual participants, and the actual course of the conference. If the voice data of all speakers is mixed when several participants speak simultaneously, the randomness of each voice stream makes the mixing result chaotic, no speaker can be heard clearly, and the user experience is very poor.
To this end, this step further comprises:
when the active state group contains n or fewer channels of audio data, mixing all the audio data of the active state group;
when the active state group contains more than n channels of audio data, selecting the n channels of audio data with the highest sound energy for mixing;
wherein n is a preset natural number greater than 1. Testing shows that when more than 3 channels take part in mixing, the mix sounds increasingly noisy as the number of channels grows; n is therefore optionally 3.
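The selection rule above can be sketched as follows, with n = 3 as in the text. Measuring energy as mean squared amplitude and mixing by sample-wise addition with 16-bit clipping are assumptions made for this example; the function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def energy(frame):
    """Mean squared amplitude of a PCM frame (assumed energy measure)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0

def select_for_mixing(active, n=3):
    """Return the channel ids to mix: all of them if there are n or fewer,
    otherwise the n channels with the highest energy."""
    if len(active) <= n:
        return sorted(active)
    ranked = sorted(active, key=lambda ch: energy(active[ch]), reverse=True)
    return sorted(ranked[:n])

def mix_frames(frames):
    """Add equal-length PCM frames sample by sample, clipping to 16 bits."""
    if not frames:
        return []
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

For example, with four active channels only the three loudest are passed to `mix_frames`, so the cost of mixing is bounded by n regardless of how many participants speak.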
Further, mixing is performed continuously during the conference. Because the speakers may change, the audio channels that take part in each round of mixing may differ. To make the sound transitions more natural and improve the participants' experience, the method further includes:
dividing the audio data of the active state group into historical active voice and current active voice according to whether the corresponding audio channel participated in the previous mixing, wherein the historical active voice is the audio data of an audio channel that participated in the previous mixing, and the current active voice is the audio data of an audio channel that did not participate in the previous mixing;
when the active state group contains n or fewer channels of audio data, performing fade-in processing on the current active voice;
when the active state group contains more than n channels of audio data, performing fade-in processing on the current active voice among the n channels with the highest sound energy; mixing the historical active voice among the remaining audio data with fade-out processing; and discarding the current active voice among the remaining audio data without mixing it.
Here n is a preset natural number greater than 1, such as 3.
Fade-in processing means that the volume is gradually increased from an initial value to the normal value; fade-out processing means that the volume is gradually reduced from the normal value to zero. The initial value is a preset value smaller than the normal value, for example 0 or one half of the normal value.
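The historical/current distinction and the fades can be sketched as follows. The linear gain ramp, the function names and the label strings are assumptions for illustration; the text only requires that the volume change gradually.

```python
def fade_gains(length, start, end):
    """Linear gain ramp from start to end over `length` samples (assumed shape)."""
    if length == 1:
        return [end]
    step = (end - start) / (length - 1)
    return [start + step * i for i in range(length)]

def apply_gain(frame, gains):
    """Scale each PCM sample by the matching gain."""
    return [int(round(s * g)) for s, g in zip(frame, gains)]

def plan_mix(active_energy, previous, n=3):
    """Decide how each active channel is treated in this round.

    active_energy: dict channel id -> energy of its current frame.
    previous: set of channel ids that took part in the previous mixing.
    Returns dict channel id -> 'normal', 'fade_in' or 'fade_out';
    channels missing from the result are discarded.
    """
    ranked = sorted(active_energy, key=active_energy.get, reverse=True)
    selected = set(ranked) if len(ranked) <= n else set(ranked[:n])
    plan = {}
    for ch in ranked:
        if ch in selected:
            # Historical voice keeps its volume; a newly mixed voice fades in.
            plan[ch] = "normal" if ch in previous else "fade_in"
        elif ch in previous:
            # Historical voice outside the n loudest: mixed while fading out.
            plan[ch] = "fade_out"
        # Current voice outside the n loudest is not mixed at all.
    return plan
```

A fade-in frame would then be `apply_gain(frame, fade_gains(len(frame), 0.0, 1.0))`, or with 0.5 as the start value if the initial volume is one half of normal; a fade-out uses the ramp from 1.0 to 0.0.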
Step S14, encoding the mixed audio data packet separately for each audio channel, and sending the encoded packets to the participants.
Specifically, the mixed audio data packet contains voice frames with the same frame size and frequency, and for each audio channel it may be encoded and transmitted using the encoding method corresponding to that channel. The encoding formats of different audio channels may be the same or different. The user terminal therefore does not need to support all decoders; it only needs to support the one decoding method corresponding to its audio channel.
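Per-channel encoding can be sketched with a small lookup from format name to encoder. The two formats below (16-bit little-endian PCM and a toy 8-bit reduction) are stand-ins chosen so the example runs with the standard library only; a real server would plug in actual codecs here, and all names are hypothetical.

```python
import struct

def encode_pcm16le(frame):
    """Pack samples as 16-bit little-endian signed integers."""
    return struct.pack("<%dh" % len(frame), *frame)

def encode_pcm8(frame):
    """Toy 8-bit format: keep only the top 8 bits of each 16-bit sample."""
    return struct.pack("%db" % len(frame), *[s >> 8 for s in frame])

# Hypothetical registry of encoders keyed by format name.
ENCODERS = {"pcm16le": encode_pcm16le, "pcm8": encode_pcm8}

def encode_for_channels(mixed, channel_formats):
    """Encode one mixed frame once per channel.

    channel_formats: dict channel id -> format name negotiated for that channel.
    Returns dict channel id -> encoded bytes.
    """
    return {ch: ENCODERS[fmt](mixed) for ch, fmt in channel_formats.items()}
```

Because every channel receives the same mixed frame in its own format, each terminal only has to ship the one decoder that matches its channel.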
To implement the audio server mixing method, as shown in fig. 4, this embodiment correspondingly provides an audio server system 40, which is applied to a centralized mixing server. As shown in fig. 3, the audio server system 40 mixes the audio streams, where each arrow represents one audio stream. Each speaker has 1 uplink audio stream and 1 downlink audio stream, and a participant who only listens has only 1 downlink audio stream. For the 6-person conference described above, the audio server therefore handles only 5 uplink audio streams and 6 downlink audio streams, which greatly reduces the bandwidth pressure on the server.
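The stream counts quoted in the Background (5 uplink, 25 downlink) and here (5 uplink, 6 downlink) follow from simple arithmetic, which the sketch below reproduces for any number of speakers and listeners. The function names are hypothetical.

```python
def forwarding_streams(speakers, listeners):
    """Server forwards every speaker's stream to every other participant."""
    uplink = speakers
    # Each speaker receives the other speakers; each listener receives all speakers.
    downlink = speakers * (speakers - 1) + listeners * speakers
    return uplink, downlink

def mixing_streams(speakers, listeners):
    """Server sends exactly one mixed stream to each participant."""
    uplink = speakers
    downlink = speakers + listeners
    return uplink, downlink
```

With 5 speakers and 1 listener, `forwarding_streams(5, 1)` gives 5 uplink and 25 downlink streams, while `mixing_streams(5, 1)` gives 5 uplink and 6 downlink streams. The forwarding downlink count grows quadratically with the number of speakers, whereas the mixing count grows linearly with the number of participants.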
The system 40 comprises:
a receiving module 41, configured to receive the voice packet of the conferee from the multiple audio channels, and decode the voice packet into audio data.
Optionally, the module places the received voice packets into a receive buffer, compares their timestamps, discards voice packets that arrive later than the preset time, and then decodes the voice packets remaining in the buffer into audio data. This avoids garbled sound content caused by transmission delay.
A detecting module 42, configured to detect a sound active state in the audio data, and divide the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group.
The module divides the audio data according to its audio state so that the audio content of the current main participants is transmitted effectively and noise and the audio content of some secondary participants are removed. Specifically, the multiple channels of audio data are grouped according to the result of sound activity detection into an active state group and an inactive state group. Audio data containing the speech of a current participant is placed in the active state group; audio data from audio channels in which no sound activity is detected, or whose sound quality is poor (for example noise, too low a volume, or too low a clarity), is placed in the inactive state group. The active state group is passed on to the subsequent mixing. The audio data of the inactive state group does not need to be transmitted, so it is discarded after grouping and does not take part in the subsequent mixing, which effectively improves transmission quality and sound accuracy.
And a mixing module 43, configured to mix audio data of the active state group to form a mixed audio data packet.
Specifically, mixing the audio data of the active state group includes:
when the active state group contains n or fewer channels of audio data, mixing all the audio data of the active state group;
when the active state group contains more than n channels of audio data, selecting the n channels of audio data with the highest sound energy for mixing;
where n is a preset natural number greater than 1, such as 3.
To make the sound transitions more natural and improve the participants' experience, mixing the audio data of the active state group further comprises:
dividing the audio data of the active state group into historical active voice and current active voice according to whether the corresponding audio channel participated in the previous mixing, wherein the historical active voice is the audio data of an audio channel that participated in the previous mixing, and the current active voice is the audio data of an audio channel that did not participate in the previous mixing;
when the active state group contains n or fewer channels of audio data, performing fade-in processing on the current active voice;
when the active state group contains more than n channels of audio data, performing fade-in processing on the current active voice among the n channels with the highest sound energy; mixing the historical active voice among the remaining audio data with fade-out processing; and discarding the current active voice among the remaining audio data without mixing it.
Here n is a preset natural number greater than 1, such as 3.
Fade-in processing means that the volume is gradually increased from an initial value to the normal value; fade-out processing means that the volume is gradually reduced from the normal value to zero. The initial value is a preset value smaller than the normal value, for example 0 or one half of the normal value.
A sending module 44, configured to encode the audio mixing data packets respectively corresponding to audio channels and send the encoded audio mixing data packets to the participants.
Specifically, the mixed audio data packet contains voice frames with the same frame size and frequency, and for each audio channel it may be encoded and transmitted using the encoding method corresponding to that channel. The encoding formats of different audio channels may be the same or different. The user terminal therefore does not need to support all decoders; it only needs to support the one decoding method corresponding to its audio channel.
Fig. 5 shows a specific embodiment of the audio server mixing method of the present invention. Voice activity detection is performed on the multiple channels of audio data obtained by decoding the voice packets of the multiple audio channels; the audio data is divided into an active state group and an inactive state group according to the detection result, and the audio data of the inactive state group is discarded.
It is then determined whether the active state group contains more than 3 channels of audio data. When it contains 3 or fewer channels, all the audio data of the active state group is mixed; when it contains more than 3 channels, the 3 channels of audio data with the highest sound energy are selected for mixing.
The audio data of the active state group is divided into historical active voice and current active voice according to whether the corresponding audio channel participated in the previous mixing, wherein the historical active voice is the audio data of an audio channel that participated in the previous mixing, and the current active voice is the audio data of an audio channel that did not participate in the previous mixing;
when the active state group contains 3 or fewer channels of audio data, the current active voice is mixed with fade-in processing and the historical active voice is mixed normally;
when the active state group contains more than 3 channels of audio data, the current active voice among the 3 channels with the highest sound energy is mixed with fade-in processing and the historical active voice among them is mixed normally; the historical active voice among the remaining audio data is mixed with fade-out processing, and the current active voice among the remaining audio data is discarded without being mixed.
Finally, the mixed audio is encoded and transmitted for each audio channel using the encoding method corresponding to that audio channel.
In addition, the present invention provides a readable storage medium on which a computer program is stored; when executed by a processor, the program implements any of the audio server mixing methods described above.
In addition, the present invention further provides an apparatus comprising a memory and a processor connected via a system bus, wherein the memory is used for storing a computer program and the processor is used for executing the computer program stored in the memory, so that the apparatus performs any of the audio server mixing methods described above; each part of the system is defined, invoked, computed and stored by a program configured in the apparatus.
The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The memory may include a random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, as shown in fig. 5, the present invention does not linearly superimpose the data of all speakers; it selects only the data of the speakers with high voice energy for mixing. The CPU load therefore does not grow with the number of speakers, and the mixed sound does not become unclear when several speakers talk at once, so the CPU load is reduced while the voice effect is improved. Meanwhile, the encoding format can be specified when the audio stream is sent to each participant, so a conference terminal does not need to support every decoding format, only the corresponding decoder. This greatly improves adaptability and makes the method applicable to different mobile terminals. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence, or the part that contributes over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a removable storage device, a ROM, a RAM, a magnetic disk, an optical disk, and various other media that can store program code.
The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the appended claims.

Claims (10)

1. An audio server mixing method, comprising:
receiving voice packets of participants from a plurality of audio channels, and decoding the voice packets into audio data;
detecting the sound active state in the audio data, and dividing the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group;
mixing the audio data of the active state group to form an audio mixing data packet;
and encoding the audio mixing data packet separately for each corresponding audio channel and sending the encoded packets to the participants.
2. The audio server mixing method according to claim 1, wherein receiving the voice packets of the participants comprises: placing the received voice packets into a receiving buffer, comparing their timestamps, and discarding any voice packet that arrives later than a preset time.
3. The audio server mixing method according to claim 1, wherein mixing the audio data of the active state group comprises:
when the audio data of the active state group comprises n or fewer streams, mixing all the audio data of the active state group;
when the audio data of the active state group comprises more than n streams, selecting the n streams of audio data with the greatest sound energy for mixing;
wherein n is a preset natural number greater than 1.
4. The audio server mixing method according to claim 3, wherein mixing the audio data of the active state group further comprises:
dividing the audio data of the active state group into historical active voice and current active voice according to whether the audio channel corresponding to the audio data participated in the previous round of mixing, wherein the historical active voice is the audio data of an audio channel that participated in the previous round of mixing, and the current active voice is the audio data of an audio channel that did not participate in the previous round of mixing;
when the audio data of the active state group comprises n or fewer streams, mixing the current active voice with fade-in processing; when the audio data of the active state group comprises more than n streams, mixing the current active voice among the n streams of audio data with the greatest sound energy with fade-in processing, mixing the historical active voice among the remaining audio data with fade-out processing, and discarding the current active voice among the remaining audio data without mixing it.
5. The audio server mixing method according to claim 4, wherein the fade-in process is a process in which a volume is increased stepwise from an initial value to a normal value; the fade-out process is a process in which the volume is gradually reduced from a normal value to zero.
6. The audio server mixing method according to claim 5, wherein the audio server is a centralized mixing server, and the mixing data packet includes voice frames with the same frame size and frequency.
7. The audio server mixing method according to claim 1, wherein when the mixing packets are encoded corresponding to audio channels, encoding formats of different audio channels are different.
8. An audio server system, comprising:
a receiving module, configured to receive voice packets of participants from a plurality of audio channels and decode the voice packets into audio data;
a detection module, configured to detect the sound active state in the audio data, divide the audio data into an active state group and an inactive state group, and discard the audio data of the inactive state group;
a mixing module, configured to mix the audio data of the active state group to form an audio mixing data packet;
and a sending module, configured to encode the audio mixing data packet separately for each corresponding audio channel and send the encoded packets to the participants.
9. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the audio server mixing method of any one of claims 1 to 7.
10. A mixing apparatus, characterized in that the mixing apparatus comprises a memory for storing a computer program and a processor for executing the computer program stored by the memory to cause the apparatus to perform the audio server mixing method according to any one of claims 1 to 7.
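
The fade-in and fade-out handling recited in claims 4 and 5 can be sketched as follows. This is one possible reading, not the claimed implementation: the function names `plan_mix` and `ramp`, the labels "steady", "fade_in" and "fade_out", the linear gain ramp applied within a single frame, and the treatment of historical voices inside the top n as unmodified ("steady") are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def plan_mix(energy: Dict[str, float], previous: Set[str], n: int) -> Dict[str, str]:
    """Decide how each active channel enters the current mixing round.

    energy:   sound energy per active channel id (inactive channels already dropped)
    previous: channel ids that took part in the previous mixing round
    n:        preset maximum number of channels mixed at normal volume
    """
    ranked = sorted(energy, key=lambda cid: energy[cid], reverse=True)
    top, rest = ranked[:n], ranked[n:]
    plan: Dict[str, str] = {}
    for cid in top:
        # Channels new to the mix fade in; assumed: returning ones stay as they are.
        plan[cid] = "steady" if cid in previous else "fade_in"
    for cid in rest:
        if cid in previous:
            # A channel mixed last round but outside the top n now fades out.
            plan[cid] = "fade_out"
        # Channels that are new and outside the top n are not mixed at all.
    return plan


def ramp(frame: List[int], start: float, end: float) -> List[int]:
    """Scale a frame with a gain moving linearly from start to end (assumed shape)."""
    count = len(frame)
    if count == 0:
        return []
    step = (end - start) / count
    return [int(s * (start + step * (i + 1))) for i, s in enumerate(frame)]
```

In this sketch a fade-in would be `ramp(frame, 0.0, 1.0)` and a fade-out `ramp(frame, 1.0, 0.0)`. A real system might spread the ramp over several frames, since claim 5 only requires that the volume rise stepwise to its normal value or fall gradually to zero.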
CN202011316513.5A 2020-11-23 2020-11-23 Audio mixing method, system, storage medium and equipment for audio server Pending CN112104836A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011316513.5A CN112104836A (en) 2020-11-23 2020-11-23 Audio mixing method, system, storage medium and equipment for audio server

Publications (1)

Publication Number Publication Date
CN112104836A true CN112104836A (en) 2020-12-18

Family

ID=73785505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011316513.5A Pending CN112104836A (en) 2020-11-23 2020-11-23 Audio mixing method, system, storage medium and equipment for audio server

Country Status (1)

Country Link
CN (1) CN112104836A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101252452A (en) * 2007-03-31 2008-08-27 红杉树(杭州)信息技术有限公司 Distributed type tone mixing system in multimedia conference
CN102436818A (en) * 2011-10-25 2012-05-02 浙江万朋网络技术有限公司 Routing and overdubbing method for server end based on priority of energy
CN105743911A (en) * 2016-03-30 2016-07-06 武汉随锐亿山科技有限公司 Method for improving mixing capacity of video session system
CN107800902A (en) * 2017-09-15 2018-03-13 北京容联易通信息技术有限公司 The sound mixing method and system of multi-path voice
CN109901811A (en) * 2019-02-26 2019-06-18 北京华夏电通科技有限公司 Applied to the sound mixing method and device in digitlization court's trial
CN109920445A (en) * 2019-03-04 2019-06-21 北京佳讯飞鸿电气股份有限公司 A kind of sound mixing method, device and equipment

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112751849B (en) * 2020-12-28 2024-04-09 视联动力信息技术股份有限公司 Conference control method, device, equipment and medium based on video networking
CN112751849A (en) * 2020-12-28 2021-05-04 视联动力信息技术股份有限公司 Conference control method, device, equipment and medium based on video networking
CN113114688A (en) * 2021-04-15 2021-07-13 杭州朗和科技有限公司 Multimedia conference management method and device, storage medium and electronic equipment
CN114974273A (en) * 2021-08-10 2022-08-30 中移互联网有限公司 Conference audio mixing method and device
CN114974273B (en) * 2021-08-10 2023-08-15 中移互联网有限公司 Conference audio mixing method and device
CN113824726A (en) * 2021-09-26 2021-12-21 北京沃东天骏信息技术有限公司 Online conference method, device and system
CN113824726B (en) * 2021-09-26 2023-09-05 北京沃东天骏信息技术有限公司 Online conference method, device and system
CN113727061A (en) * 2021-11-01 2021-11-30 江苏怀业信息技术股份有限公司 Audio and video synchronization method for video conference system
CN114173011A (en) * 2021-11-29 2022-03-11 河北远东通信系统工程有限公司 Audio mixing control method for collaborative command media engine
CN114173011B (en) * 2021-11-29 2024-03-19 河北远东通信系统工程有限公司 Mixing control method for collaborative command media engine
CN114512139A (en) * 2022-04-18 2022-05-17 杭州星犀科技有限公司 Processing method and system for multi-channel audio mixing, mixing processor and storage medium
CN116471263A (en) * 2023-05-12 2023-07-21 杭州全能数字科技有限公司 Real-time audio routing method for video system
CN116471263B (en) * 2023-05-12 2024-02-13 杭州全能数字科技有限公司 Real-time audio routing method for video system

Similar Documents

Publication Publication Date Title
CN112104836A (en) Audio mixing method, system, storage medium and equipment for audio server
US20140205115A1 (en) Audio Mixing Method, Apparatus and System
EP1360798B1 (en) Control unit for multipoint multimedia/audio conference
EP2786552B1 (en) Method to select active channels in audio mixing for multi-party teleconferencing
US8787547B2 (en) Selective audio combination for a conference
US8340269B2 (en) Method, device, and system for processing multi-channel audio signals
CN110995946B (en) Sound mixing method, device, equipment, system and readable storage medium
KR20080065236A (en) Multimedia conferencing method and signal
CN110070878B (en) Decoding method of audio code stream and electronic equipment
US20110091029A1 (en) Distributed multi-party conferencing system
CN102915736A (en) Sound mixing processing method and system
US8515039B2 (en) Method for carrying out a voice conference and voice conference system
EP2158753B1 (en) Selection of audio signals to be mixed in an audio conference
EP2572499A1 (en) Encoder adaption in teleconferencing system
CN1845573A (en) Simultaneous interpretation video conference system and method for supporting high capacity mixed sound
US11800017B1 (en) Encoding a subset of audio input for broadcasting conferenced communications
US20030174657A1 (en) Method, system and computer program product for voice active packet switching for IP based audio conferencing
US7058026B1 (en) Internet teleconferencing
US10375131B2 (en) Selectively transforming audio streams based on audio energy estimate
CN111951821B (en) Communication method and device
TW515191B (en) Voice code switching method, voice code switching device and voice communication terminal
Chodrow et al. Design and implementation of a multicast audio conferencing tool for a collaborative computing framework
EP4037339A1 (en) Selecton of audio channels based on prioritization
EP0982921A1 (en) Technique for effective distribution of audio signals in a teleconference
CN116156099A (en) Network transmission method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201218