CN112104836A

CN112104836A - Audio mixing method, system, storage medium and equipment for audio server

Info

Publication number: CN112104836A
Application number: CN202011316513.5A
Authority: CN
Inventors: 梁启东
Original assignee: G Net Cloud Service Co Ltd
Current assignee: G Net Cloud Service Co Ltd
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2020-12-18

Abstract

The invention provides an audio mixing method, system, storage medium and device for an audio server, applied to a centralized audio mixing server. The mixing method comprises the following steps: receiving voice packets of participants from a plurality of audio channels and decoding the voice packets into audio data; detecting the sound activity state in the audio data and dividing the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group; mixing the audio data of the active state group to form a mixed audio data packet; and encoding the mixed audio data packet for each corresponding audio channel and sending it to the participants. The method, system, storage medium and device address the audio server's high bandwidth demand and the variety and complexity of decoding formats, improve voice quality, and reduce the computational load on the processor.

Description

Audio mixing method, system, storage medium and equipment for audio server

Technical Field

The present invention relates to the field of audio processing technologies, and in particular, to a method, a system, a storage medium, and an apparatus for audio mixing in an audio server.

Background

With the development of science and technology and increasing globalization, traditional face-to-face meetings can no longer meet practical needs. More and more large enterprises and organizations adopt audio and video conferences so that meetings can be held without the participants being physically present. This greatly reduces costs for enterprises and individuals, improves the efficiency of real-time communication, and improves working efficiency to a certain degree. In a real-time audio and video conference, however, participants generally join from different locations; some speak, while others only listen and never speak. Each speaker needs to receive the sound of all other speakers, and each listener needs to receive the sound of all speakers. Consider a 6-person real-time audio and video conference in which 5 persons are speaking and 1 person is only listening, as shown in fig. 2, where each arrow represents one audio stream. Each speaker needs to receive the sound of the other 4 speakers, and the listener needs to receive the sound of all 5 speakers. For just 5 speakers and 1 listener, the server therefore handles 5 uplink audio streams and 25 downlink audio streams (5 × 4 + 5), which places great pressure on the server.
Meanwhile, an existing audio server needs high bandwidth to forward multiple audio streams, which is a large investment in equipment and resources. In addition, the conference terminals served by an existing audio server must support the decoding formats of all the audio streams they receive, which adds considerable cost to terminal selection and hinders the popularization and application of audio and video conference systems.

Disclosure of Invention

In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide an audio server mixing method, system, storage medium and device, which are used to solve the problems of high bandwidth and complicated decoding formats of the audio server in the prior art.

In order to achieve the above objects and other related objects, the present invention discloses an audio mixing method for an audio server, comprising:

receiving voice packets of participants from a plurality of audio channels, and decoding the voice packets into audio data;

detecting sound active states in the audio data, and dividing the audio data into an active state group and an inactive state group;

discarding the audio data of the inactive state group;

mixing the audio data of the active state group to form a mixed sound data packet;

and coding the audio mixing data packets corresponding to audio channels respectively and sending the audio mixing data packets to the participants.

In this scheme, receiving the voice packets of the participants includes: placing the received voice packets into a receiving buffer, comparing their timestamps, and discarding any voice packet that arrives later than a preset time.

In this scheme, mixing the audio data of the active state group includes:

when the audio data of the active state group is less than or equal to n paths, mixing all the audio data of the active state group;

when the audio data of the active state group is larger than n paths, selecting n paths of audio data with larger sound energy for sound mixing;

wherein n is a preset natural number greater than 1.

In this scheme, mixing audio data of the active state group further includes:

dividing the audio data of the active state group into historical active voice and current active voice according to whether the audio channel corresponding to the audio data participates in last audio mixing, wherein the historical active voice is the audio data of the audio channel participating in last audio mixing, and the current active voice is the audio data of the audio channel not participating in last audio mixing;

when the audio data of the active state group is less than or equal to n paths, carrying out sound mixing on the current active voice and carrying out fade-in processing;

when the audio data of the active state group is larger than n paths, mixing the current active voice among the n paths of audio data with larger sound energy and performing fade-in processing; mixing the historical active voice among the remaining audio data and performing fade-out processing; and discarding the current active voice among the remaining audio data without mixing it.

In the scheme, the fade-in processing means that the volume is gradually increased from an initial value to a normal value; the fade-out process is a process in which the volume is gradually reduced from a normal value to zero.

In the scheme, the audio server is a centralized sound mixing server, and the sound mixing data packet comprises voice frames with the same frame size and frequency.

In the scheme, when the audio mixing data packets are coded respectively corresponding to the audio channels, the coding formats of different audio channels are different.

To achieve the above and other related objects, the present invention also provides an audio server system, comprising:

the receiving module receives the voice packet of the participant from the multi-path audio channel and decodes the voice packet into audio data;

the detection module is used for detecting the sound active state in the audio data, dividing the audio data into an active state group and an inactive state group, and discarding the audio data of the inactive state group;

the audio mixing module is used for mixing audio data of the active state group to form an audio mixing data packet;

and the sending module is used for coding the audio mixing data packets respectively corresponding to audio channels and sending the audio mixing data packets to the participants.

The mixed voice data packet comprises voice frames with the same frame size and frequency.

To achieve the above and other related objects, the present invention provides a computer-readable storage medium as described above, on which a computer program is stored, which when executed by a processor, implements the above-described audio server mixing method.

To achieve the above and other related objects, the present invention provides an apparatus as described above, comprising: the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the apparatus to perform the audio server mixing method.

As described above, the audio server mixing method, system, storage medium and device of the present invention select only the audio data with larger voice energy for mixing. The amount of CPU computation in the server therefore does not grow with the number of speakers, and the mixed sound does not become unclear when several speakers talk at once; CPU computation is reduced while the voice quality is improved. In addition, a corresponding encoding format can be specified when the mixed audio is sent to each participant, so a conference terminal does not need to support every decoding format, only the corresponding decoder. A terminal also does not need to receive the audio streams of all speakers, only the single audio stream mixed by the audio server, which reduces the bandwidth requirement of the real-time audio and video conference accordingly.

Drawings

Fig. 1 is a flowchart illustrating steps of an audio server mixing method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram illustrating a connection between a terminal and a mixing server according to an embodiment of an audio server mixing method of the present invention;

Fig. 3 is a schematic diagram of a terminal interconnection structure of an audio server mixing method according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an audio server mixing system according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a mixing flow of an audio server mixing method according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

Referring to fig. 1, in an embodiment of the present invention, an audio server mixing method includes:

Step S11, receiving the voice packets of the participants from the multiple audio channels, and decoding the voice packets into audio data.

Specifically, during a conference the participants each express their own views. Their voices are captured by sound pickup devices such as microphones and formed into voice packets, which are transmitted over the link and received by the audio server. Because voice is an analog signal while transmission over the link requires a digital signal, the voice is carried as encoded voice packets, and a decoder decodes them into audio data, such as PCM data. PCM stands for Pulse Code Modulation, one of the coding methods of digital communication. Its main process is to sample an analog signal such as voice or image at regular intervals to discretize it, round the sample values to the nearest quantization level, and express the amplitude of each sample pulse as a group of binary codes; the receiving end restores the codes to the original analog signal. In other words, the A/D conversion of digital audio comprises three processes: sampling, quantizing and encoding.
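The three A/D stages named above (sampling, quantizing, encoding) can be illustrated with a minimal Python sketch. The 8 kHz sampling rate, the 16-bit sample width, and the function names `sample_sine` and `pcm_encode` are assumptions chosen for illustration; the source does not specify them.

```python
import math

def sample_sine(freq_hz, rate_hz, count):
    """Sampling: evaluate a sine tone (standing in for the analog signal)
    at regular intervals of 1 / rate_hz seconds."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(count)]

def pcm_encode(samples, bits=16):
    """Quantizing and encoding: map amplitudes in [-1.0, 1.0] to signed
    integer codes of the given width (16 bits is an assumed width)."""
    max_code = 2 ** (bits - 1) - 1
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))              # clip out-of-range input
        codes.append(int(round(x * max_code)))  # round to the nearest level
    return codes

# One millisecond of a 1 kHz tone sampled at an assumed 8 kHz rate.
pcm_frame = pcm_encode(sample_sine(1000, 8000, 8))
```

Each integer in `pcm_frame` is one quantized sample; a real system would additionally pack these codes into bytes before transmission.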

Optionally, in this step, the received voice packets are placed in a receiving buffer and their timestamps are compared; voice packets that arrive later than a preset time are discarded, and the remaining voice packets in the buffer are then decoded into audio data. This avoids garbled sound content caused by transmission delay.
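A minimal sketch of such a receiving buffer follows. The class and field names (`VoicePacket`, `ReceiveBuffer`, `timestamp_ms`), the 200 ms default for the preset time, and the assumption that sender and server timestamps are directly comparable are all hypothetical; the source only states that timestamps are compared and late packets are discarded.

```python
from dataclasses import dataclass, field

@dataclass
class VoicePacket:
    channel_id: int
    timestamp_ms: int      # capture time carried by the packet (assumed field)
    payload: bytes

@dataclass
class ReceiveBuffer:
    max_delay_ms: int = 200                       # assumed preset time
    packets: list = field(default_factory=list)

    def push(self, packet, now_ms):
        """Buffer the packet unless it arrived later than the preset time."""
        if now_ms - packet.timestamp_ms > self.max_delay_ms:
            return False                          # too late: discard
        self.packets.append(packet)
        return True

    def drain(self):
        """Return the buffered packets in timestamp order and empty the buffer."""
        ordered = sorted(self.packets, key=lambda p: p.timestamp_ms)
        self.packets = []
        return ordered
```

For example, with `max_delay_ms=100` and a current time of 1050 ms, a packet stamped 900 ms is rejected while packets stamped 980 ms and 1000 ms are kept and later drained in that order for decoding.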

Step S12, detecting the sound active state in the audio data, and dividing the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group.

In a real-time audio and video conference, talk spurts and silent periods alternate. Statistically, a speaker's talking time generally does not exceed half of the total conversation time. When a speaker is not talking, no voice processing is performed and only a packet marking silence is sent, which saves a large amount of network bandwidth and CPU computing resources. In this step, the audio data is divided according to its audio state, and noise and the audio of secondary participants are removed so that the audio of the current main participants is transmitted effectively. Specifically, the multiple paths of audio data are grouped according to the result of sound activity detection into an active state group and an inactive state group. Audio data containing the speech of a current participant is placed in the active state group; audio data from channels in which no sound activity is detected, or whose sound quality is poor (for example noise, too low a volume, or too low a clarity), is placed in the inactive state group. The active state group is passed on to wait for mixing. Because the content of the inactive state group does not need to be transmitted, its audio data is discarded after the division and does not take part in subsequent mixing, which effectively improves the transmission quality and accuracy of the sound.
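The grouping can be sketched as follows. The source does not say how sound activity is detected, so this sketch substitutes a simple mean-energy threshold as a stand-in detector; the threshold value and the names `frame_energy` and `split_by_activity` are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one PCM frame (a list of integer samples)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)

def split_by_activity(frames_by_channel, threshold=1.0e5):
    """Divide channels into an active group and an inactive group.

    A channel counts as active when its frame energy reaches the assumed
    threshold; a production detector would also look at noise and clarity.
    """
    active, inactive = {}, {}
    for channel_id, frame in frames_by_channel.items():
        if frame_energy(frame) >= threshold:
            active[channel_id] = frame
        else:
            inactive[channel_id] = frame
    return active, inactive

frames = {
    1: [1000, -1000, 1000, -1000],   # clearly audible
    2: [0, 0, 0, 0],                 # silence
    3: [10, -10, 10, -10],           # too quiet
}
active_group, inactive_group = split_by_activity(frames)
```

Only `active_group` is passed on to mixing; `inactive_group` is simply dropped.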

Step S13, mixing the audio data of the active state group to form a mixed sound data packet;

In a real-time audio and video conference, several speakers do not usually talk at the same time. However, because of network delay, the behavior of individual participants, and the practical circumstances of a conference, it does happen. If the voice data of all speakers were mixed whenever several participants talk simultaneously, the randomness of each voice path would make the mixing result chaotic, no speaker could be heard clearly, and the user experience would be very poor.

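To this end, the step further comprises:

when the audio data of the active state group is less than or equal to n paths, mixing all the audio data of the active state group;

when the audio data of the active state group is larger than n paths, selecting n paths of audio data with larger sound energy for sound mixing;

wherein n is a preset natural number greater than 1. Testing shows that when more than 3 paths participate in mixing, the sense of noise becomes more pronounced as the number of paths increases; n is therefore optionally 3.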
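The selection rule stated in the disclosure (mix every active path when there are at most n, otherwise only the n paths with larger sound energy) can be sketched as below. Sum of squared samples as the energy measure and sample-wise addition with 16-bit clipping as the mixing operation are assumptions; the source names neither. The function names are hypothetical.

```python
def energy(frame):
    """Sum of squared samples, used here as an assumed energy measure."""
    return sum(s * s for s in frame)

def select_for_mix(active, n=3):
    """Return the channel ids to mix from the active group.

    active maps channel id -> PCM frame. With n or fewer active channels
    all are kept; otherwise only the n channels with the largest energy.
    """
    if len(active) <= n:
        return sorted(active)
    ranked = sorted(active, key=lambda cid: energy(active[cid]), reverse=True)
    return sorted(ranked[:n])

def mix(frames):
    """Add equal-length frames sample by sample and clip to 16-bit range."""
    low, high = -32768, 32767
    return [max(low, min(high, sum(column))) for column in zip(*frames)]

active = {1: [100] * 4, 2: [5000] * 4, 3: [300] * 4, 4: [2000] * 4}
chosen = select_for_mix(active, n=3)           # channel 1 is the quietest
mixed = mix([active[cid] for cid in chosen])
```

Because at most n frames are ever summed, the per-cycle mixing cost stays bounded regardless of how many participants are talking.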

Further, mixing is performed continuously during the conference. Because the speakers may change, the audio channels participating in each round of mixing may differ. To make the sound transition more natural and improve the participants' experience, the method further includes:

dividing the audio data of the active state group into historical active voice and current active voice according to whether the corresponding audio channel participated in the last mixing, wherein the historical active voice is the audio data of an audio channel that participated in the last mixing, and the current active voice is the audio data of an audio channel that did not participate in the last mixing;

when the audio data of the active state group is less than or equal to n paths, mixing the current active voice and performing fade-in processing;

when the audio data of the active state group is larger than n paths, mixing the current active voice among the n paths of audio data with larger sound energy and performing fade-in processing; mixing the historical active voice among the remaining audio data and performing fade-out processing; and discarding the current active voice among the remaining audio data without mixing it.

Where n is a predetermined natural number greater than 1, such as 3.
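The resulting per-channel decision can be written as a small planning function. This is a sketch under assumptions: the function name `plan_mix`, the action labels, and representing each active channel by a single energy value are all hypothetical, and historical active voice inside the selected set is treated as mixed normally, as the fig. 5 embodiment describes.

```python
def plan_mix(energies, last_mixed, n=3):
    """Decide how each active channel is handled in this mixing round.

    energies:   {channel_id: energy} for the active state group
    last_mixed: set of channel ids that took part in the previous mix
    Returns {channel_id: action}, where action is one of
    "normal", "fade_in", "fade_out" or "discard".
    """
    if len(energies) <= n:
        selected = set(energies)
    else:
        ranked = sorted(energies, key=energies.get, reverse=True)
        selected = set(ranked[:n])

    plan = {}
    for cid in energies:
        is_history = cid in last_mixed        # historical active voice
        if cid in selected:
            plan[cid] = "normal" if is_history else "fade_in"
        elif is_history:
            plan[cid] = "fade_out"            # leaving the mix gradually
        else:
            plan[cid] = "discard"             # never entered the mix
    return plan

plan = plan_mix({1: 9.0, 2: 8.0, 3: 7.0, 4: 6.0, 5: 1.0}, last_mixed={1, 4})
```

In this example channels 1, 2 and 3 have the largest energy, so channel 1 continues normally, channels 2 and 3 fade in, channel 4 (mixed last time but no longer selected) fades out, and channel 5 is discarded.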

Fade-in processing means that the volume is gradually increased from an initial value to the normal value; fade-out processing means that the volume is gradually reduced from the normal value to zero. The initial value is a preset value smaller than the normal value, for example 0 or one half of the normal value.
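A minimal sketch of both ramps, assuming a linear gain change applied across a single frame; the source says only that the volume changes gradually, so the ramp shape, its length, and the function names are assumptions.

```python
def fade_in(frame, initial_gain=0.0):
    """Scale samples with a gain rising linearly from initial_gain to 1.0."""
    count = len(frame)
    if count < 2:
        return list(frame)
    step = (1.0 - initial_gain) / (count - 1)
    return [int(round(s * (initial_gain + step * i))) for i, s in enumerate(frame)]

def fade_out(frame):
    """Scale samples with a gain falling linearly from 1.0 to 0.0."""
    count = len(frame)
    if count < 2:
        return [0] * count
    return [int(round(s * (1.0 - i / (count - 1)))) for i, s in enumerate(frame)]

frame = [1000] * 5
from_zero = fade_in(frame)                     # initial value 0
from_half = fade_in(frame, initial_gain=0.5)   # initial value one half
to_zero = fade_out(frame)
```

Starting the fade-in at one half rather than 0 shortens the audible ramp, which may suit cases where clipping the start of a new speaker's words is a concern.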

Step S14, encoding the mixing data packets for the corresponding audio channels respectively, and sending the encoded mixing data packets to the participants.

Specifically, the audio mixing data packet includes speech frames with the same frame size and frequency, and for different audio channels, the audio mixing data packet may be encoded and transmitted by using an encoding method corresponding to the audio channel. The coding formats of the different audio channels may be the same or different. Therefore, the user terminal does not need to support all decoders, and only needs to support one decoding mode corresponding to the audio channel.
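The per-channel encoding step can be sketched as a lookup from each channel to its own encoder. The two "formats" below are only raw 16-bit little-endian and big-endian packings that stand in for real speech codecs, and the names `ENCODERS` and `encode_for_channels` are hypothetical.

```python
import struct

def encode_s16le(frame):
    """Stand-in encoder: pack samples as 16-bit little-endian integers."""
    return struct.pack("<%dh" % len(frame), *frame)

def encode_s16be(frame):
    """Stand-in encoder: pack samples as 16-bit big-endian integers."""
    return struct.pack(">%dh" % len(frame), *frame)

# Hypothetical registry; a real server would register actual codecs here.
ENCODERS = {"s16le": encode_s16le, "s16be": encode_s16be}

def encode_for_channels(mixed_frame, channel_formats):
    """Encode a mixed frame once per channel, using that channel's format."""
    return {
        channel_id: ENCODERS[fmt](mixed_frame)
        for channel_id, fmt in channel_formats.items()
    }

packets = encode_for_channels([1, -2], {10: "s16le", 11: "s16be"})
```

Each terminal then needs only the decoder matching the format negotiated for its own channel.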

To implement the audio server mixing method, this embodiment correspondingly provides an audio server system 40, shown in fig. 4, which is applied to a centralized mixing server. As shown in fig. 3, the audio server system 40 mixes the audio streams, where each arrow represents one audio stream. Each speaker has 1 uplink audio stream and 1 downlink audio stream, and a participant who only listens has only 1 downlink audio stream. The audio server therefore handles only 5 uplink audio streams and 6 downlink audio streams, which greatly reduces its bandwidth pressure.
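The stream counts for the forwarding arrangement of fig. 2 and the mixing arrangement of fig. 3 follow from simple arithmetic, sketched below with hypothetical function names.

```python
def forwarding_streams(speakers, listeners):
    """Server stream counts when every speaker's audio is forwarded as is.

    Each speaker receives the other speakers; each listener receives all
    speakers. Returns (uplink, downlink).
    """
    uplink = speakers
    downlink = speakers * (speakers - 1) + listeners * speakers
    return uplink, downlink

def mixing_streams(speakers, listeners):
    """Server stream counts when one mixed stream is sent to each participant."""
    return speakers, speakers + listeners

before = forwarding_streams(5, 1)   # the 6-person example of fig. 2
after = mixing_streams(5, 1)        # the same conference with mixing, fig. 3
```

For the 6-person example this gives 5 uplink and 25 downlink streams without mixing, against 5 uplink and 6 downlink streams with mixing; the downlink count grows roughly quadratically with the number of speakers in the first case and linearly with the number of participants in the second.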

The system 40 comprises:

a receiving module 41, configured to receive the voice packet of the conferee from the multiple audio channels, and decode the voice packet into audio data.

Optionally, the module puts the received voice packet into a receiving buffer, compares the timestamps, discards the voice packet arriving after the preset time is exceeded, and then decodes the voice packet in the buffer into audio data. Thus, confusion of sound contents due to transmission delay can be avoided.

A detecting module 42, configured to detect a sound active state in the audio data, and divide the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group.

The module divides the audio data according to its audio state, so that the audio of the current main participants is transmitted effectively and noise and the audio of secondary participants are removed. Specifically, the multiple paths of audio data are grouped according to the result of sound activity detection into an active state group and an inactive state group. Audio data containing the speech of a current participant is placed in the active state group; audio data from channels in which no sound activity is detected, or whose sound quality is poor (for example noise, too low a volume, or too low a clarity), is placed in the inactive state group. The active state group is passed on to wait for mixing. Because the content of the inactive state group does not need to be transmitted, its audio data is discarded after the division and does not take part in subsequent mixing, which effectively improves the transmission quality and accuracy of the sound.

And a mixing module 43, configured to mix audio data of the active state group to form a mixed audio data packet.

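Specifically, mixing the audio data of the active state group includes:

when the audio data of the active state group is less than or equal to n paths, mixing all the audio data of the active state group;

when the audio data of the active state group is larger than n paths, selecting n paths of audio data with larger sound energy for sound mixing;

where n is a predetermined natural number greater than 1, such as 3.

In order to make the sound transition more natural and optimize the experience of the participants, mixing the audio data of the active state group further comprises:

dividing the audio data of the active state group into historical active voice and current active voice according to whether the corresponding audio channel participated in the last mixing;

when the audio data of the active state group is less than or equal to n paths, mixing the current active voice and performing fade-in processing;

when the audio data of the active state group is larger than n paths, mixing the current active voice among the n paths of audio data with larger sound energy and performing fade-in processing; mixing the historical active voice among the remaining audio data and performing fade-out processing; and discarding the current active voice among the remaining audio data without mixing it;

where n is a predetermined natural number greater than 1, such as 3.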

A sending module 44, configured to encode the audio mixing data packets respectively corresponding to audio channels and send the encoded audio mixing data packets to the participants.

Fig. 5 shows a specific embodiment of the audio server mixing method of the present invention. Voice activity detection is performed on the multi-channel audio data obtained by decoding the voice packets of the multiple audio channels; the audio data is divided into an active state group and an inactive state group according to the detection result, and the audio data of the inactive state group is discarded.

It is then judged whether the active state group contains more than 3 paths of audio data. When it contains 3 or fewer paths, all the audio data of the active state group is mixed; when it contains more than 3 paths, the 3 paths of audio data with larger sound energy are selected for mixing.

The audio data of the active state group is further divided into historical active voice and current active voice, and fades are applied as follows:

when the audio data of the active state group is less than or equal to 3 paths, the current active voice is mixed with fade-in processing, and the historical active voice is mixed normally;

when the audio data of the active state group is larger than 3 paths, among the 3 paths of audio data with larger sound energy the current active voice is mixed with fade-in processing and the historical active voice is mixed normally; among the remaining audio data, the historical active voice is mixed with fade-out processing, and the current active voice is discarded and not mixed.

Finally, each audio channel is encoded with the encoding mode corresponding to that channel and transmitted.

In addition, the present invention provides a readable storage medium having stored thereon a computer program connected via a bus, the program, when executed by a processor, implementing any of the audio server mixing methods.

In addition, the present invention further provides an apparatus comprising a memory and a processor, wherein the memory is used for storing a computer program and the processor is used for executing the computer program stored in the memory, so that the apparatus executes any one of the audio server mixing methods; each part of the system is defined, called, computed and stored through a program set in the apparatus.

The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The memory may include a random access memory (random access memory) and may further include a non-volatile memory (non-volatile memory), such as at least one disk memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a network processor, and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In summary, as shown in fig. 5, the present invention does not linearly superimpose the data of all speakers; it selects only the data of the speakers with larger speech energy for mixing. The amount of CPU computation therefore does not grow as the number of speakers increases, and the sound does not become unclear when several speakers talk at once, so CPU computation is reduced while the voice quality is improved. Meanwhile, a corresponding encoding format can be specified when the audio stream is sent to each participant, so a conference terminal does not need to support every decoding format, only the corresponding decoder; this greatly improves adaptability and makes the invention applicable to different mobile terminals. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

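1. An audio server mixing method, comprising:

receiving voice packets of participants from a plurality of audio channels, and decoding the voice packets into audio data;

detecting sound active states in the audio data, and dividing the audio data into an active state group and an inactive state group; discarding the audio data of the inactive state group;

mixing the audio data of the active state group to form a mixed sound data packet;

and coding the audio mixing data packets corresponding to audio channels respectively and sending the audio mixing data packets to the participants.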

2. The audio server mixing method according to claim 1, wherein receiving the voice packets of the participants comprises: placing the received voice packets into a receiving buffer, comparing their timestamps, and discarding any voice packet that arrives later than a preset time.

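3. The audio server mixing method according to claim 1, wherein mixing the audio data of the active state group comprises:

when the audio data of the active state group is less than or equal to n paths, mixing all the audio data of the active state group;

when the audio data of the active state group is larger than n paths, selecting n paths of audio data with larger sound energy for sound mixing;

wherein n is a preset natural number greater than 1.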

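4. The audio server mixing method according to claim 3, wherein mixing the audio data of the active state group further comprises:

dividing the audio data of the active state group into historical active voice and current active voice according to whether the audio channel corresponding to the audio data participated in the last mixing, wherein the historical active voice is the audio data of an audio channel that participated in the last mixing, and the current active voice is the audio data of an audio channel that did not participate in the last mixing;

when the audio data of the active state group is less than or equal to n paths, mixing the current active voice and performing fade-in processing; when the audio data of the active state group is larger than n paths, mixing the current active voice among the n paths of audio data with larger sound energy and performing fade-in processing; mixing the historical active voice among the remaining audio data and performing fade-out processing; and discarding the current active voice among the remaining audio data without mixing it.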

5. The audio server mixing method according to claim 4, wherein the fade-in process is a process in which a volume is increased stepwise from an initial value to a normal value; the fade-out process is a process in which the volume is gradually reduced from a normal value to zero.

6. The audio server mixing method according to claim 5, wherein the audio server is a centralized mixing server, and the mixing data packet includes voice frames with the same frame size and frequency.

7. The audio server mixing method according to claim 1, wherein when the mixing packets are encoded corresponding to audio channels, encoding formats of different audio channels are different.

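8. An audio server system, comprising:

a receiving module, configured to receive voice packets of participants from a plurality of audio channels and decode the voice packets into audio data;

a detection module, configured to detect the sound active state in the audio data, divide the audio data into an active state group and an inactive state group, and discard the audio data of the inactive state group;

an audio mixing module, configured to mix the audio data of the active state group to form an audio mixing data packet;

and a sending module, configured to encode the audio mixing data packets respectively corresponding to audio channels and send the audio mixing data packets to the participants.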

9. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the audio server mixing method of any one of claims 1 to 7.

10. A mixing apparatus, characterized in that the mixing apparatus comprises a memory for storing a computer program and a processor for executing the computer program stored by the memory to cause the apparatus to perform the audio server mixing method according to any one of claims 1 to 7.