CN114974273B - Conference audio mixing method and device - Google Patents

Conference audio mixing method and device

Info

Publication number
CN114974273B
CN114974273B (application CN202110915604.9A)
Authority
CN
China
Prior art keywords
audio
sound pressure
conference
pressure level
mixing
Prior art date
Legal status
Active
Application number
CN202110915604.9A
Other languages
Chinese (zh)
Other versions
CN114974273A (en)
Inventor
陈东恺
庄仁峰
胡文辉
郑敏
吴华挚
孙立军
谭俊
宁志刚
林德煜
江继针
肖聪
梁祯夏
阮晓军
江旭
刘哲
潘一佩
陈锦坤
李刚
项华杰
梁照江
刘雄威
覃复慧
张子成
王磊
彭星桂
卓彩霞
李盛阁
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Internet Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110915604.9A
Publication of CN114974273A
Application granted
Publication of CN114974273B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a conference audio mixing method and apparatus for solving the problem of the high processing load of conference audio mixing. The scheme provided by the application comprises the following steps: receiving a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in a conference; decapsulating the plurality of paths of conference audio to be mixed; mixing the decapsulated multi-path conference audio to obtain mixed conference audio; and encapsulating the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference. The application can mix the multi-path conference audio to be mixed in a centralized manner, avoiding repeated mixing and improving mixing efficiency. In addition, the mixed conference audio can be multiplexed to a plurality of audio devices without repeated processing, reducing the mixing processing load.

Description

Conference audio mixing method and device
Technical Field
The present application relates to the field of communications, and in particular, to a method and apparatus for mixing conference audio.
Background
In the field of communications, multiple users can hold multi-party voice or video conferences over a network. In a multi-party conference, participants communicate through conference audio. In such application scenarios, the conference audio of the participating users needs to be mixed and output so that each participant can hear the voices of the other users.
However, when the number of participating users is large, performing mixing separately for each user's voice leads to an excessive processing load, wasted resources, long delays, and a poor-quality mixing result that is difficult to use for real-time conferences.
How to reduce the processing load of conference audio mixing is therefore the technical problem to be solved by the present application.
Disclosure of Invention
The embodiments of the present application aim to provide a conference audio mixing method and apparatus for solving the problem of the high processing load of conference audio mixing.
In a first aspect, a conference audio mixing method is provided, including:
receiving a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in a conference;
decapsulating the plurality of paths of conference audio to be mixed;
mixing the decapsulated multi-path conference audio to obtain mixed conference audio;
and encapsulating the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference.
In a second aspect, a conference audio mixing apparatus is provided, including:
a receiving module, configured to receive a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in a conference;
a decapsulation module, configured to decapsulate the plurality of paths of conference audio to be mixed;
an audio mixing module, configured to mix the decapsulated multi-path conference audio to obtain mixed conference audio;
and an encapsulation module, configured to encapsulate the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference.
In a third aspect, there is provided an electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program implementing the steps of the method as in the first aspect when executed by the processor.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as in the first aspect.
In the embodiments of the application, a plurality of paths of conference audio to be mixed are received, the paths being sent by a plurality of audio devices participating in a conference; the plurality of paths of conference audio to be mixed are decapsulated; the decapsulated multi-path conference audio is mixed to obtain mixed conference audio; and the mixed conference audio is encapsulated, the encapsulated conference audio being output to at least one audio device participating in the conference. The application can mix the multi-path conference audio to be mixed in a centralized manner, avoiding repeated mixing and improving mixing efficiency. In addition, the mixed conference audio can be multiplexed to a plurality of audio devices without repeated processing, reducing the mixing processing load.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1a is a first schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 1b is a second schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 2 is a third schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 3 is a fourth schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 4 is a fifth schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 5a is a sixth schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 5b is a seventh schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 6 is an eighth schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 7a is a ninth schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 7b is a tenth schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 7c is an eleventh schematic flowchart of a conference audio mixing method according to an embodiment of the present application.
fig. 8 is a schematic structural diagram of a conference audio mixing apparatus according to an embodiment of the present application.
Detailed Description
The following clearly and completely describes the embodiments of the present application with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without creative effort shall fall within the protection scope of the application. The reference numerals in the present application are only used to distinguish the steps in the scheme and are not intended to limit the execution order of the steps; the specific execution order is subject to the description in the specification.
In order to solve the problems in the prior art, an embodiment of the present application provides a conference audio mixing method. The execution subject of the scheme may be a server or another electronic device with a mixing processing function. As shown in fig. 1a, the scheme includes:
S11: receiving a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in the conference.
An audio device participating in the conference may be a device with audio capturing and audio playing functions, and may transmit the captured audio directly or indirectly. The audio devices participating in the conference may differ from one another; for example, an audio device may be a simulated participant entity such as a voice conference recorder or a voice conference notifier, so as to support application scenarios such as conference recording and global message notification. In this embodiment, the data received by the server side is called downlink voice, and the data sent by the server side after processing is called uplink voice. The received conference audio to be mixed may be a network voice stream, specifically a voice data exchange stream based on the RTP protocol (RFC 3550). Fig. 1b shows the flow of performing mixing on four network voice streams.
The audio devices participating in the conference can upload the captured audio in real time over the Internet, Bluetooth, or other communication means, and the server can receive each path of conference audio to be mixed through IO threads. Mixing refers to the process of combining and computing multiple independent voice samples to synthesize a section of mixed voice. In an application scenario where a plurality of audio devices participate in a conference, the conference audio captured by the plurality of audio devices needs to be mixed; the mixed conference audio enables each audio device participating in the conference to receive the voices of the other participants, thereby enabling conference communication.
S12: decapsulating the plurality of paths of conference audio to be mixed.
In this step, a matching decapsulation mode may be selected according to the format or other attributes of the conference audio to be mixed. For example, Real-time Transport Protocol (RTP) decapsulation may be performed on each path of conference audio to be mixed, and decoding may also be performed according to actual requirements. Referring to fig. 1b, the decapsulated multi-path conference audio may be stored in a downlink sample register for retrieval in the subsequent mixing step.
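For illustration only, the following is a minimal Python sketch of how the RTP decapsulation mentioned above might parse the fixed header defined in RFC 3550. The patent does not specify a parser; the function name and the returned field names are illustrative, and payload decoding and the padding bit are left out.

```python
import struct

def parse_rtp_packet(packet: bytes):
    """Parse the 12-byte RTP fixed header (RFC 3550) and return the header
    fields plus the raw payload. CSRC entries and a header extension are
    skipped; the padding bit is ignored in this sketch."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the RTP fixed header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    csrc_count = b0 & 0x0F
    has_extension = bool(b0 & 0x10)
    payload_type = b1 & 0x7F
    offset = 12 + 4 * csrc_count              # skip CSRC identifiers, if any
    if has_extension:                          # skip one header extension, if present
        ext_len = struct.unpack("!H", packet[offset + 2:offset + 4])[0]
        offset += 4 + 4 * ext_len
    return {
        "version": version,
        "payload_type": payload_type,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[offset:],
    }
```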
S13: mixing the decapsulated multi-path conference audio to obtain the mixed conference audio.
This step may perform the mixing with one or more threads. Optionally, the multi-path conference audio is mixed centrally by a single thread, and the mixed conference audio can then be multiplexed to a plurality of audio devices. This avoids performing the mixing in multiple threads, avoids the context-switching overhead caused by too many threads, and effectively reduces the mixing processing load.
Referring to fig. 1b, the mixed conference audio may be stored in an uplink voice register for convenient retrieval in the subsequent encapsulation step.
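As a simple illustration of the sample-level operation, the sketch below sums equal-length frames of decoded 16-bit PCM and clips the result. The patent does not prescribe a particular mixing formula; straight summation with clipping is only an assumed example.

```python
def mix_frames(frames):
    """Mix several decoded 16-bit PCM frames (sequences of equal length)
    by sample-wise addition, clipping the sum to the int16 range."""
    if not frames:
        return []
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        mixed.append(max(-32768, min(32767, total)))  # clip to int16
    return mixed
```

For example, mixing the two frames [100, -200] and [50, 300] yields [150, 100].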
S14: encapsulating the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference.
In this step, the encapsulation may be performed in a manner matching the decapsulation in the preceding step, or in another manner as needed. Optionally, RTP encapsulation is performed, and the RTP-encapsulated conference audio can be output directly to the audio devices. Alternatively, the mixed conference audio may be encapsulated or encoded as required by the audio device at the receiving end.
The scheme provided by the embodiment of the application can be used to implement near-real-time voice calls. A high-quality near-real-time voice call requires that voice data packets arrive accurately and at uniform intervals, with low voice delay. By performing the mixing in the same thread, the scheme can effectively reduce the mixing processing load, avoid the switching required by multi-threaded mixing, and avoid the waste of processing resources caused by repeatedly mixing the same voice in multiple threads.
Based on the solution provided in the foregoing embodiment, optionally, as shown in fig. 2, before step S13 the method further includes:
S21: performing frame-by-frame sound pressure level preprocessing on each path of the decapsulated conference audio to determine the sound pressure level corresponding to each audio frame of each path of conference audio.
In this step, the frame-by-frame sound pressure level preprocessing of each path of conference audio may be performed in parallel, and the preprocessed conference audio may be stored in a downlink sample register so that it can be retrieved in the mixing step.
The frame-by-frame sound pressure level preprocessing is used to determine the sound pressure level corresponding to each audio frame in each path of conference audio. Optionally, each preprocessed audio frame corresponds to one sound pressure level.
The sound pressure level can be understood as the volume, expressed in decibels. The sound pressure level corresponding to an audio frame can indicate whether that frame is valid, and thus whether mixing should be performed on it.
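The patent does not give a formula for the per-frame sound pressure level; the sketch below assumes an RMS level in decibels for 16-bit samples, shifted so that valid levels are non-negative, which matches the description above.

```python
import math

def frame_sound_pressure_level(samples, full_scale=32768.0):
    """Estimate the sound pressure level of one decoded 16-bit PCM frame
    as an RMS level in dB, shifted into a non-negative range."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return 0.0
    # dBFS is <= 0; adding the ~96 dB dynamic range of int16 yields a
    # non-negative "volume in decibels" as described in the text.
    return max(0.0, 20.0 * math.log10(rms / full_scale) + 96.0)
```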
Wherein S13 includes:
S22: mixing, by the same thread, the audio frames whose sound pressure level is greater than a first preset sound pressure level threshold to obtain the mixed audio frames.
In this step, the audio frames on which mixing is performed are selected based on their sound pressure levels. The first preset sound pressure level threshold can be set in advance according to requirements. An audio frame whose level is greater than the first preset sound pressure level threshold can be regarded as a voice audio frame of a main speaker of the conference, whereas audio frames whose level is less than or equal to the first preset sound pressure level threshold are frames of ambient sound or noise, belong to invalid audio data of the conference, and do not need to be mixed.
The mixing step may be driven by a high-precision timer, so that mixing is performed periodically at preset time intervals while conference audio is being received. The scheme provided by this embodiment can, on the one hand, improve the quality of the mixing result and, on the other hand, effectively reduce the amount of audio-frame data on which mixing must be performed, thereby reducing the mixing processing load.
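A minimal sketch of the timer-driven mixing loop described above, assuming a 20 ms frame interval (the interval is not specified in the text). The callback mix_once stands for one centralized mixing pass over the frames collected so far.

```python
import threading
import time

def run_mixing_loop(mix_once, frame_interval_s=0.020, stop_event=None):
    """Call mix_once() periodically on a monotonic clock so that timing
    drift does not accumulate between mixing passes."""
    stop_event = stop_event or threading.Event()
    next_tick = time.monotonic()
    while not stop_event.is_set():
        mix_once()                        # one centralized mixing pass
        next_tick += frame_interval_s
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        else:
            next_tick = time.monotonic()  # running late; resynchronize
```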
Based on the solution provided in the foregoing embodiment, optionally, as shown in fig. 3, S21 includes:
S31: determining the average sound pressure level of the target-period conference audio corresponding to a decapsulated first audio frame, wherein the first audio frame belongs to the target-period conference audio, and the end time of the target period is the time at which the first audio frame is located.
S32: determining the sound pressure level corresponding to the first audio frame according to the average sound pressure level.
In this step, the average sound pressure level of a section of conference audio is taken as the sound pressure level corresponding to the last audio frame of that section. In practice, speech has a certain continuity, so the scheme provided by this embodiment can improve the integrity of the voice after mixing. Moreover, the computational load of determining the sound pressure level can be effectively reduced, thereby reducing the overall processing load of mixing.
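The sketch below illustrates averaging over a target period that ends at the current frame. The length of the period is not given in the text; 25 frames (for example, 25 x 20 ms = 500 ms) is only a placeholder.

```python
from collections import deque

class PeriodAverager:
    """Keep the per-frame levels of the most recent target period and
    return their average; the period ends at the frame just appended."""

    def __init__(self, frames_per_period=25):
        self.levels = deque(maxlen=frames_per_period)

    def average_for(self, frame_level):
        self.levels.append(frame_level)
        return sum(self.levels) / len(self.levels)
```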
Based on the solution provided in the foregoing embodiment, optionally, as shown in fig. 4, S32 includes:
S41: if the average sound pressure level is greater than a first preset sound pressure level threshold, determining that the sound pressure level corresponding to the first audio frame is the average sound pressure level.
S42: if the average sound pressure level is less than or equal to the first preset sound pressure level threshold, determining that the sound pressure level corresponding to the first audio frame is a second preset sound pressure level threshold, wherein the second preset sound pressure level threshold is less than or equal to the first preset sound pressure level threshold.
In the step of determining the sound pressure level of an audio frame, if the average sound pressure level of a piece of audio is greater than the first preset sound pressure level threshold, it can be determined that the audio includes voice audio of a conference speaker. The sound pressure level corresponding to the audio frame at the end of that piece of audio is then set to the average sound pressure level of the audio, so that the frame is mixed in the mixing step, ensuring the integrity of the conference voice after mixing.
If the average sound pressure level of the piece of audio is less than or equal to the first preset sound pressure level threshold, the piece of audio does not include voice audio of a conference speaker and may contain only ambient sound or noise. In this case, the sound pressure level of the audio frame at the end of the piece of audio is set to the second preset sound pressure level threshold. Because the second preset sound pressure level threshold is less than or equal to the first preset sound pressure level threshold, audio with too small a sound pressure level is not mixed in the mixing step, which effectively reduces the mixing processing load. Optionally, the second preset sound pressure level threshold may be -1. Since sound pressure levels are theoretically greater than 0, setting the sound pressure level of an audio frame to -1 effectively marks it as an invalid frame and prevents it from being mixed in the mixing step.
In this step, an audio frame can thus be marked by the sound pressure level determined for it. Specifically, an audio frame whose corresponding sound pressure level is greater than the first preset sound pressure level threshold is a valid audio frame and is mixed in the subsequent step, whereas an audio frame whose corresponding sound pressure level is less than or equal to the first preset sound pressure level threshold is an invalid audio frame, possibly ambient sound or noise, and is not mixed in the subsequent step. The method provided by this embodiment of the application can improve the mixing result and reduce the mixing processing load.
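A small sketch of the thresholding just described. The concrete value of the first preset threshold is not disclosed; 40.0 dB here is purely illustrative, and -1 is used as the second preset threshold to mark invalid frames, as in the text.

```python
INVALID_LEVEL = -1.0  # second preset threshold: marks a frame that will not be mixed

def label_frame_level(average_level, first_threshold=40.0):
    """Return the sound pressure level to attach to the frame at the end of
    the period: the average itself if it exceeds the first threshold,
    otherwise the invalid marker -1 so the mixing step skips the frame."""
    return average_level if average_level > first_threshold else INVALID_LEVEL
```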
Based on the solution provided in the foregoing embodiment, optionally, as shown in fig. 5a, if the average sound pressure level is greater than a first preset sound pressure level threshold, the method further includes:
s51: the average sound pressure level is stored.
In this step, the average sound pressure level may be stored in a cache. In addition, for ease of later retrieval, the stored value may also be labeled as the cached sound pressure level.
Wherein S21 includes:
S52: when the time difference between the time at which the average sound pressure level is stored and the time at which a decapsulated second audio frame is located is less than a preset time length, determining that the sound pressure level corresponding to the second audio frame is the stored average sound pressure level.
In this step, the preset time length may be set in advance according to actual requirements: a longer preset time length effectively reduces the processing load, while a shorter preset time length improves the quality of the mixing. If the time difference between the time of the decapsulated second audio frame and the time at which the sound pressure level was stored in the cache is less than the preset time length, the sound pressure level corresponding to the second audio frame can be determined to be the cached sound pressure level. The scheme provided by this embodiment can improve the efficiency of determining the sound pressure level corresponding to each audio frame and reduce the processing load.
Referring to fig. 5b, in the scheme provided in this embodiment, the IO threads receive the multiple paths of voice data to be mixed, and the average sound pressure level of each path is then determined. For any one path of voice data, RTP decapsulation and audio decoding are first performed to obtain the original sample frames of that path, which are used to determine the corresponding sound pressure levels in the subsequent steps. The sound pressure level of each voice entity participating in the conference is then cumulatively averaged over a period of time. If an average sound pressure level determined a short time ago is still stored in the cache, the cached value is used directly as the average sound pressure level of the current frame; otherwise, the average sound pressure level of the voice data over a period ending at the current frame is calculated. If the calculated average sound pressure level exceeds the threshold, it is taken as the average sound pressure level of the current frame and stored in the cache; if it is less than or equal to the threshold, the average sound pressure level of the current frame is set to -1 and -1 is stored in the cache. Through these steps, the average sound pressure level of each path of voice data is determined frame by frame to obtain the sound pressure level corresponding to each frame.
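Putting the pieces together, the following sketch follows the per-frame flow of fig. 5b for one channel: reuse a recently cached level while it is fresh, otherwise recompute the period average and threshold it. The cache lifetime and threshold values are assumptions, and PeriodAverager refers to the sketch given earlier.

```python
import time

class SoundPressurePreprocessor:
    """Per-channel frame preprocessing along the lines of fig. 5b."""

    def __init__(self, averager, threshold=40.0, cache_ttl_s=0.2):
        self.averager = averager        # e.g. a PeriodAverager instance
        self.threshold = threshold      # first preset threshold (assumed value)
        self.cache_ttl_s = cache_ttl_s  # preset time length (assumed value)
        self.cached_level = None
        self.cached_at = 0.0

    def level_for_frame(self, frame_level):
        now = time.monotonic()
        if self.cached_level is not None and now - self.cached_at < self.cache_ttl_s:
            return self.cached_level                            # reuse the cached level
        average = self.averager.average_for(frame_level)
        level = average if average > self.threshold else -1.0   # -1 marks an invalid frame
        self.cached_level = level
        self.cached_at = now
        return level
```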
Based on the solution provided in the foregoing embodiment, optionally, as shown in fig. 6, S22 includes:
S61: sorting the audio frames whose sound pressure level is greater than the first preset sound pressure level threshold in descending order of sound pressure level.
In this step, the audio frames are sorted according to their corresponding sound pressure levels, so they can be ranked by importance: audio frames with higher sound pressure levels are more likely to be voice audio frames of the conference speakers and are therefore more important.
S62: mixing, by the same thread, a preset number of top-ranked audio frames in the sorting result to obtain the mixed audio frames.
This step further filters the audio frames to be mixed based on the sorting result and selects only the preset number of frames at the top of the ranking, which can further reduce the mixing processing load. Because the sorting result reflects the importance of the audio frames in the conference voice, selecting the top-ranked frames for mixing keeps the voice audio of the conference speakers in the mixing result while preventing the result from including ambient sound or noise, thus ensuring the quality of the mixing result.
In practice, in most multi-path conferences the truly effective speech usually comes from only two or three paths, and most non-speaking participants only need to receive the speaker's voice. The noise data of non-speaking participants typically has a small sound pressure level and does not persist, so it effectively does not need to participate in the mixing calculation. Based on this reality, a preset number is set in the scheme, and only audio frames whose average sound pressure level is greater than the first preset sound pressure level threshold and which rank within the preset number in descending order are selected for mixing. The first preset sound pressure level threshold and the preset number can be configured individually for different scenarios, so that while a good mixing result is guaranteed, the processor load can be greatly reduced and the throughput of the server side improved.
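The selection described above might look like the sketch below, where labeled_frames pairs each channel with the level attached during preprocessing; the threshold and the preset number (three paths here) are illustrative.

```python
def select_frames_to_mix(labeled_frames, first_threshold=40.0, max_paths=3):
    """labeled_frames: iterable of (channel_id, sound_pressure_level, samples).
    Keep only frames above the first preset threshold, sort them by level in
    descending order, and return at most the preset number of frames."""
    candidates = [f for f in labeled_frames if f[1] > first_threshold]
    candidates.sort(key=lambda f: f[1], reverse=True)
    return candidates[:max_paths]
```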
Building on the structure of the actual conference devices, the scheme provided by this embodiment can also efficiently and conveniently embed simulated participant entities such as a "voice conference recorder" or a "voice conference notifier", supporting scenarios such as conference recording and global message notification.
Based on the solution provided in the foregoing embodiment, optionally, as shown in fig. 7a, S22 includes:
S71: determining mixing rule information of a target audio device among the plurality of audio devices participating in the conference.
In some application scenarios, certain audio devices among the plurality participating in the conference require special mixing. In this step, the mixing rule information of the target audio device is determined; it may be sent to the server by the target audio device, or determined by the server by querying information about the target audio device. In practice, it may be determined separately for each audio device participating in the conference whether it has corresponding mixing rule information.
S72: mixing, according to the mixing rule information, the audio frames whose sound pressure level is greater than the first preset sound pressure level threshold in the conference audio sent by the target audio device, to obtain a first mixed frame.
S73: mixing, according to preset mixing rule information, the audio frames whose sound pressure level is greater than the first preset sound pressure level threshold and which are outside the conference audio sent by the target audio device, to obtain a second mixed frame.
The execution order of steps S72 and S73 may be exchanged, or the two steps may be executed simultaneously. When the mixing rule information of a target audio device participating in the conference indicates that special mixing is required, mixing is performed on the audio frames in the conference audio sent by that device whose sound pressure level is greater than the first preset sound pressure level threshold, with the specific mixing rule determined by the mixing rule information. The resulting mixed frame conforms to the mixing rule information of the target audio device, enabling personalized special mixing.
For audio devices that do not require special mixing, the audio frames whose sound pressure level is greater than the first preset sound pressure level threshold are mixed in step S73. The specific mixing procedure is determined by the preset mixing rule information, which can be set according to actual requirements.
Wherein S14 includes:
S74: performing RTP encapsulation on the first mixed frame and the second mixed frame, wherein the encapsulated conference audio containing the first mixed frame and the second mixed frame is output to the target audio device.
In this step, performing RTP encapsulation on the first mixed frame and the second mixed frame makes the encapsulation result contain both the first mixed frame produced by the special mixing and the second mixed frame produced according to the common mixing rule. The encapsulation result is sent to the target audio device, satisfying its special mixing requirement.
The scheme provided by this embodiment distinguishes special mixing requirements from common mixing requirements. Special mixing can represent that several speaking users each need a separate mix, while common mixing can represent the other users in the conference who do not speak and only listen; the mixed data output to these users is identical and can be reused. With this scheme, a mixing result can be obtained by mixing once in the same thread and then multiplexed many times, avoiding repeated mixing calculations and effectively reducing the mixing processing load.
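The distinction between special and common mixing could be organized as in the sketch below, where a member's mixing rule is represented as a callable and mix_frames stands for the sample-wise mixing shown earlier; the data layout is assumed, not taken from the patent.

```python
def mix_for_members(selected, member_rules, mix_frames):
    """selected: list of (channel_id, level, samples) chosen for mixing.
    member_rules: dict mapping member_id to a special mixing rule (callable)
    or None. The common mix is computed once by the same thread and reused
    for every member without a special rule."""
    common_mix = mix_frames([samples for _, _, samples in selected])
    outputs = {}
    for member_id, rule in member_rules.items():
        if rule is None:
            outputs[member_id] = common_mix      # reusable common result
        else:
            outputs[member_id] = rule(selected)  # personalized special mix
    return outputs
```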
In addition, in the scheme provided by this embodiment, whether the set of conference member entities is empty can be determined before mixing is performed, so as to judge whether any device will receive the mixed audio; if no device will receive the mixed audio, no mixing needs to be performed on the received audio.
Optionally, after the frame-by-frame sound pressure level preprocessing is performed on each path of the decapsulated conference audio, the method may further include the following step:
storing the audio frames of each path of conference audio, in association with their corresponding sound pressure levels, in a downlink sample register.
The sound pressure level preprocessing of each path of conference audio can be executed in parallel, and the preprocessed audio frames of each path can be stored centrally in the downlink sample register so that they can be conveniently retrieved during mixing.
Mixing, by the same thread, the audio frames whose sound pressure level is greater than the first preset sound pressure level threshold to obtain the mixed audio frames then includes:
mixing, by the same thread, the audio frames stored in the downlink sample register whose sound pressure level is greater than the first preset sound pressure level threshold to obtain the mixed audio frames, wherein the mixed audio frames are output to at least one uplink voice register, and each uplink voice register corresponds to an audio device participating in the conference.
In the solution provided in this embodiment, the uplink voice registers may correspond one-to-one to the audio devices participating in the conference. The data stored in the uplink voice registers may be periodically encapsulated and sent to the corresponding audio devices.
Performing RTP encapsulation on the mixed conference audio then further includes:
performing RTP encapsulation on the mixed audio frames stored in an uplink voice register, wherein the encapsulated conference audio is output to the audio device corresponding to that uplink voice register.
The RTP encapsulation performed in this step may be executed periodically according to a predetermined time length, and the result may be sent directly to the corresponding audio device once encapsulation is complete.
As shown in fig. 7b, the mixing step provided in this embodiment may be triggered periodically based on a preset duration. First, the frames to be mixed are obtained from the downlink sample register, and the first n audio frames are selected in descending order of their corresponding sound pressure levels, where n is a positive integer representing the number of mixing paths and may be preset according to requirements.
Then, the member entities participating in the conference are traversed to determine whether each member receives mixed audio; if no member receives mixed audio, no mixing needs to be performed. If at least one member entity receives mixed audio, it is determined whether that member requires special mixing. If so, the member's special mixing rule is obtained and special mixing is performed for that member, while common mixing is performed on the audio for the other members according to the preset rule. The result of the special mixing is output to the uplink voice register of the member that requires it.
For a member that does not require special mixing, the result of the common mixing is output to the uplink voice register corresponding to that member. The mixing result stored in the uplink voice register is actively fetched by the conference member, and the buffered mixed audio is deleted from the register after it has been fetched. If the buffer is empty, indicating that the previous mixing result has already been obtained by the conference member, the new mixing result can be stored in the buffer; if the buffer is not empty, the thread waits until the buffered mixing result has been fetched and the buffer is empty before storing the new mixing result.
In addition, when RTP encapsulation is performed, the data in the uplink voice register may be deleted after the stored audio frame has been fetched. After a mixed audio frame is obtained by mixing, it is first checked whether the uplink voice register is empty, and the mixed audio frame is stored in the register only if it is empty. This prevents data in the uplink voice register from being overwritten by newly mixed data before it has been encapsulated and sent, avoids audio loss, and ensures the continuity and integrity of the audio data received by the receiving-end audio device.
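The store-only-if-empty behaviour of the uplink voice register can be sketched as a single-slot buffer; the class name and methods are illustrative rather than taken from the patent.

```python
import threading

class UplinkVoiceRegister:
    """Single-slot buffer for one conference member: a new mixed frame is
    stored only when the previous one has been taken for encapsulation,
    so unsent data is never overwritten."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def try_store(self, mixed_frame):
        with self._lock:
            if self._frame is None:   # store only into an empty register
                self._frame = mixed_frame
                return True
            return False              # caller waits and retries later

    def take(self):
        with self._lock:
            frame, self._frame = self._frame, None  # read and clear
            return frame
```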
As shown in fig. 7c, the period for performing encapsulation may be preset. Based on this preset duration, the mixed audio frames are periodically obtained from the uplink voice registers, RTP encapsulation is performed, and the encapsulated voice data is output to the conference members through the IO threads.
In order to solve the problems in the prior art, an embodiment of the present application further provides a conference audio mixing apparatus 80, as shown in fig. 8, including:
a receiving module 81, configured to receive a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in a conference;
a decapsulation module 82, configured to decapsulate the plurality of paths of conference audio to be mixed;
an audio mixing module 83, configured to mix the decapsulated multi-path conference audio to obtain mixed conference audio;
and an encapsulation module 84, configured to encapsulate the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference.
In the embodiments of the application, a plurality of paths of conference audio to be mixed are received, the paths being sent by a plurality of audio devices participating in a conference; the plurality of paths of conference audio to be mixed are decapsulated; the decapsulated multi-path conference audio is mixed to obtain mixed conference audio; and the mixed conference audio is encapsulated, the encapsulated conference audio being output to at least one audio device participating in the conference. The application can mix the multi-path conference audio to be mixed in a centralized manner, avoiding repeated mixing and improving mixing efficiency. In addition, the mixed conference audio can be multiplexed to a plurality of audio devices without repeated processing, reducing the mixing processing load.
Preferably, an embodiment of the present application further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements each process of the above conference audio mixing method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here.
An embodiment of the application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements each process of the above conference audio mixing method embodiments and can achieve the same technical effects; to avoid repetition, details are not repeated here. The computer-readable storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is preferred. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may devise many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (9)

1. A method for mixing conference audio, comprising:
receiving a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in a conference;
decapsulating the plurality of paths of conference audio to be mixed;
performing frame-by-frame sound pressure level preprocessing on each path of the decapsulated conference audio to determine the sound pressure level corresponding to each audio frame of each path of conference audio;
mixing, by the same thread, audio frames whose sound pressure level is greater than a first preset sound pressure level threshold to obtain audio frames of the mixed conference audio;
and encapsulating the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference.
2. The method of claim 1, wherein performing frame-by-frame sound pressure level preprocessing on each path of the decapsulated conference audio comprises:
determining an average sound pressure level of target-period conference audio corresponding to a decapsulated first audio frame, wherein the first audio frame belongs to the target-period conference audio, and the end time of the target period is the time at which the first audio frame is located;
and determining the sound pressure level corresponding to the first audio frame according to the average sound pressure level.
3. The method of claim 2, wherein determining the sound pressure level corresponding to the first audio frame from the average sound pressure level comprises:
if the average sound pressure level is greater than a first preset sound pressure level threshold, determining that the sound pressure level corresponding to the first audio frame is the average sound pressure level;
and if the average sound pressure level is less than or equal to the first preset sound pressure level threshold, determining that the sound pressure level corresponding to the first audio frame is a second preset sound pressure level threshold, wherein the second preset sound pressure level threshold is less than or equal to the first preset sound pressure level threshold.
4. The method of claim 3, wherein if the average sound pressure level is greater than a first preset sound pressure level threshold, further comprising:
storing the average sound pressure level;
wherein performing frame-by-frame sound pressure level preprocessing on each path of the decapsulated conference audio to determine the sound pressure level corresponding to each audio frame of each path of conference audio comprises:
when the time difference between the time at which the average sound pressure level is stored and the time at which a decapsulated second audio frame is located is less than a preset time length, determining that the sound pressure level corresponding to the second audio frame is the stored average sound pressure level.
5. The method of any of claims 1-4, wherein mixing audio frames having a sound pressure level greater than a first preset sound pressure level threshold with the same thread to obtain audio frames of mixed conference audio comprises:
sorting the audio frames whose sound pressure level is greater than the first preset sound pressure level threshold in descending order of sound pressure level;
and mixing, by the same thread, a preset number of top-ranked audio frames in the sorting result to obtain the audio frames of the mixed conference audio.
6. The method of any of claims 1-4, wherein mixing audio frames having a sound pressure level greater than a first preset sound pressure level threshold with the same thread to obtain audio frames of mixed conference audio comprises:
determining mixing rule information of a target audio device of the plurality of audio devices participating in the conference;
mixing, according to the mixing rule information, audio frames whose sound pressure level is greater than the first preset sound pressure level threshold in the conference audio sent by the target audio device, to obtain a first mixed frame;
mixing, according to preset mixing rule information, audio frames whose sound pressure level is greater than the first preset sound pressure level threshold and which are outside the conference audio sent by the target audio device, to obtain a second mixed frame;
wherein encapsulating the mixed conference audio comprises:
performing RTP encapsulation on the first mixed frame and the second mixed frame, wherein the encapsulated conference audio containing the first mixed frame and the second mixed frame is output to the target audio device.
7. A conference audio mixing apparatus, comprising:
a receiving module, configured to receive a plurality of paths of conference audio to be mixed, wherein the paths of conference audio to be mixed are sent by a plurality of audio devices participating in a conference;
a decapsulation module, configured to decapsulate the plurality of paths of conference audio to be mixed;
an audio mixing module, configured to perform frame-by-frame sound pressure level preprocessing on each path of the decapsulated conference audio to determine the sound pressure level corresponding to each audio frame of each path of conference audio, and to mix, by the same thread, audio frames whose sound pressure level is greater than a first preset sound pressure level threshold to obtain audio frames of the mixed conference audio;
and an encapsulation module, configured to encapsulate the mixed conference audio, wherein the encapsulated conference audio is output to at least one audio device participating in the conference.
8. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202110915604.9A 2021-08-10 2021-08-10 Conference audio mixing method and device Active CN114974273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110915604.9A CN114974273B (en) 2021-08-10 2021-08-10 Conference audio mixing method and device

Publications (2)

Publication Number Publication Date
CN114974273A CN114974273A (en) 2022-08-30
CN114974273B true CN114974273B (en) 2023-08-15

Family

ID=82974331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110915604.9A Active CN114974273B (en) 2021-08-10 2021-08-10 Conference audio mixing method and device

Country Status (1)

Country Link
CN (1) CN114974273B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8711736B2 (en) * 2010-09-16 2014-04-29 Apple Inc. Audio processing in a multi-participant conference

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2306452A1 (en) * 2008-07-29 2011-04-06 Panasonic Corporation Sound coding device, sound decoding device, sound coding/decoding device, and conference system
WO2012034476A1 (en) * 2010-09-13 2012-03-22 华为终端有限公司 Method, device and cascade conference system for processing cascade conference sites in cascade conference
CN102543087A (en) * 2011-12-28 2012-07-04 中兴长天信息技术(南昌)有限公司 Audio mixing method applied to multi-point mobile audio communication system
WO2015146057A1 (en) * 2014-03-24 2015-10-01 Sony Corporation Encoding device and encoding method, decoding device and decoding method, and program
CN205069113U (en) * 2015-10-09 2016-03-02 北京竞业达数码科技有限公司 Network mixer
CN107920310A (en) * 2017-11-14 2018-04-17 维沃移动通信有限公司 A kind of method and mobile terminal of multi-microphone recording
CN110995946A (en) * 2019-12-25 2020-04-10 苏州科达科技股份有限公司 Sound mixing method, device, equipment, system and readable storage medium
CN112750444A (en) * 2020-06-30 2021-05-04 腾讯科技(深圳)有限公司 Sound mixing method and device and electronic equipment
CN112104836A (en) * 2020-11-23 2020-12-18 全时云商务服务股份有限公司 Audio mixing method, system, storage medium and equipment for audio server
CN112738451A (en) * 2021-04-06 2021-04-30 浙江华创视讯科技有限公司 Video conference recording and playing method, device, equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an Audio Code Stream Processing System Based on eUMP; Luo Kai; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN114974273A (en) 2022-08-30

Similar Documents

Publication Publication Date Title
DE60123823T2 (en) Devices and methods for handling RTP and TCP traffic
US7567270B2 (en) Audio data control
US8289362B2 (en) Audio directionality control for a multi-display switched video conferencing system
EP3111626B1 (en) Perceptually continuous mixing in a teleconference
CN110060696B (en) Sound mixing method and device, terminal and readable storage medium
CN112104836A (en) Audio mixing method, system, storage medium and equipment for audio server
US20210297534A1 (en) Method and system for facilitating high-fidelity audio sharing
CN105704338B (en) A kind of sound mixing method, audio mixing equipment and system
CN111276152A (en) Audio processing method, terminal and server
CN111628992B (en) Multi-person call control method and device, electronic equipment and storage medium
US8358600B2 (en) Method of transmitting data in a communication system
CN113194335B (en) Streaming media transmission method, transmission equipment and playing equipment
CN1541471A (en) Dynamic Latency management for IP telephone
CN114974273B (en) Conference audio mixing method and device
CN113572898B (en) Method and corresponding device for detecting silent abnormality in voice call
CN111741177B (en) Audio mixing method, device, equipment and medium for online conference
CN111951813A (en) Voice coding control method, device and storage medium
US20070129037A1 (en) Mute processing apparatus and method
CN112188144A (en) Audio transmission method and device, storage medium and electronic device
US20030174657A1 (en) Method, system and computer program product for voice active packet switching for IP based audio conferencing
CN103686220B (en) Method and system for removing network delay in a live broadcast
KR20200045205A (en) Method for service video conference and apparatus for executing the method
CN112887294B (en) Multi-person voice chat system based on remote process call
US20050114524A1 (en) System and method for distributed modeling of real time systems
CN114979793A (en) Live broadcast data transmission method, device, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant