CN110675885B - Sound mixing method, device and storage medium

Info

Publication number: CN110675885B (application CN201910990310.5A)
Authority: CN (China)
Prior art keywords: mixing, audio data, judgment result, voice, current
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110675885A
Inventors: 陈烈, 史巍
Assignee (current and original): Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority and filing date: 2019-10-17
Publication of CN110675885A: 2020-01-10
Publication of CN110675885B (grant): 2022-03-22

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 Speech or audio signals analysis-synthesis techniques for redundancy reduction, using spectral analysis, using subband decomposition
    • G10L19/0208 Subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a sound mixing method, a sound mixing device, and a storage medium. The sound mixing method comprises the following steps: acquiring multiple channels of audio data; determining, for each channel of audio data, whether it participates in mixing, using at least a voice activity detection algorithm based on a Gaussian model together with historical decision results; selecting a mixing algorithm according to the number of channels participating in mixing; and mixing the participating audio data with the selected algorithm. In this way, the present application can support mixing a large number of audio channels and improve the mixing quality.

Description

Sound mixing method, device and storage medium
Technical Field
The present application relates to the field of speech processing, and in particular, to a sound mixing method, apparatus, and storage medium.
Background
In a multimedia interactive system, when several participants speak at the same time, every participant needs to be able to hear the voice of every speaker. This requires the system to support real-time audio mixing, with certain requirements on both the audio quality and the real-time performance of the mixing.
A common mixing method is to directly superimpose the audio data of the different channels. This can cause overflow when the superimposed samples exceed the representable range, and some of the audio data may not even be voice, which degrades the audio quality, especially when the number of channels is large. Technologies such as Voice Activity Detection (VAD) can be used to screen the voice out of the audio data before superimposing it, but simple screening tends to have a high error rate, which harms the audio quality, while overly complex screening introduces a large algorithmic delay, causing the output volume to fluctuate abruptly or even become intermittent.
Disclosure of Invention
The application provides a sound mixing method, a sound mixing device, and a storage medium, which address the unsatisfactory mixing quality of the prior art.
To solve the above technical problem, one technical solution adopted by the present application is to provide a sound mixing method comprising: acquiring multiple channels of audio data; determining, for each channel of audio data, whether it participates in mixing, using at least a Gaussian-model-based voice activity detection algorithm together with historical decision results; selecting a mixing algorithm according to the number of channels participating in mixing; and mixing the participating audio data with the selected algorithm.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a mixing apparatus comprising a processor for executing instructions to implement the mixing method described above.
In order to solve the above technical problem, the present application adopts another technical solution that: there is provided a storage medium storing instructions that, when executed, implement the aforementioned mixing method.
The beneficial effects of the present application are as follows: multiple channels of audio data are acquired; whether each channel participates in mixing is determined using at least a Gaussian-model-based voice activity detection algorithm together with historical decision results; a mixing algorithm is selected according to the number of channels participating in mixing; and the participating audio data are mixed with the selected algorithm. Optimizing the screening of the data to be mixed yields low delay and high accuracy; meanwhile, selecting the mixing algorithm according to the number of participating channels increases the number of channels that can be mixed, so a good mixing result is achieved whether the channel count is large or small.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of the mixing method of the present invention;
FIG. 2 is a schematic view of the detailed process of S2 in FIG. 1;
FIG. 3 is a schematic diagram of the detailed process of S21 in FIG. 2;
FIG. 4 is a flow chart of a second embodiment of the mixing method of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of an audio mixing apparatus according to the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a storage medium according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second", and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first", "second", or "third" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two or three, unless explicitly limited otherwise. All directional indications (such as up, down, left, right, front, rear, etc.) in the embodiments of the present application are only used to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may optionally include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Fig. 1 is a schematic flow chart of a first embodiment of the mixing method of the present invention. It should be noted that, provided substantially the same result is obtained, the method is not limited to the flow order shown in fig. 1. As shown in fig. 1, the present embodiment includes:
s1: and acquiring multi-channel audio data.
The audio signal is collected from a sound source (which may be a participant of the multimedia interactive system). To facilitate the signal, the audio signal is divided into a plurality of frames. The audio data may be a current frame of an audio signal (referred to as an audio current frame for short), or may be a denoised audio current frame.
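As a minimal sketch of this framing step (the 16 kHz sample rate and 20 ms frame length below are illustrative assumptions, not values fixed by the patent):

```python
import numpy as np

def split_into_frames(signal: np.ndarray, sample_rate: int = 16000,
                      frame_ms: int = 20) -> np.ndarray:
    """Split a 1-D audio signal into consecutive fixed-length frames.

    The rate and frame length are placeholder choices; any trailing
    samples that do not fill a whole frame are dropped for simplicity.
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame (320 here)
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)
```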
S2: and respectively judging whether each path of audio data participates in sound mixing at least by utilizing a Gaussian model-based voice activity detection algorithm and a historical judgment result.
As shown in fig. 2, in an embodiment of the present invention, S2 may specifically include:
s21: and judging whether the audio data is voice by using a voice activity detection algorithm based on a Gaussian model to obtain a current judgment result.
Whether speech exists is characterized by probability by adopting a Gaussian mixture model, namely whether the audio data is speech is evaluated based on likelihood ratios calculated by using characteristic values rather than the characteristic values.
As shown in fig. 3, S21 may specifically include:
s211: audio data is divided into a plurality of sub-bands by frequency.
The audio data may be converted to the frequency domain and then divided into a plurality of sub-bands in the frequency domain. Alternatively, the audio data may be input to a plurality of bandpass filters having different passbands to obtain a plurality of sub-bands.
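A sketch of the first option, frequency-domain splitting; the FFT-based transform and the choice of six equal-width bands are assumptions for illustration, not prescribed by the patent:

```python
import numpy as np

def split_subbands(frame: np.ndarray, n_bands: int = 6) -> list:
    """Transform one frame to the frequency domain and cut the magnitude
    spectrum into equal-width sub-bands (an illustrative partition; the
    band edges are not specified in the text)."""
    spectrum = np.abs(np.fft.rfft(frame))
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    return [spectrum[edges[i]:edges[i + 1]] for i in range(n_bands)]
```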
S212: the energy-entropy ratio of each sub-band and the likelihood ratio that the sub-band is speech are calculated separately.
The energy-entropy ratio of a sub-band refers to the ratio of the energy of the sub-band to the spectral entropy. Features of the subbands may be extracted and likelihood ratios for the subbands to speech may be calculated based on the features. The features may include at least one of energy, frequency domain, cepstrum, harmonics, long-term information, and the like, without limitation.
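A hedged sketch of both quantities. The energy-entropy ratio follows the definition above; for the likelihood ratio, a single speech Gaussian and a single noise Gaussian over the log sub-band energy stand in for the patent's Gaussian model, and all model parameters below are placeholders:

```python
import numpy as np

EPS = 1e-12

def energy_entropy_ratio(band: np.ndarray) -> float:
    """Ratio of a sub-band's energy to its spectral entropy."""
    power = band.astype(np.float64) ** 2
    energy = power.sum()
    p = power / (energy + EPS)                  # normalized spectral distribution
    entropy = -(p * np.log(p + EPS)).sum()
    return float(energy / (entropy + EPS))

def speech_likelihood_ratio(band: np.ndarray,
                            mu_s: float = 4.0, var_s: float = 2.0,
                            mu_n: float = 1.0, var_n: float = 1.0) -> float:
    """Likelihood ratio p(x | speech) / p(x | noise) for one sub-band,
    using log sub-band energy as the feature x and one Gaussian per class.
    Real systems would train or adapt mu/var online; these are placeholders."""
    b = band.astype(np.float64)
    x = np.log(b.dot(b) + EPS)                  # log sub-band energy feature
    log_ratio = (0.5 * np.log(var_n / var_s)
                 - (x - mu_s) ** 2 / (2 * var_s)
                 + (x - mu_n) ** 2 / (2 * var_n))
    return float(np.exp(log_ratio))
```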
S213: and judging whether the audio data is voice or not by combining the likelihood ratio and the energy-entropy ratio of all the sub-frequency bands to obtain a current judgment result.
Specifically, the likelihood ratio of each sub-band may be compared with a first threshold, the energy-entropy ratio may be compared with a third threshold, and then the number of sub-bands whose likelihood ratio is greater than the first threshold and the number of sub-bands whose energy-entropy ratio is greater than the third threshold may be counted.
The decision method can have two kinds, one of them can be: if the number of the sub-frequency bands with the likelihood ratio larger than the first threshold is larger than the second threshold and the number of the sub-frequency bands with the energy entropy ratio larger than the third threshold is larger than the fourth threshold, the current judgment result is voice, and if not, the current judgment result is non-voice.
The other method can be as follows: if the number of the sub-frequency bands with the likelihood ratio larger than the first threshold is larger than the second threshold or the number of the sub-frequency bands with the energy-entropy ratio larger than the third threshold is larger than the fourth threshold, the current judgment result is voice, otherwise, the current judgment result is non-voice.
The first threshold and/or the third threshold may be fixed or variable. For example, they may be selected from several preset values, or adjusted dynamically, depending on the environment each participant is in.
Alternatively, a statistic of the likelihood ratios and/or energy-entropy ratios over all sub-bands, such as a mean or weighted mean, may be computed and compared with a preset threshold to obtain the current decision result.
Spectral entropy depends only on the randomness of the energy distribution, not on its magnitude, so it distinguishes voice from non-voice well while avoiding heavy computation; meanwhile, the energy term suppresses interference from burst noise. Combining the energy-entropy ratio with the likelihood ratio gives high computational efficiency, low delay, good robustness to noise, and high decision accuracy.
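Putting S211-S213 together, a sketch of the first ("and") decision rule, reusing the helper sketches above; all four threshold values are illustrative placeholders:

```python
def is_speech(bands: list,
              lr_thresh: float = 1.5, lr_count: int = 2,
              eer_thresh: float = 50.0, eer_count: int = 2) -> bool:
    """Current decision for one frame: count the sub-bands whose likelihood
    ratio and energy-entropy ratio clear their thresholds (the first/third
    thresholds), then compare the counts against the second/fourth
    thresholds. The numeric values here are assumptions for illustration."""
    n_lr = sum(speech_likelihood_ratio(b) > lr_thresh for b in bands)
    n_eer = sum(energy_entropy_ratio(b) > eer_thresh for b in bands)
    return n_lr > lr_count and n_eer > eer_count
```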
S22: and judging whether the audio data participates in sound mixing or not by combining the current judgment result and the historical judgment result.
The historical decision result refers to the decision result of the previous frame. In general, the historical decision result may include a previous decision result, which is the most adjacent one of the historical decision results to the current decision result. Assuming that the current frame is the nth frame, the previous decision result is the decision result of the (n-1) th frame, the previous second decision result is the decision result of the (n-2) th frame, and so on.
When the historical decision results include only the previous decision result, the fade-in and fade-out may be implemented as follows:
If the current and previous decision results are both non-voice, the audio data does not participate in mixing.
If the current decision result is non-voice and the previous one is voice, the audio data is multiplied in the time domain by a first coefficient group and participates in mixing; the first coefficient group takes values in the range 0 to 1, and the coefficients in the group decrease in sequence. "In sequence" here means ordered from front to back in the time domain.
If the current and previous decision results are both voice, the audio data participates in mixing.
If the current decision result is voice and the previous one is non-voice, the audio data is multiplied in the time domain by a second coefficient group and participates in mixing; the second coefficient group takes values in the range 0 to 1, and the coefficients in the group increase in sequence.
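The four cases above can be collected into a single routine; a sketch assuming linear ramps for the two coefficient groups (the patent only constrains their range and monotonicity):

```python
import numpy as np

def apply_fade(frame: np.ndarray, cur_speech: bool, prev_speech: bool):
    """Return the frame to be mixed (possibly faded), or None if the
    channel sits out this round. Linear ramps stand in for the first
    (fade-out) and second (fade-in) coefficient groups."""
    if not cur_speech and not prev_speech:
        return None                                  # both non-voice: skip
    if cur_speech and prev_speech:
        return frame                                 # both voice: unchanged
    n = len(frame)
    if prev_speech:                                  # voice -> non-voice
        coeffs = np.linspace(1.0, 0.0, n)            # first group, decreasing
    else:                                            # non-voice -> voice
        coeffs = np.linspace(0.0, 1.0, n)            # second group, increasing
    return frame * coeffs
```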
When the historical decision results include the previous and second-previous decision results (the two historical decision results closest to the current one, as defined above), the fade-in and fade-out may be implemented as follows:
If the current, previous, and second-previous decision results are all non-voice, the audio data does not participate in mixing.
If the current and previous decision results are non-voice and the second-previous one is voice, the audio data is multiplied in the time domain by a third coefficient group and participates in mixing; the third coefficient group takes values in the range 0 to 0.5, and the coefficients in the group decrease in sequence.
If the current decision result is non-voice and both the previous and second-previous ones are voice, the audio data is multiplied in the time domain by a fourth coefficient group and participates in mixing; the fourth coefficient group takes values in the range 0.5 to 1, and the coefficients in the group decrease in sequence.
If the current, previous, and second-previous decision results are all voice, the audio data participates in mixing.
If the current and previous decision results are voice and the second-previous one is non-voice, the audio data is multiplied in the time domain by a fifth coefficient group and participates in mixing; the fifth coefficient group takes values in the range 0.5 to 1, and the coefficients in the group increase in sequence.
If the current decision result is voice and both the previous and second-previous ones are non-voice, the audio data is multiplied in the time domain by a sixth coefficient group and participates in mixing; the sixth coefficient group takes values in the range 0 to 0.5, and the coefficients in the group increase in sequence.
Alternatively, the coefficients within a coefficient group may be partially or fully equal. If no fade is required, the audio data may simply be left unmodified.
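The six three-frame cases reduce to a small lookup table; a sketch, again with linear ramps inside each stated range (the interpolation shape is an assumption):

```python
import numpy as np

# (current, previous, second-previous) speech flags -> (start, end) of the ramp.
FADE_RANGES = {
    (False, False, True):  (0.5, 0.0),   # third group: 0..0.5, decreasing
    (False, True,  True):  (1.0, 0.5),   # fourth group: 0.5..1, decreasing
    (True,  True,  False): (0.5, 1.0),   # fifth group: 0.5..1, increasing
    (True,  False, False): (0.0, 0.5),   # sixth group: 0..0.5, increasing
}

def fade_three_frame(frame: np.ndarray, cur: bool, prev: bool, prev2: bool):
    if not (cur or prev or prev2):
        return None                       # all non-voice: no mixing
    if cur and prev and prev2:
        return frame                      # all voice: unchanged
    ramp = FADE_RANGES.get((cur, prev, prev2))
    if ramp is None:                      # pattern not enumerated in the text;
        return frame if cur else None     # fall back to the current decision
    start, end = ramp
    return frame * np.linspace(start, end, len(frame))
```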
In addition, the importance of the audio data may also be considered when deciding whether it participates in mixing. For example, if a channel of audio data comes from a designated channel, that channel participates in mixing directly, regardless of the current and historical decision results. The designated channel may be the channel of an important participant, whose importance, or priority, is higher than that of non-designated channels; for example, the judge's channel in a court system, or the teacher's channel in a teaching system.
S3: a mixing algorithm is selected using the amount of audio data participating in mixing.
If the quantity of the audio data participating in sound mixing is larger than a preset threshold value, selecting a first sound mixing algorithm to perform sound mixing; otherwise, a second mixing algorithm is selected. The first mixing algorithm is more complex than the second mixing algorithm and the effect of the first mixing algorithm is better than that of the second mixing algorithm. The preset threshold may be decided according to the performance of the first mixing algorithm. Under the condition that the number of audio data participating in sound mixing is large, the quality of the sound mixing data obtained by adopting the first sound mixing algorithm is superior to that of the sound mixing data obtained by adopting the second algorithm.
For example, the preset threshold may be 5, the first mixing algorithm is an adaptive weighting method, and the second mixing algorithm is a volume expansion clipping method. When the amount of audio data participating in audio mixing is less than or equal to 5, the probability of overflow is low, and a simple and efficient capacity expansion limiting method can be adopted; when the number of audio data participating in audio mixing is greater than 5, a more complex but more effective adaptive weighting method is adopted to ensure the audio quality. In other embodiments, other algorithms may be selected as the first/second mixing algorithm, such as linear superposition.
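A sketch of this selection under one common reading of the two algorithms, assuming 16-bit PCM frames: "expansion clipping" sums in a wider integer type and clamps, while "adaptive weighting" scales the sum just enough to avoid overflow. The formulas are illustrative, not the patent's exact methods:

```python
import numpy as np

INT16_MAX = 32767

def mix_expand_clip(frames: list) -> np.ndarray:
    """Second algorithm: sum in a wide dtype, then clamp to the 16-bit range."""
    total = np.sum([f.astype(np.int64) for f in frames], axis=0)
    return np.clip(total, -INT16_MAX - 1, INT16_MAX).astype(np.int16)

def mix_adaptive_weight(frames: list) -> np.ndarray:
    """First algorithm: one simple form of adaptive weighting, scaling the
    raw sum down only where it would otherwise overflow."""
    total = np.sum([f.astype(np.float64) for f in frames], axis=0)
    peak = float(np.max(np.abs(total)))
    weight = min(1.0, INT16_MAX / peak) if peak > 0 else 1.0
    return (total * weight).astype(np.int16)

def mix(frames: list, threshold: int = 5) -> np.ndarray:
    """S3/S4: pick the algorithm by the number of participating channels."""
    if len(frames) > threshold:
        return mix_adaptive_weight(frames)
    return mix_expand_clip(frames)
```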
S4: and mixing the audio data participating in the mixing by adopting the selected mixing algorithm.
And mixing the sound to obtain mixed sound data.
The mixing method provided by the present embodiment may be performed for each frame of a multi-channel audio signal, thereby continuously outputting mixed data.
By implementing this embodiment, the screening of the data to be mixed is optimized, giving low delay and high accuracy; meanwhile, the mixing algorithm is selected according to the number of channels participating in mixing, increasing the number of channels that can be mixed, so a good mixing result is achieved whether the channel count is large or small.
Fig. 4 is a flow chart of a second embodiment of the mixing method of the present invention. It should be noted that, provided substantially the same result is obtained, the method is not limited to the flow order shown in fig. 4. The second embodiment is a specific extension of the first embodiment, and common parts are not repeated. As shown in fig. 4, the present embodiment includes:
s110: and controlling the state of a sound source acquisition switch according to the current scene.
If the multimedia interactive system only allows certain specific participants to speak at certain times, for example in a court, only judges/lawyers/witnesses are allowed to speak at certain times. The state of the sound source collection switch can be controlled according to the current scene, and only the sound of the participants who are allowed to speak is collected. Optionally, a request for turning on a sound source collecting switch may be received, and whether to accept the request may be determined according to the current scene.
S120: receiving a plurality of audio current frames.
S130: and denoising the audio frames of the multi-channel audio.
S140: it is determined whether the audio data is from a specified channel.
If yes, jump to S150; otherwise, jump to S160.
S150: the audio data participates in the audio mixing.
S160: and judging whether the audio data is voice by using a voice activity detection algorithm based on a Gaussian model to obtain a current judgment result.
S170: and judging whether the audio data participates in sound mixing or not by combining the current judgment result and the historical judgment result.
S140-S170 are performed for each channel of audio data, after which the flow jumps to S180.
In other embodiments, the order of S140-S150 and S160-S170 may be swapped; that is, after the current decision result is obtained, it is determined whether the audio data comes from the designated channel, and if so, the audio data participates in mixing directly, without consulting the historical decision results.
S180: the number of audio data participating in the audio mixing is compared with a preset threshold.
If the number of the audio data participating in the audio mixing is larger than a preset threshold value, jumping to S190; otherwise, jumping to S200.
S190: and selecting a first mixing algorithm to mix the audio data participating in the mixing.
S200: and selecting a second mixing algorithm to mix the audio data participating in the mixing.
The first mixing algorithm is more complex than the second mixing algorithm and the effect of the first mixing algorithm is better than that of the second mixing algorithm.
S210: and outputting the mixed sound data after automatic gain control.
The loudness of the mixed sound data is adaptively adjusted by Automatic Gain Control (AGC), so that the volume of the output audio is stabilized, and the condition that the volume of the output audio is not large or small is improved.
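A minimal AGC sketch that nudges a persistent gain toward a target RMS level; the target level and adaptation step are placeholders, and a production AGC would add attack/release handling beyond this:

```python
import numpy as np

def agc(frame: np.ndarray, state: dict,
        target_rms: float = 3000.0, step: float = 0.1) -> np.ndarray:
    """Adapt a persistent gain so the output RMS drifts toward target_rms,
    then apply it with clipping. target_rms and step are illustrative."""
    gain = state.get("gain", 1.0)
    rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
    if rms > 0:
        gain += step * (target_rms / rms - gain)   # smooth first-order update
    state["gain"] = gain
    return np.clip(frame * gain, -32768, 32767).astype(np.int16)
```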
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a mixing apparatus according to the present invention. As shown in fig. 5, the mixing apparatus 30 includes a processor 31.
The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip with signal processing capability. It may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The mixing apparatus 30 may further include a memory (not shown) for storing instructions and data required for the processor 31 to operate.
The processor 31 is configured to execute instructions to implement the method provided by any of the above-described embodiments of the mixing method of the present invention and any non-conflicting combinations.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium 40 of an embodiment of the present invention stores instructions that, when executed, implement the method provided by any embodiment of the mixing method of the present invention or any non-conflicting combination thereof. The instructions may form a program file stored in the storage medium as a software product, enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as a computer, server, mobile phone, or tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.

Claims (14)

1. A mixing method, comprising:
acquiring multi-channel audio data;
respectively judging whether each path of audio data participates in sound mixing by using at least a Gaussian model-based voice activity detection algorithm and a historical judgment result, wherein the historical judgment result is a judgment result of a previous frame;
selecting a mixing algorithm by using the number of the audio data participating in mixing;
and mixing the audio data participating in the audio mixing by adopting a selected audio mixing algorithm.
2. The method of claim 1,
the method for acquiring the multi-channel audio data comprises the following steps:
and denoising the audio frames of the multi-channel audio.
3. The method of claim 1,
the respectively judging whether each path of audio data participates in sound mixing by at least utilizing a Gaussian model-based voice activity detection algorithm and a historical judgment result comprises the following steps:
judging whether the audio data is voice by using the voice activity detection algorithm based on the Gaussian model to obtain a current judgment result;
and judging whether the audio data participates in sound mixing or not by combining the current judgment result and the historical judgment result.
4. The method of claim 3,
the judging whether the audio data is voice by using the voice activity detection algorithm based on the Gaussian model to obtain the current judgment result comprises the following steps:
dividing the audio data into a plurality of sub-bands according to frequency;
respectively calculating the energy-entropy ratio of each sub-band and the likelihood ratio of the sub-band being speech;
and judging whether the audio data is voice or not by combining the likelihood ratios and the energy-entropy ratios of all the sub-frequency bands to obtain the current judgment result.
5. The method of claim 4,
the determining whether the audio data is voice to obtain the current judgment result by combining the likelihood ratios and the energy-entropy ratios of all the sub-bands comprises:
if the number of the sub-bands with the likelihood ratio larger than a first threshold is larger than a second threshold and the number of the sub-bands with the energy-entropy ratio larger than a third threshold is larger than a fourth threshold, the current judgment result is voice, otherwise, the current judgment result is non-voice; or
if the number of the sub-bands with the likelihood ratio larger than the first threshold is larger than the second threshold or the number of the sub-bands with the energy-entropy ratio larger than the third threshold is larger than the fourth threshold, the current judgment result is voice, otherwise, the current judgment result is non-voice.
6. The method of claim 3,
the historical judgment result includes a previous judgment result, the previous judgment result being the historical judgment result closest to the current judgment result, and the judging whether the audio data participates in sound mixing by combining the current judgment result and the historical judgment result comprises:
if the current judgment result and the previous judgment result are both non-voice, the audio data does not participate in sound mixing;
if the current judgment result is non-voice and the previous judgment result is voice, multiplying the audio data by a first coefficient group in a time domain and participating in sound mixing, wherein the value range of the first coefficient group is 0-1, and the coefficients in the group become smaller in sequence;
if the current judgment result and the previous judgment result are both voices, the audio data participates in sound mixing;
and if the current judgment result is voice and the previous judgment result is non-voice, multiplying the audio data by a second coefficient group in the time domain and participating in sound mixing, wherein the value range of the second coefficient group is 0-1, and the coefficients in the group are sequentially increased.
7. The method of claim 3,
the historical judgment result includes a previous judgment result and a second-previous judgment result, the previous and second-previous judgment results being the two historical judgment results closest to the current judgment result, and the judging whether the audio data participates in sound mixing by combining the current judgment result and the historical judgment result comprises:
if the current, previous, and second-previous judgment results are all non-voice, the audio data does not participate in sound mixing;
if the current and previous judgment results are non-voice and the second-previous judgment result is voice, multiplying the audio data by a third coefficient group in the time domain and participating in sound mixing, wherein the value range of the third coefficient group is 0 to 0.5 and the coefficients in the group decrease in sequence;
if the current judgment result is non-voice and the previous and second-previous judgment results are voice, multiplying the audio data by a fourth coefficient group in the time domain and participating in sound mixing, wherein the value range of the fourth coefficient group is 0.5 to 1 and the coefficients in the group decrease in sequence;
if the current, previous, and second-previous judgment results are all voice, the audio data participates in sound mixing;
if the current and previous judgment results are voice and the second-previous judgment result is non-voice, multiplying the audio data by a fifth coefficient group in the time domain and participating in sound mixing, wherein the value range of the fifth coefficient group is 0.5 to 1 and the coefficients in the group increase in sequence;
and if the current judgment result is voice and the previous and second-previous judgment results are non-voice, multiplying the audio data by a sixth coefficient group in the time domain and participating in sound mixing, wherein the value range of the sixth coefficient group is 0 to 0.5 and the coefficients in the group increase in sequence.
8. The method of claim 1,
the step of respectively judging whether each path of audio data participates in sound mixing by at least utilizing a Gaussian model-based voice activity detection algorithm and a historical judgment result further comprises the following steps:
and if the audio data comes from the appointed channel, the audio data participates in sound mixing.
9. The method of claim 1,
the selecting a mixing algorithm by using the number of the audio data participating in mixing comprises:
if the number of the audio data participating in sound mixing is larger than a preset threshold value, selecting a first sound mixing algorithm to perform sound mixing; otherwise, a second mixing algorithm is selected, wherein the first mixing algorithm is more complex than the second mixing algorithm and the effect of the first mixing algorithm is better than that of the second mixing algorithm.
10. The method of claim 9,
the preset threshold is 5, the first sound mixing algorithm is an adaptive weighting method, and the second sound mixing algorithm is an expansion-clipping method.
11. The method of claim 1,
the acquiring the multi-channel audio data further comprises:
and controlling the state of a sound source acquisition switch according to the current scene.
12. The method of claim 1,
after the audio data participating in audio mixing is mixed by adopting the selected audio mixing algorithm, the method further comprises the following steps:
and outputting the mixed sound data after automatic gain control.
13. An audio mixing apparatus, characterized in that the audio mixing apparatus comprises a processor,
the processor is configured to execute instructions to implement the method of any one of claims 1-12.
14. A storage medium storing instructions that, when executed, implement the method of any one of claims 1-12.
CN201910990310.5A 2019-10-17 2019-10-17 Sound mixing method, device and storage medium Active CN110675885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910990310.5A CN110675885B (en) 2019-10-17 2019-10-17 Sound mixing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910990310.5A CN110675885B (en) 2019-10-17 2019-10-17 Sound mixing method, device and storage medium

Publications (2)

Publication Number / Publication Date
CN110675885A (en) 2020-01-10
CN110675885B (en) 2022-03-22

Family

ID=69082877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910990310.5A Active CN110675885B (en) 2019-10-17 2019-10-17 Sound mixing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110675885B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112995425B (en) * 2021-05-13 2021-09-07 北京百瑞互联技术有限公司 Equal loudness sound mixing method and device
CN112951251B (en) * 2021-05-13 2021-08-06 北京百瑞互联技术有限公司 LC3 audio mixing method, device and storage medium
CN113838476B (en) * 2021-09-24 2023-12-01 世邦通信股份有限公司 Noise estimation method and device for noisy speech
CN114627886B (en) * 2022-03-10 2024-08-16 军事科学院系统工程研究院网络信息研究所 Conference voice processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616665A (en) * 2015-01-30 2015-05-13 深圳市云之讯网络技术有限公司 Voice similarity based sound mixing method
CN109510905A (en) * 2018-12-06 2019-03-22 中通天鸿(北京)通信科技股份有限公司 The sound mixing method and system of multi-path voice

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4334591B2 (en) * 2007-12-27 2009-09-30 株式会社東芝 Multimedia data playback device
CN101236742B (en) * 2008-03-03 2011-08-10 中兴通讯股份有限公司 Music/ non-music real-time detection method and device
CN101853661B (en) * 2010-05-14 2012-05-30 中国科学院声学研究所 Noise spectrum estimation and voice activity detection method based on unsupervised learning
US8989058B2 (en) * 2011-09-28 2015-03-24 Marvell World Trade Ltd. Conference mixing using turbo-VAD
US9516446B2 (en) * 2012-07-20 2016-12-06 Qualcomm Incorporated Scalable downmix design for object-based surround codec with cluster analysis by synthesis
CN103680508B (en) * 2013-12-09 2018-03-16 Tcl集团股份有限公司 The dynamic allocation method and dynamically distributes device of multistage audio mixing
CN103745724A (en) * 2014-01-13 2014-04-23 电子科技大学 Time-frequency mixed down-mixing method applied to multi-channel audio decoding
CN107426651B (en) * 2017-08-10 2018-07-06 长沙世邦通信技术有限公司 The sound mixing method and device of multichannel

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616665A (en) * 2015-01-30 2015-05-13 深圳市云之讯网络技术有限公司 Voice similarity based sound mixing method
CN109510905A (en) * 2018-12-06 2019-03-22 中通天鸿(北京)通信科技股份有限公司 The sound mixing method and system of multi-path voice

Also Published As

Publication number Publication date
CN110675885A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN110675885B (en) Sound mixing method, device and storage medium
EP3338461B1 (en) Microphone array signal processing system
KR101597375B1 (en) System for adjusting perceived loudness of audio signals
JP5259759B2 (en) Method and apparatus for maintaining speech aurality in multi-channel audio with minimal impact on surround experience
EP3301675B1 (en) Parameter prediction device and parameter prediction method for acoustic signal processing
CN104520925B (en) The percentile of noise reduction gain filters
ES2347760T3 (en) NOISE REDUCTION PROCEDURE AND DEVICE.
CN112639968B (en) Method and apparatus for controlling enhancement of low bit rate encoded audio
US10553236B1 (en) Multichannel noise cancellation using frequency domain spectrum masking
US10755728B1 (en) Multichannel noise cancellation using frequency domain spectrum masking
Palomäki et al. Techniques for handling convolutional distortion withmissing data'automatic speech recognition
EP3275208B1 (en) Sub-band mixing of multiple microphones
CN112185410B (en) Audio processing method and device
CN104981870B (en) Sound enhancing devices
CN115280414B (en) Automatic gain control based on machine learning level estimation of desired signal
CN112151055B (en) Audio processing method and device
CN117693791A (en) Speech enhancement
US8165872B2 (en) Method and system for improving speech quality
CN114067822A (en) Call audio processing method and device, computer equipment and storage medium
EP2779161B1 (en) Spectral and spatial modification of noise captured during teleconferencing
CN118435278A (en) Apparatus, method and computer program for providing spatial audio
US20230360662A1 (en) Method and device for processing a binaural recording
CN113299299B (en) Audio processing apparatus, method, and computer-readable storage medium
CN110677208B (en) Sound mixing method and system for conference system
CN113053399B (en) Multipath audio mixing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant