CN114944162A

CN114944162A - Audio processing method and device, electronic equipment and storage medium

Info

Publication number: CN114944162A
Application number: CN202210450469.XA
Authority: CN
Inventors: 周亮
Original assignee: Beijing Eswin Computing Technology Co Ltd; Haining Eswin IC Design Co Ltd
Current assignee: Beijing Eswin Computing Technology Co Ltd; Haining Eswin IC Design Co Ltd
Priority date: 2022-04-24
Filing date: 2022-04-24
Publication date: 2022-08-26

Abstract

The application provides an audio processing method, an audio processing apparatus, an electronic device and a storage medium. The audio processing method comprises: extracting an original center channel from audio having multiple channels; determining the speech existence probability of each segment in the original center channel; performing noise reduction processing on the original center channel to obtain a noise-reduced center channel; weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, wherein the higher the speech existence probability of a segment is, the greater the weight given to the corresponding segment of the noise-reduced center channel during mixing; and mixing the weighted center channel with the channels of the audio other than the original center channel to obtain the mixed audio. The human voice in the center channel can thus be accurately enhanced, and the definition and intelligibility of the content in the audio can be improved to a greater extent.

Description

Audio processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to an audio processing method and apparatus, an electronic device, and a storage medium.

Background

With the development of science and technology, more and more channels are available for people to receive external information, one of which is watching television programs. Owing to differences in recording technology during production, the human voice and the background sound in the audio of a television program are often mixed and difficult to distinguish during playback, which impairs the viewing experience. Moreover, when people watch television programs in noisy environments, it is difficult for them to hear the human voice in the played audio. Therefore, there is a need to enhance the human voice in the audio of television programs.

At present, human voice in audio is mainly enhanced as follows. Since audio has multiple channels, a center channel is first extracted from the multiple channels. Then, a Voice Activity Detection (VAD) algorithm is used to judge whether voice exists in the center channel. If so, noise reduction processing is performed on the center channel by using Dynamic Range Control (DRC), a filter, or the like; the noise-reduced center channel is then mixed with the other channels of the multi-channel audio, and the mixed audio is output for listening. If not, no noise reduction processing is performed on the center channel, and the original audio is played directly.

However, when the above method is used to enhance the voice in the audio, the enhancement effect of the voice is not very ideal, and the definition and intelligibility of the speaking content in the audio are not high.

Disclosure of Invention

An object of the embodiments of the present application is to provide an audio processing method, an audio processing apparatus, an electronic device, and a storage medium, which further enhance human voice in audio and improve definition and intelligibility of speech content in audio.

In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:

A first aspect of the present application provides an audio processing method, including: extracting an original center channel from audio having multiple channels; determining the speech existence probability of each segment in the original center channel; performing noise reduction processing on the original center channel to obtain a noise-reduced center channel; weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, wherein the higher the speech existence probability of a segment is, the greater the weight given to the corresponding segment of the noise-reduced center channel during mixing; and mixing the weighted center channel with the channels of the audio other than the original center channel to obtain the mixed audio.

A second aspect of the present application provides an audio processing apparatus, the apparatus comprising: a signal separation unit for extracting an original center channel from audio having multiple channels; a voice activity detection unit for determining the speech existence probability of each segment in the original center channel; a voice noise reduction unit for performing noise reduction processing on the original center channel to obtain a noise-reduced center channel; a center channel processing unit for weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, wherein the higher the speech existence probability of a segment is, the greater the weight given to the corresponding segment of the noise-reduced center channel during mixing; and a mixing unit for mixing the weighted center channel with the channels of the audio other than the original center channel to obtain the mixed audio.

A third aspect of the present application provides an electronic device comprising: a processor, memory, and a bus; wherein the processor and the memory communicate with each other via the bus; the processor is for invoking program instructions in the memory for performing the method of the first aspect.

A fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, where the program, when executed, controls an apparatus in which the storage medium is located to perform the method of the first aspect.

Compared with the prior art, the audio processing method provided in the first aspect of the present application extracts the original center channel from the audio, then determines the existence probability of the voice of each segment in the original center channel, and performs noise reduction processing on the original center channel to obtain the noise-reduced center channel, then weights the original center channel and the noise-reduced center channel based on the existence probability of the voice of each segment in the original center channel, where the higher the existence probability of the voice of the corresponding segment is, the greater the weight occupied by the corresponding segment in the noise-reduced center channel is, and finally performs audio mixing on the weighted center channel and other channels except the original center channel in the audio to obtain the audio after audio mixing. Therefore, the human voice in the center sound channel can be accurately enhanced, and the definition and intelligibility of the content in the audio can be improved to a greater extent.

The audio processing apparatus provided by the second aspect, the electronic device provided by the third aspect, and the computer-readable storage medium provided by the fourth aspect of the present application have the same or similar beneficial effects as the audio processing method provided by the first aspect.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:

FIG. 1 is a first flowchart illustrating an audio processing method according to an embodiment of the present application;

FIG. 2 is a second flowchart illustrating an audio processing method according to an embodiment of the present application;

FIG. 3 is a third flowchart illustrating an audio processing method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.

At present, to enhance human voice in audio, a voice activity detection (VAD) algorithm is used to determine whether voice exists in the center channel of the audio; when voice is determined to exist, filtering is performed on the entire center channel to highlight the voice. The filtered center channel is then mixed with the other channels in the audio, which completes the human voice enhancement. However, the enhancement effect is still not ideal. When the enhanced audio is played, the human voice and the background sound remain difficult to distinguish and the human voice is hard to hear clearly, so the definition and intelligibility of the content in the audio are improved only to a limited degree.

After intensive research, the inventor found that the main reason the current approach does not enhance the human voice satisfactorily is that the center channel signal is not processed in a targeted way: the signal of each segment of the center channel is not analyzed separately and then processed differently according to the analysis result.

In view of this, embodiments of the present application provide an audio processing method, an apparatus, an electronic device, and a storage medium, in which, for a center channel in audio, the speech existence probability of each segment in the center channel is first determined. In parallel, noise reduction processing is performed on the whole center channel. When the final center channel is formed, the noise-reduced center channel is used for segments with a high speech existence probability, and the original center channel, without noise reduction, is used for segments with a low speech existence probability. Finally, the resulting center channel is mixed with the other channels in the audio. By combining segments processed in different ways according to the speech existence probability of each segment, the human voice in the center channel can be effectively enhanced, so that the definition and intelligibility of the content in the audio are improved.

Next, the audio processing method provided in the embodiment of the present application will be described in detail first.

Fig. 1 is a first schematic flowchart of an audio processing method in an embodiment of the present application, and referring to fig. 1, the method may include:

S101: an original center channel is extracted from audio having multiple channels.

A piece of audio often contains multiple channels. For example, a recording of a teacher's lecture may contain a channel for the teacher's speech, a channel for the sound of the teacher writing on the blackboard, a channel for the sound of students turning pages, and so on. What people want to listen to in this audio is what the teacher says. Thus, the teacher's speech is the center channel of the audio, while the sound of writing on the blackboard and the sound of students turning pages are the other two channels in the audio.

In order to achieve more accurate processing of human voice in audio, it is necessary to extract a center channel, which may also be referred to as an original center channel, from audio of multiple channels.

Generally, a piece of audio has only one center channel, namely the channel whose sound source is closest to the audio collecting device. When two channels are both emitted from positions closest to the audio collecting device, a prompt may be generated asking the user to select one of them as the center channel. Alternatively, the center channel may be determined automatically according to a preset rule, for example by taking the louder channel as the center channel. The specific way of extracting the center channel from the audio is not limited herein.

S102: the probability of speech presence for each segment in the original center channel is determined.

After the original center channel is extracted from the audio, since the original center channel is itself an audio signal, human voice is present at some times (someone is speaking) and absent at others (no one is speaking). Therefore, in order to process the center channel more accurately, it is necessary to determine the speech existence probability of each segment in the original center channel, that is, the probability that a person is speaking in the corresponding segment.

In a specific implementation, the voice presence probability of each segment in the original center channel may be calculated by a voice activity detection VAD algorithm. For example: in the original center channel, the corresponding total duration is 3min, and the duration of every 1min is one segment, and the original center channel can be divided into 3 voice segments. The VAD algorithm can calculate that the voice existence probability in the segment of 0-1min is a, the voice existence probability in the segment of 1-2min is b, and the voice existence probability in the segment of 2-3min is c. The speech existence probabilities a, b and c may be the same or different, and need to be obtained based on actual calculation results. In addition, other algorithms may also be used to calculate the speech existence probability of each segment in the original center channel, and for a specific algorithm, the calculation is not limited herein as long as the speech existence probability can be calculated.
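The segment-wise calculation described above can be sketched as follows. The text does not specify which VAD algorithm is used, so this example substitutes a deliberately simple energy-based score as a stand-in; the function names, the `noise_floor` value, the 6 dB offset and the logistic mapping are all assumptions made for illustration and are not taken from the application.

```python
import math

def split_segments(samples, segment_len):
    """Split a list of samples into consecutive segments of segment_len samples."""
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]

def speech_presence_probability(segment, noise_floor=0.01, steepness=1.0):
    """Toy stand-in for a VAD (assumption, not the method of the application):
    map the segment energy, in dB above an assumed noise floor, to [0, 1]
    with a logistic curve centered 6 dB above that floor."""
    if not segment:
        return 0.0
    energy = sum(x * x for x in segment) / len(segment)
    level_db = 10.0 * math.log10((energy + 1e-12) / (noise_floor ** 2))
    return 1.0 / (1.0 + math.exp(-steepness * (level_db - 6.0)))

# Three segments: near silence, moderate level, loud.
center = [0.001] * 4 + [0.02] * 4 + [0.5] * 4
probs = [speech_presence_probability(s) for s in split_segments(center, 4)]
```

A real detector would use spectral or model-based features rather than raw energy; the only point here is that each segment receives its own probability, which later steps consume.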

S103: noise reduction processing is performed on the original center channel to obtain a noise-reduced center channel.

After the original center channel is extracted from the audio, although the original center channel is mainly human voice, there still exist some interference sounds, and in order to enhance the human voice, it is also necessary to remove these interference sounds, that is, perform noise reduction processing on the original center channel to obtain a noise-reduced center channel.

In practical applications, various manners such as a dynamic range control DRC or various filters may be adopted to perform noise reduction processing on the original center channel, as long as the noise in the original center channel can be removed, and the specific denoising manner adopted in step S103 is not limited herein.

It should be noted that steps S102 and S103 may or may not be performed simultaneously, and their order of execution is not particularly limited here. The subsequent step S104 requires the results of both step S102 and step S103.

S104: the original center channel and the noise-reduced center channel are weighted based on the speech existence probability of each segment in the original center channel.

The higher the speech existence probability of a segment is, the larger the weight given to the corresponding segment of the noise-reduced center channel during mixing. Conversely, the lower the speech existence probability of a segment is, the larger the weight given to the corresponding segment of the original center channel during mixing.

For example, assume that the original center channel is divided into 3 segments: segment a1, segment b1 and segment c1. Through step S102 above, it is determined that the speech existence probability of segment a1 is 90%, that of segment b1 is 50%, and that of segment c1 is 20%. Through step S103, the corresponding segments of the noise-reduced center channel are segment a2, segment b2 and segment c2. In this step, the original center channel and the noise-reduced center channel are mixed to different degrees according to the speech existence probability of each segment, that is, weighted, with the noise-reduced segment receiving the speech existence probability as its weight. The weighted center channel may then be: [segment a2 × 90% + segment a1 × (1 − 90%)] + [segment b2 × 50% + segment b1 × (1 − 50%)] + [segment c2 × 20% + segment c1 × (1 − 20%)].

Of course, if the speech existence probability of a segment is 100%, which indicates that human voice is certainly present in the segment, the corresponding segment of the noise-reduced center channel is used for that segment. If the speech existence probability of a segment is 0%, which indicates that no human voice is present in the segment, the corresponding segment of the original center channel is used for that segment.
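The weighting rule of this step, including the 100% and 0% cases, can be sketched as below. The function name and the list-of-lists representation of segments are assumptions made for illustration; the sketch follows the rule that a higher speech existence probability gives the noise-reduced segment more weight.

```python
def weight_center(original_segments, denoised_segments, probabilities):
    """Blend each original segment with its noise-reduced counterpart.

    p is the speech existence probability of the segment: the noise-reduced
    segment gets weight p and the original segment gets weight 1 - p.
    """
    weighted = []
    for orig, den, p in zip(original_segments, denoised_segments, probabilities):
        weighted.append([p * d + (1.0 - p) * o for o, d in zip(orig, den)])
    return weighted

# Original samples are all 1.0 and noise-reduced samples all 0.0, so the
# output shows directly how much of each version survives.
original = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
denoised = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
out = weight_center(original, denoised, [1.0, 0.5, 0.0])
```

With a probability of 1.0 the output segment equals the noise-reduced segment, and with 0.0 it equals the original segment, matching the two limiting cases described above.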

In this way, the parts of the center channel that contain human voice are denoised more accurately, while the sound of the parts without human voice is retained as far as possible, so that the human voice is accurately enhanced and the definition and intelligibility of the content in the audio are improved.

S105: the weighted center channel is mixed with the channels of the audio other than the original center channel to obtain the mixed audio.

After the voice in the center channel of the audio is accurately enhanced, generally, the probability of the voice existing in other channels except the center channel in the audio is low, and even if the voice exists in other channels, the voice content in other channels is not as important as the content of the voice in the center channel, so that the weighted center channel can be directly mixed with other channels without accurately enhancing the voice of other channels.

Of course, the above steps S102 to S104 may be performed again for other channels, and then in this step, the weighted center channel and the weighted other channels may be mixed. Therefore, various human voices in the audio can be further enhanced, and the intelligibility of all contents in the audio is further improved.

The specific method used for mixing the weighted center channel with other channels may be various current mixing methods, and is not limited in this respect.
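Since the mixing method is left open, the following is only one possible sketch: the weighted center channel is added, sample by sample, into each of the other channels with fixed gains, and the result is clipped to the range [-1, 1]. The function name, the gain parameters and the clipping range are assumptions for illustration.

```python
def mix_channels(center, others, center_gain=0.5, other_gain=0.5):
    """Mix the center channel into every other channel, sample by sample.

    center: list of samples of the weighted center channel.
    others: list of channels, each a list of samples.
    The sum is clipped to [-1, 1] (an assumed normalized sample range).
    """
    mixed = []
    for channel in others:
        mixed.append([
            max(-1.0, min(1.0, center_gain * c + other_gain * x))
            for c, x in zip(center, channel)
        ])
    return mixed

center = [0.5, -0.5]
others = [[0.5, 0.5], [1.0, -1.0]]
mixed = mix_channels(center, others, center_gain=1.0, other_gain=1.0)
```

Each channel in `others` yields one output channel, so a two-channel input produces two mixed output channels.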

As can be seen from the above, in the audio processing method provided in this embodiment of the present application, an original center channel is extracted from an audio, then a voice existence probability of each segment in the original center channel is determined, and noise reduction processing is performed on the original center channel to obtain a noise-reduced center channel, then, the original center channel and the noise-reduced center channel are weighted based on the voice existence probability of each segment in the original center channel, where the higher the voice existence probability of the corresponding segment is, the greater the weight occupied by the corresponding segment in the noise-reduced center channel is when mixing audio, and finally, the weighted center channel and other channels except the original center channel in the audio are mixed audio to obtain the mixed audio. Therefore, the human voice in the center sound channel can be accurately enhanced, and the definition and intelligibility of the content in the audio can be improved to a greater extent.

Further, as a refinement and extension of the method shown in fig. 1, an embodiment of the present application further provides an audio processing method. Fig. 2 is a schematic flowchart of a second audio processing method in an embodiment of the present application, and referring to fig. 2, the method may include:

S201: an adaptive weighting algorithm is used to extract the original center channel from the audio with multiple channels.

When the original center channel is extracted from the audio, an adaptive weighting (Adaptive Panning) algorithm can be used regardless of whether the audio is two-channel or multi-channel. In this way, the center channel can be extracted more accurately, which facilitates accurate enhancement of the human voice in the audio. Of course, other algorithms may be used as long as the center channel can be extracted from the audio, which is not particularly limited herein.
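The application names the adaptive weighting algorithm without giving its details, so the sketch below does not implement it. It only illustrates the general idea of extracting content common to two channels: for each short frame, the mid signal (L + R) / 2 is scaled by a similarity measure between the left and right frames. The function name, the frame length and the similarity formula are assumptions chosen for illustration.

```python
def extract_center(left, right, frame_len=4):
    """Simplified center extraction from a left/right pair (illustrative only).

    For each frame, similarity = 2 * sum(l * r) / (sum(l^2) + sum(r^2)),
    floored at 0, which is 1 when both frames are identical and 0 when they
    share no common content. The mid signal is scaled by this value.
    """
    center = []
    for start in range(0, min(len(left), len(right)), frame_len):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        cross = sum(a * b for a, b in zip(l, r))
        power = sum(a * a for a in l) + sum(b * b for b in r)
        similarity = max(0.0, 2.0 * cross / power) if power > 0.0 else 0.0
        center.extend(similarity * (a + b) / 2.0 for a, b in zip(l, r))
    return center

same = extract_center([0.5] * 4, [0.5] * 4)
disjoint = extract_center([0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5])
```

Content that is identical in both channels passes through unchanged, while frames whose left and right signals do not overlap contribute nothing to the extracted center.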

S202: the probability of speech presence for each segment in the original center channel is determined.

Here, step S202 is the same as or similar to the specific implementation of step S102, and therefore is not described herein again.

S203: the speech existence probability of each segment in the original center channel is smoothed.

After the original center channel is divided into segments and the existence probabilities of voices corresponding to the segments are determined, if the difference between the existence probabilities of voices of adjacent segments is large, when the corresponding segments in the original center channel and the noise-reduced center channel are weighted, the center channel after weighting may have a discontinuous voice condition, so that the quality of the processed audio is reduced. Therefore, after determining the existence probability of voices of each segment in the original center channel, smoothing processing needs to be performed on the existence probabilities of the voices.

For example, assume that the original center channel is divided into 5 segments, namely segment a, segment b, segment c, segment d and segment e, whose speech existence probabilities are 50%, 60%, 90%, 20% and 70% respectively. If the corresponding segments of the original center channel and the noise-reduced center channel are weighted with these probabilities and the weighted segments are then spliced into one piece of audio, playback is not smooth, that is, the sound is intermittent. The speech existence probability of each segment therefore needs to be smoothed. During playback, the probability first changes from 50% to 60% and then to 90%; the second change is large and needs to be smoothed, so 60% is adjusted to 70%. The probability then changes from 90% to 20% and then to 70%; here the change is again large, so 20% is adjusted to 50%. After smoothing, the speech existence probabilities of the segments change from 50%, 60%, 90%, 20%, 70% to 50%, 70%, 90%, 50%, 70%.

Of course, the specific numerical values above are merely for clarity of illustration. Various smoothing algorithms may be used to smooth the speech existence probability of each segment, and the results obtained after smoothing may differ accordingly, which is not specifically limited herein.
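As one possible smoothing choice, the sketch below applies a centered moving average to the per-segment probabilities. This is an assumed stand-in and does not reproduce the illustrative adjusted values given in the example above; the function name and the `radius` parameter are hypothetical.

```python
def smooth_probabilities(probs, radius=1):
    """Centered moving average over the per-segment probabilities.

    Each value is replaced by the mean of itself and up to `radius`
    neighbours on each side; the window shrinks at both ends.
    """
    smoothed = []
    for i in range(len(probs)):
        window = probs[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def max_jump(values):
    """Largest absolute difference between adjacent values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))

raw = [0.5, 0.6, 0.9, 0.2, 0.7]
smoothed = smooth_probabilities(raw)
```

For these inputs the largest jump between adjacent segments shrinks after smoothing, which is the property the step relies on to avoid audible discontinuities.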

S204: noise reduction processing is performed on the original center channel using a speech enhancement algorithm based on short-time log-spectrum estimation to obtain a noise-reduced center channel.

In the process of performing noise reduction processing on the original center channel, in order to increase the noise reduction processing speed on the original center channel and further increase the processing efficiency of the audio, a speech enhancement algorithm based on short-time log spectrum estimation may be specifically adopted to perform noise reduction processing on the original center channel. Compared with other noise reduction algorithms, the convergence speed of the speech enhancement algorithm based on short-time log-spectrum estimation is higher in the process of noise reduction processing of the sound channel, so that noise reduction of the original center sound channel can be completed faster, the noise-reduced center sound channel can be obtained faster, and the processing efficiency of the audio frequency is further improved.

Here, the steps S202 to S203 and the present step S204 may be executed at the same time or at different times. The sequence of the steps S202 to S203 and the step S204 is not limited specifically, and the following step S205 may be executed continuously only after the steps S203 and S204 are completed.

S205: the original center channel and the noise-reduced center channel are weighted based on the speech existence probability of each segment in the original center channel.

In weighting the two center channels, the weighting may be performed in the specific manner described in the foregoing step S104.

To further increase the weighting speed, when the first speech existence probability of a first segment in the original center channel is higher than 50% and higher than the second speech existence probability of a second segment, the weighting process may be simplified: in the finally formed center channel, each segment is taken either from the original center channel or from the noise-reduced center channel. Which of the two is used for a given segment is determined by its speech existence probability.

For example, assume that the first segment in the original center channel corresponds to a first speech existence probability of 80% and the second segment corresponds to a second speech existence probability of 30%. This indicates that the probability of the presence of the human voice in the first segment is high and the probability of the presence of the human voice in the second segment is low. At this time, in order to complete the weighting process as soon as possible, the weighting calculation process may be simplified, and the segment corresponding to the first segment in the noise-reduced center channel may be directly used to be spliced with the second segment in the original center channel.

Therefore, on the basis of ensuring the enhancement of the human voice, the weighting calculation speed can be increased, and the audio processing efficiency is further improved.
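The simplified splicing can be sketched as a per-segment selection. The 50% threshold follows the text; the function name and the data layout are assumptions for illustration.

```python
def select_segments(original_segments, denoised_segments, probabilities, threshold=0.5):
    """Per segment, take the noise-reduced version when the speech existence
    probability exceeds the threshold, otherwise keep the original, and
    splice the chosen segments into one list of samples."""
    spliced = []
    for orig, den, p in zip(original_segments, denoised_segments, probabilities):
        spliced.extend(den if p > threshold else orig)
    return spliced

original = [[1.0, 1.0], [2.0, 2.0]]
denoised = [[0.1, 0.1], [0.2, 0.2]]
# First segment: 80% speech probability; second segment: 30%.
spliced = select_segments(original, denoised, [0.8, 0.3])
```

Here the first segment comes from the noise-reduced channel and the second from the original channel, so no per-sample multiplication is needed.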

S206: the weighted center channel is smoothed.

The weighted center channel is obtained by weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel. In order to make the audio frequency after final processing smoother during playing and avoid discontinuity, the weighted center channel may be smoothed again.

In a specific implementation, the weighted center channel may be smoothed by any one or more smoothing methods. The specific smoothing method used in this step is not limited here.

After the weighted center sound channel is subjected to smoothing processing, the processed center sound channel is smoother and has no discontinuity, so that the center sound channel can be prevented from causing the final audio to be discontinuous, and the smoothness of the final output audio is improved.
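Because the smoothing method for this step is not fixed, the following is only one assumed option: samples within a short window around each segment boundary are replaced by a local moving average, leaving the interior of each segment untouched. The function name and the `half_width` parameter are hypothetical.

```python
def smooth_boundaries(segments, half_width=2):
    """Concatenate segments and smooth only the samples near each junction.

    For every boundary between two segments, the samples within half_width
    of the boundary are replaced by a moving average computed on the
    unsmoothed signal, which softens abrupt level changes at the splice.
    """
    signal = [x for seg in segments for x in seg]
    out = list(signal)
    boundary = 0
    for seg in segments[:-1]:
        boundary += len(seg)
        lo = max(0, boundary - half_width)
        hi = min(len(signal), boundary + half_width)
        for i in range(lo, hi):
            window = signal[max(0, i - half_width): i + half_width + 1]
            out[i] = sum(window) / len(window)
    return out

# A hard step from 0.0 to 1.0 at the segment boundary becomes a short ramp.
ramp = smooth_boundaries([[0.0] * 4, [1.0] * 4])
```

In practice a crossfade between overlapping segments is a common alternative; the moving average is used here only because it needs no overlap between segments.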

S207: the smoothed center channel is mixed with the channels of the audio other than the original center channel to obtain the mixed audio.

Here, step S207 is the same as or similar to the specific implementation of step S105, and therefore is not described herein again.

In practical applications, audio is often output by a multi-channel/stereo system, that is, an audio device, which divides the audio into a left channel (L) and a right channel (R) and outputs them separately. In this case, the audio processing method provided by the embodiment of the application extracts a center channel from the left channel and the right channel, determines the speech existence probability of the extracted center channel and performs noise reduction on it, mixes the result with the left channel and the right channel respectively, and finally outputs the processed left channel and the processed right channel, that is, the processed audio.

Fig. 3 is a third schematic flowchart of an audio processing method in an embodiment of the present application. Referring to fig. 3, after a multi-channel/stereo system outputs a left channel (L) and a right channel (R), a center channel (C) is first extracted from the left channel (L) and the right channel (R). Then, voice noise reduction is performed on the center channel (C) to obtain a noise-reduced center channel (C'), and voice endpoint detection is performed on the center channel (C). Finally, based on the voice endpoint detection result, the noise-reduced center channel (C') is mixed with the left channel (L) and with the right channel (R) respectively, and the processed left channel (L') and right channel (R') are output, so that the human voice in the audio is enhanced and the definition and intelligibility of the audio content are improved.
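The flow of fig. 3 can be sketched end to end as below, under several assumptions that are not in the application: the center channel is approximated by the mid signal (L + R) / 2, the noise reduction and detection stages are passed in as placeholder callables, and mixing is done by removing the original center from each side and adding the weighted center back. All names are hypothetical.

```python
def process_stereo(left, right, segment_len, denoise, vad):
    """Return (left_out, right_out) for equal-length sample lists.

    denoise: callable mapping the center channel to a noise-reduced one.
    vad: callable mapping one segment to a speech probability in [0, 1].
    Both are placeholders supplied by the caller.
    """
    # C: simple mid signal, used here in place of a real center extractor.
    center = [(l + r) / 2.0 for l, r in zip(left, right)]
    # C': noise-reduced center channel.
    denoised = denoise(center)

    # Weight C and C' segment by segment with the speech probability.
    weighted = []
    for start in range(0, len(center), segment_len):
        seg = center[start:start + segment_len]
        den = denoised[start:start + segment_len]
        p = vad(seg)
        weighted.extend(p * d + (1.0 - p) * o for o, d in zip(seg, den))

    # Mix: keep what remains of each side without C, add the weighted center.
    left_out = [(l - c) + w for l, c, w in zip(left, center, weighted)]
    right_out = [(r - c) + w for r, c, w in zip(right, center, weighted)]
    return left_out, right_out


# Placeholder stages for demonstration only.
def halve(channel):
    return [0.5 * x for x in channel]

def loud(segment):
    return 1.0 if max(abs(x) for x in segment) > 0.5 else 0.0

left_out, right_out = process_stereo(
    [1.0, 1.0, 0.2, 0.2], [1.0, 1.0, 0.0, 0.0], 2, halve, loud)
```

With this mixing choice, a segment whose speech probability is 0 leaves L and R unchanged, while a segment with probability 1 has its center content fully replaced by the noise-reduced version.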

Based on the same inventive concept, as an implementation of the method, the embodiment of the application further provides an audio processing device. Fig. 4 is a schematic structural diagram of an audio processing apparatus in an embodiment of the present application, and referring to fig. 4, the apparatus may include:

a signal separation unit 401 for extracting an original center channel from audio having multiple channels;

a voice activity detection unit 402 for determining a speech presence probability for each segment in the original center channel;

a speech denoising unit 403, configured to perform noise reduction on the original center channel to obtain a noise-reduced center channel;

a center channel processing unit 404, configured to weight the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, where the higher the speech existence probability of a segment is, the larger the weight of the corresponding segment of the noise-reduced center channel is during mixing;

and a mixing unit 405, configured to mix the weighted center channel with other channels in the audio except for the original center channel to obtain a mixed audio.
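
As a non-limiting illustration of the weighting performed by the center channel processing unit 404, the sketch below blends the two center channels segment by segment. The linear rule `p * denoised + (1 - p) * original`, the fixed segment length, and the names `weight_center` and `segment_len` are assumptions; the embodiment only requires that a higher speech existence probability gives the noise-reduced center channel a larger weight.

```python
from typing import List, Sequence


def weight_center(
    original: Sequence[float],
    denoised: Sequence[float],
    speech_prob: Sequence[float],
    segment_len: int,
) -> List[float]:
    """Blend per segment: out = p * denoised + (1 - p) * original."""
    if len(original) != len(denoised):
        raise ValueError("original and denoised must have the same length")
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")

    weighted = []
    for n, (orig, den) in enumerate(zip(original, denoised)):
        seg = n // segment_len  # index of the segment containing sample n
        p = min(max(speech_prob[seg], 0.0), 1.0)  # clamp to [0, 1]
        weighted.append(p * den + (1.0 - p) * orig)
    return weighted
```

Under this rule, a segment with probability 1 is taken entirely from the noise-reduced center channel, and a segment with probability 0 is taken entirely from the original center channel.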

In other embodiments of the present application, the speech denoising unit 403 is specifically configured to perform denoising processing on the original center channel by using a speech enhancement algorithm based on short-time log spectrum estimation.
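
As a non-limiting illustration, if the speech enhancement algorithm based on short-time log spectrum estimation is taken to be the minimum mean-square error log-spectral amplitude estimator of Ephraim and Malah (an assumption; the embodiment does not give formulas), the gain applied to the k-th frequency bin of each short-time frame can be written as follows. The symbols $\xi_k$ (a priori signal-to-noise ratio), $\gamma_k$ (a posteriori signal-to-noise ratio), $R_k$ (noisy spectral amplitude) and $\hat{A}_k$ (estimated clean amplitude) are introduced here for illustration only.

```latex
v_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k, \qquad
G(\xi_k,\gamma_k) = \frac{\xi_k}{1+\xi_k}
  \exp\!\left(\frac{1}{2}\int_{v_k}^{\infty}\frac{e^{-t}}{t}\,dt\right), \qquad
\hat{A}_k = G(\xi_k,\gamma_k)\,R_k
```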

In other embodiments of the present application, the voice activity detection unit 402 is further configured to smooth the speech existence probability of each segment in the original center channel; the center channel processing unit 404 is specifically configured to weight the original center channel and the noise-reduced center channel based on the smoothed speech existence probability of each segment in the original center channel.
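
As a non-limiting illustration, the smoothing of the per-segment speech existence probability could be a first-order recursive average. The choice of this filter, the function name `smooth_probabilities` and the smoothing factor `alpha` are assumptions; the embodiment does not name a particular smoothing method.

```python
from typing import List, Optional, Sequence


def smooth_probabilities(probs: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Recursive smoothing: s[k] = alpha * s[k-1] + (1 - alpha) * p[k]."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    smoothed: List[float] = []
    prev: Optional[float] = None
    for p in probs:
        # The first segment has no history, so it passes through unchanged.
        prev = p if prev is None else alpha * prev + (1.0 - alpha) * p
        smoothed.append(prev)
    return smoothed
```

A larger `alpha` makes the probability track changes more slowly, which reduces abrupt switching between the original and noise-reduced center channels at the cost of a slower response at speech onsets.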

In other embodiments of the present application, the original center channel includes a first segment and a second segment; a first speech existence probability corresponding to the first segment is higher than 50% and higher than a second speech existence probability corresponding to the second segment; the center channel processing unit 404 is specifically configured to splice the first segment and the second segment based on the first speech existence probability and the second speech existence probability.

In other embodiments of the present application, the mixing unit 405 is further configured to smooth the weighted center channel, and to mix the smoothed center channel with the other channels in the audio except the original center channel.
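
The embodiment does not say how the weighted center channel is smoothed. One plausible purpose is to avoid audible discontinuities where the blend weight jumps between adjacent segments. As a non-limiting sketch under that assumption, the per-segment weights can be expanded to per-sample weights that ramp linearly at each segment boundary; note that this acts on the weights rather than on the already weighted signal. The names `ramp_weights`, `segment_len` and `fade_len` are hypothetical.

```python
from typing import List, Optional, Sequence


def ramp_weights(seg_weights: Sequence[float], segment_len: int, fade_len: int) -> List[float]:
    """Expand per-segment weights to per-sample weights with linear ramps.

    The first fade_len samples of each segment move linearly from the
    previous segment's weight toward the current one instead of jumping.
    """
    if segment_len <= 0 or not 0 <= fade_len <= segment_len:
        raise ValueError("need segment_len > 0 and 0 <= fade_len <= segment_len")
    out: List[float] = []
    prev: Optional[float] = None
    for w in seg_weights:
        for i in range(segment_len):
            if prev is not None and i < fade_len:
                t = (i + 1) / (fade_len + 1)  # fraction of the way to w
                out.append(prev + t * (w - prev))
            else:
                out.append(w)
        prev = w
    return out
```

Setting `fade_len` to 0 reproduces the unsmoothed, step-wise weights, so the two behaviors can be compared directly.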

In other embodiments of the present application, the signal separation unit 401 is specifically configured to extract an original center channel from audio with multiple channels by using an adaptive weighting algorithm.
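
This passage does not spell out the adaptive weighting algorithm. Purely as an illustrative stand-in, the sketch below scales the mid signal (L + R) / 2 of each frame by a weight derived from the normalized correlation between the left and right channels, so that frames in which the two channels are similar contribute more to the center estimate. The function name `extract_center`, the parameter `frame_len` and the correlation-based weight are assumptions and are not taken from the embodiment.

```python
import math
from typing import List, Sequence


def extract_center(left: Sequence[float], right: Sequence[float], frame_len: int) -> List[float]:
    """Per-frame center estimate: w * (L + R) / 2, with w from L/R similarity."""
    if len(left) != len(right):
        raise ValueError("left and right must have the same length")
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    center: List[float] = []
    for start in range(0, len(left), frame_len):
        fl = left[start:start + frame_len]
        fr = right[start:start + frame_len]
        cross = sum(a * b for a, b in zip(fl, fr))
        energy = math.sqrt(sum(a * a for a in fl) * sum(b * b for b in fr))
        # Normalized correlation lies in [-1, 1]; a negative or undefined
        # correlation is treated as "no common content" (weight 0).
        w = max(0.0, cross / energy) if energy > 0.0 else 0.0
        center.extend(w * 0.5 * (a + b) for a, b in zip(fl, fr))
    return center
```

In this sketch, identical left and right frames yield a weight of 1 and the frame is passed to the center estimate unchanged, whereas a frame present in only one channel yields a weight of 0.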

It should be noted that the description of the above apparatus embodiments is similar to the description of the above method embodiments, and the apparatus embodiments have advantageous effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, refer to the description of the method embodiments of the present application.

Based on the same inventive concept, an embodiment of the present application further provides an electronic device. Fig. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application. Referring to Fig. 5, the electronic device may include: a processor 501, a memory 502, and a bus 503. The processor 501 and the memory 502 communicate with each other through the bus 503, and the processor 501 is configured to call program instructions in the memory 502 to perform the method in one or more of the embodiments described above.

It should be noted that the description of the above electronic device embodiments is similar to the description of the above method embodiments, and the electronic device embodiments have advantageous effects similar to those of the method embodiments. For technical details not disclosed in the electronic device embodiments of the present application, refer to the description of the method embodiments of the present application.

Based on the same inventive concept, the present application further provides a computer-readable storage medium, on which a computer program is stored, wherein when the program runs, the apparatus on which the storage medium is located is controlled to perform the method in one or more of the above embodiments.

It should be noted that the description of the above storage medium embodiments is similar to the description of the above method embodiments, and the storage medium embodiments have advantageous effects similar to those of the method embodiments. For technical details not disclosed in the storage medium embodiments of the present application, refer to the description of the method embodiments of the present application.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of audio processing, the method comprising:

extracting an original center channel from audio having multiple channels;

determining a speech existence probability of each segment in the original center channel;

denoising the original center channel to obtain a noise-reduced center channel;

weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, wherein the higher the speech existence probability of a segment is, the higher the weight of the corresponding segment of the noise-reduced center channel is during mixing;

and mixing the weighted center channel with the other channels in the audio except the original center channel to obtain mixed audio.

2. The method of claim 1, wherein the denoising the original center channel comprises:

performing noise reduction on the original center channel by using a speech enhancement algorithm based on short-time log spectrum estimation.

3. The method of claim 1, wherein after said determining the probability of speech presence for each segment in the original center channel, the method further comprises:

smoothing the speech existence probability of each segment in the original center channel;

and the weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel comprises:

weighting the original center channel and the noise-reduced center channel based on the smoothed speech existence probability of each segment in the original center channel.

4. The method of claim 1, wherein the original center channel comprises a first segment and a second segment; a first speech existence probability corresponding to the first segment is higher than 50% and higher than a second speech existence probability corresponding to the second segment; and the weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel comprises:

splicing the first segment with the second segment based on the first speech existence probability of the first segment and the second speech existence probability of the second segment.

5. The method of claim 1, wherein after the weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, the method further comprises:

smoothing the weighted center channel;

and the mixing the weighted center channel with the other channels in the audio except the original center channel comprises:

mixing the smoothed center channel with the other channels in the audio except the original center channel.

6. The method according to any one of claims 1 to 5, wherein the extracting an original center channel from audio having multiple channels comprises:

extracting the original center channel from the audio having multiple channels by using an adaptive weighting algorithm.

7. An audio processing apparatus, characterized in that the apparatus comprises:

a signal separation unit for extracting an original center channel from audio having multiple channels;

a voice activity detection unit for determining a speech presence probability of each segment in the original center channel;

a speech denoising unit for performing noise reduction on the original center channel to obtain a noise-reduced center channel;

a center channel processing unit for weighting the original center channel and the noise-reduced center channel based on the speech existence probability of each segment in the original center channel, wherein the higher the speech existence probability of a segment is, the larger the weight of the corresponding segment of the noise-reduced center channel is during mixing;

and a mixing unit for mixing the weighted center channel with the other channels in the audio except the original center channel to obtain mixed audio.

8. The apparatus according to claim 7, wherein the speech denoising unit is specifically configured to perform noise reduction on the original center channel by using a speech enhancement algorithm based on short-time log spectrum estimation.

9. An electronic device, characterized in that the electronic device comprises: a processor, a memory, and a bus; wherein the processor and the memory communicate with each other through the bus; and the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1 to 6.

10. A computer-readable storage medium, having a computer program stored thereon, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the method of any of claims 1-6.