CN114550748A - Audio signal mixing processing method, device, equipment and storage medium - Google Patents

Publication number
CN114550748A
Authority
CN
China
Prior art keywords
audio signal
voice
audio
mixing processing
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210039333.XA
Other languages
Chinese (zh)
Inventor
方兵晓
张帆
刘梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Singapore Pte Ltd
Original Assignee
Bigo Technology Singapore Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bigo Technology Singapore Pte Ltd
Priority to CN202210039333.XA
Publication of CN114550748A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01 Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved

Abstract

The embodiment of the invention discloses an audio signal mixing processing method, an audio signal mixing processing device, audio signal mixing processing equipment and a storage medium, wherein the method comprises the following steps: receiving each path of audio signals containing voice endpoint detection results; determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal; and in the process of mixing the audio signals, carrying out sound mixing processing on the audio signals corresponding to the voice scene and the non-voice scene through different sound mixing processing algorithms. The scheme improves the listening experience of a user and reduces the processing complexity of subsequent audio signals.

Description

Audio signal mixing processing method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of audio processing, in particular to an audio signal mixing processing method, device, equipment and storage medium.
Background
With the development of communication technology, online voice chat and multi-person video conferencing have become important ways for people to exchange information, chat and work. In a multi-person conference or chat scene, sound mixing is the technique of combining the channel signals sent by a plurality of audio sending ends into a single signal, which is then played through a loudspeaker or an earphone to the listener at the audio receiving end.
In the prior art, a fixed mixing algorithm is usually adopted to mix multiple audio signals. However, this approach ignores the complexity of multi-party mixing: noise signals and voice signals appear at random in each channel signal, so the noise content after mixing may increase or the amplitude of the voice signal in speech frames may decrease. The quality of the mixed audio signal is therefore difficult to guarantee, and subsequent processing algorithms become harder. For example, the proportion of double talk seen by echo cancellation increases and the a priori signal-to-noise ratio during noise suppression decreases, which raises the difficulty of subsequent processing and reduces the listening comfort of the audience.
Disclosure of Invention
The embodiment of the invention provides an audio signal mixing processing method, device, equipment and storage medium, which solve the prior-art problems of poor flexibility in audio signal mixing processing, high noise in the resulting mixed audio signal and indistinct voice, thereby improving the listening experience of the user and reducing the processing complexity of subsequent audio signals.
In a first aspect, an embodiment of the present invention provides an audio signal mixing processing method, where the method includes:
receiving each path of audio signals containing voice endpoint detection results;
determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal;
and in the process of mixing the audio signals, carrying out sound mixing processing on the audio signals corresponding to the voice scene and the non-voice scene through different sound mixing processing algorithms.
In a second aspect, an embodiment of the present invention further provides another audio signal mixing processing method, where the method includes:
performing framing processing on the audio signals to be mixed;
carrying out feature extraction on the audio signal subjected to framing processing, carrying out voice endpoint detection according to a feature extraction result, and generating a voice endpoint detection result containing identification information;
and sending the audio signal containing the voice endpoint detection result for audio signal mixing processing.
In a third aspect, an embodiment of the present invention further provides an audio signal mixing processing apparatus, including:
the audio signal receiving module is used for receiving each path of audio signals containing voice endpoint detection results;
the scene judgment module is used for determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each path of audio signal;
and the audio mixing processing module is used for carrying out audio mixing processing on each path of audio signals corresponding to the voice scene and the non-voice scene through different audio mixing processing algorithms in the audio signal mixing processing process.
In a fourth aspect, an embodiment of the present invention further provides another audio signal mixing processing apparatus, including:
the audio framing module is used for framing the audio signals to be mixed;
the voice endpoint detection module is used for extracting the characteristics of the audio signal after the framing processing, detecting the voice endpoint according to the characteristic extraction result and generating a voice endpoint detection result containing identification information;
and the audio signal sending module is used for sending the audio signal containing the voice endpoint detection result for audio signal mixing processing.
In a fifth aspect, an embodiment of the present invention further provides an audio signal mixing processing apparatus, where the apparatus includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the audio signal mixing processing method according to the embodiment of the present invention.
In a sixth aspect, the present invention further provides a storage medium storing computer-executable instructions, which are used to execute the audio signal mixing processing method according to the present invention when executed by a computer processor.
In the embodiment of the invention, each path of audio signal containing the voice endpoint detection result is received, the voice scene and the non-voice scene in the mixed audio signal are determined according to the voice endpoint detection result of each path of audio signal, and in the process of mixing the audio signals, each path of audio signal corresponding to the voice scene and the non-voice scene is subjected to audio mixing processing through different audio mixing processing algorithms. The problems that in the prior art, the flexibility is poor during audio signal mixing processing, the obtained mixed audio signal is high in noise, and voice sound is not obvious are solved, the listening experience of a user is improved, and the processing complexity of subsequent audio signals is reduced.
Drawings
Fig. 1 is a flowchart of an audio signal mixing processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of another audio signal mixing processing method according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating a scene of a mixed audio signal determined during a multi-channel audio signal mixing process according to an embodiment of the present invention;
fig. 4 is a flowchart of another audio signal mixing processing method according to an embodiment of the present invention;
fig. 5 is a flowchart of a method for determining a channel signal to be mixed according to a voice endpoint detection result of each channel of audio signals according to an embodiment of the present invention;
fig. 6 is a flowchart of another audio signal mixing processing method according to an embodiment of the present invention;
fig. 7 is a block diagram of an audio signal mixing processing apparatus according to an embodiment of the present invention;
fig. 8 is a block diagram of another audio signal mixing and processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an audio signal mixing processing device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and do not limit them. It should be further noted that, for convenience of description, only some structures, not all structures, relating to the embodiments of the present invention are shown in the drawings.
The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the application may be practiced in sequences other than those illustrated or described herein, and that the objects distinguished by "first," "second," and the like are generally of one kind and their number is not limited, e.g., the first object can be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally means that the preceding and succeeding related objects are in an "or" relationship.
Fig. 1 is a flowchart of an audio signal mixing processing method according to an embodiment of the present invention, which can be applied to perform audio mixing processing on multiple audio signals to obtain one audio signal, and the method can be executed by an audio receiving device such as a server, an intelligent terminal, a notebook, a tablet computer, and a voice/video conference device, and specifically includes the following steps:
step S101, receiving each audio signal containing the voice endpoint detection result.
The voice endpoint detection result indicates whether voice is present in the audio signal, as in common VAD (Voice Activity Detection). Performing VAD on the audio signal marks its speech frames and non-speech frames, which yields the voice endpoint detection result. Optionally, the audio signal of a speech frame may be labeled VAD = 1, and the audio signal of a non-speech frame, i.e., noise, may be labeled VAD = 0.
In one embodiment, when the audio signal sending end sends the audio signal, the audio signal sending end performs voice endpoint detection on the audio signal to generate an audio signal containing a voice endpoint detection result to send to the audio receiving end. The audio receiving end receives each audio signal containing the voice endpoint detection result.
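As a rough illustration of an audio signal that carries its voice endpoint detection result, the sketch below attaches a VAD flag to each frame before sending. The names `TaggedFrame`, `tag_frames`, `channel_id` and `energy_detector` are hypothetical, and the energy threshold is only a stand-in detector chosen to keep the example self-contained; it is not the detector described in this application.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TaggedFrame:
    """One audio frame plus the VAD flag attached by the sending end."""
    channel_id: int        # hypothetical identifier of the sending end
    index: int             # position of the frame within the stream
    samples: List[float]
    vad: int               # 1 = speech frame, 0 = non-speech (noise) frame


def tag_frames(channel_id: int,
               frames: Sequence[Sequence[float]],
               is_speech: Callable[[Sequence[float]], bool]) -> List[TaggedFrame]:
    """Run the detector on every frame and store its result as VAD = 1 or 0."""
    return [TaggedFrame(channel_id, i, list(f), 1 if is_speech(f) else 0)
            for i, f in enumerate(frames)]


def energy_detector(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Stand-in detector (assumption): mean squared amplitude above a threshold."""
    return sum(x * x for x in frame) / max(len(frame), 1) > threshold


frames = [[0.0, 0.01, -0.01], [0.5, -0.4, 0.3], [0.02, 0.0, -0.02]]
tagged = tag_frames(channel_id=2, frames=frames, is_speech=energy_detector)
print([t.vad for t in tagged])
```

Passing the detector in as a callable keeps the tagging step independent of how speech is actually detected.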
Step S102, determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal.
In one embodiment, when the audio receiving end performs mixing processing on each audio signal, a voice scene and a non-voice scene in the mixed audio signal are determined according to the voice endpoint detection result of each audio signal. A voice scene means that, when the audio signals are mixed, at least one of them contains voice, that is, the mixed audio signal includes voice. A non-voice scene means that, when the audio signals are mixed, none of them contains voice, that is, the mixed audio signal carries no voice information.
In one embodiment, the voice endpoint detection result includes a voice identifier for identifying the voice portion and the non-voice portion of the audio signal. The voice scene and the non-voice scene in the mixed audio signal are then determined according to the voice identifiers of the audio signals at the same moment. Optionally, the same moment may be the time corresponding to each frame of the audio signal; that is, whether the mixed audio signal at that time is a voice scene or a non-voice scene is decided by whether the frames of the individual audio signals at that time are speech frames or non-speech frames. Illustratively, for 5 audio signals each 10 seconds long, if from the 1st to the 3rd second the audio frames at the corresponding moments in all 5 audio signals are non-speech frames, then the scene of the mixed audio signal from the 1st to the 3rd second is a non-voice scene; and if from the 3rd to the 10th second the 2nd, 4th and 5th channels contain speech frames, then the scene of the mixed audio signal from the 3rd to the 10th second is a voice scene. In other words, the part in which at least one audio signal has voice at the same moment is determined as the voice scene of the mixed audio signal, and the part in which all audio signals are non-voice at the same moment is determined as the non-voice scene.
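The per-moment decision can be sketched as a logical OR over the VAD flags of all channels. The function name `classify_scenes` and the one-flag-per-second granularity are assumptions made for illustration; the data reproduces the 5-channel, 10-second example.

```python
from typing import List, Sequence


def classify_scenes(vad_flags: Sequence[Sequence[int]]) -> List[int]:
    """vad_flags[c][t] is 1 when channel c holds speech at time slot t.

    Returns one scene label per time slot: 1 = voice scene, 0 = non-voice scene.
    """
    if not vad_flags:
        return []
    n_slots = min(len(flags) for flags in vad_flags)
    # A slot is a voice scene as soon as any channel reports speech in it.
    return [1 if any(flags[t] for flags in vad_flags) else 0
            for t in range(n_slots)]


silent = [0] * 10
talking = [0, 0, 0] + [1] * 7   # speech from the 4th slot onward
# Channels 2, 4 and 5 (indices 1, 3, 4) contain speech; channels 1 and 3 do not.
flags = [silent, talking, silent, talking, talking]
print(classify_scenes(flags))
```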
Step S103, in the process of audio signal mixing processing, performing audio mixing processing on each path of audio signals corresponding to the voice scene and the non-voice scene through different audio mixing processing algorithms.
The audio mixing processing algorithm is an algorithm for mixing the channels of audio signals. In this scheme, the channels are not mixed with a single fixed mixing algorithm; instead, different mixing algorithms are used during the audio signal mixing processing.
Specifically, in the process of mixing the audio signals, the scenes of the mixed audio signal are first distinguished, that is, the voice scene and the non-voice scene are determined as described above, and a different audio mixing algorithm is then adopted for each scene.
Specifically, aiming at a voice scene, a signal amplitude holding algorithm is adopted to perform sound mixing processing on each path of audio signals corresponding to the voice scene; and aiming at the non-voice scene, performing sound mixing processing on each path of audio signal corresponding to the non-voice scene by adopting a multi-path signal summation and averaging algorithm.
According to the scheme, the audio signals containing the voice endpoint detection results are received, the voice scene and the non-voice scene in the mixed audio signals are determined according to the voice endpoint detection results of the audio signals, in the audio signal mixing processing process, the audio signals corresponding to the voice scene and the non-voice scene are subjected to audio mixing processing through different audio mixing processing algorithms, the problem that multiple complex scenes cannot be considered in an adaptive mode due to the fact that the same audio mixing processing algorithm is used is avoided, the flexibility of audio mixing processing is improved, and the listening experience of a user can be remarkably improved.
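The overall flow of steps S101 to S103 can be sketched as a per-frame dispatch between two mixing functions. Everything named here (`mix_by_scene`, `plain_sum`, `average`) is hypothetical, and `plain_sum` is only a placeholder for the signal amplitude holding algorithm, not an implementation of it.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Mixer = Callable[[Sequence[Frame]], Frame]


def mix_by_scene(channels: Sequence[Sequence[Frame]],
                 vad_flags: Sequence[Sequence[int]],
                 voice_mixer: Mixer,
                 non_voice_mixer: Mixer) -> List[Frame]:
    """channels[c][t] is frame t of channel c; vad_flags[c][t] is its VAD flag."""
    if not channels:
        return []
    mixed = []
    n_frames = min(len(ch) for ch in channels)
    for t in range(n_frames):
        frames_t = [ch[t] for ch in channels]
        # Voice scene if any channel reports speech for this frame.
        is_voice = any(flags[t] for flags in vad_flags)
        mixer = voice_mixer if is_voice else non_voice_mixer
        mixed.append(mixer(frames_t))
    return mixed


def plain_sum(frames: Sequence[Frame]) -> Frame:
    """Placeholder voice-scene mixer (assumption): sample-wise sum."""
    return [sum(s) for s in zip(*frames)]


def average(frames: Sequence[Frame]) -> Frame:
    """Non-voice-scene mixer: sample-wise sum divided by the channel count."""
    return [sum(s) / len(frames) for s in zip(*frames)]


channels = [
    [[0.2, 0.2], [0.6, -0.6]],   # channel 1: noise frame, then speech frame
    [[0.2, 0.0], [0.0, 0.0]],    # channel 2: noise frame, then silence
]
vad = [[0, 1], [0, 0]]
out = mix_by_scene(channels, vad, voice_mixer=plain_sum, non_voice_mixer=average)
print(out)
```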
On the basis of the technical scheme, determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each path of audio signal comprises the following step: determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection results of the audio packets of the audio signals corresponding to the same time interval. In one embodiment, the audio sending end sends the audio signal in the form of audio packets, where an audio packet may consist of audio data with a preset number of frames or a preset time length. When the audio receiving end determines the scene of the mixed audio signal, it determines whether the audio packet of each path of audio signal corresponding to the same time is a voice audio packet or a non-voice audio packet. If the audio packets of all paths are non-voice audio packets, the scene of the mixed audio signal is determined as a non-voice scene; if a voice audio packet exists in any path of audio signal, the scene of the mixed audio signal is determined as a voice scene.
Fig. 2 is a flowchart of another audio signal mixing processing method according to an embodiment of the present invention, which provides a specific method for performing audio mixing processing on audio signals corresponding to a speech scene and a non-speech scene through different audio mixing processing algorithms, and as shown in fig. 2, the method specifically includes:
step S201, receiving each audio signal including the voice endpoint detection result.
Step S202, determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal.
Fig. 3 is a schematic diagram illustrating a scene of a mixed audio signal determined during a multi-channel audio signal mixing process according to an embodiment of the present invention. As shown in fig. 3, there are 3 audio signals, denoted as a first audio signal, a second audio signal and a third audio signal. Each path of audio signal comprises a voice endpoint detection result; the voice segments and non-voice segments of each path are exemplarily marked based on that result, and the scene finally determined for the mixed audio signal is marked as well.
Step S203, in the process of audio signal mixing processing, performing audio mixing processing on each path of audio signal corresponding to the voice scene through a signal amplitude maintaining algorithm, and performing audio mixing processing on each path of audio signal corresponding to the non-voice scene through a multi-path signal summation and averaging algorithm.
Illustratively, for the speech scene of the mixed audio signal labeled in fig. 3, optionally, the mixing process is performed by a signal amplitude preservation algorithm; for the non-speech scene of the mixed audio signal labeled in fig. 3, the audio signals of each channel corresponding to the non-speech scene are subjected to audio mixing processing by a multi-channel signal summation average algorithm.
Specifically, the signal amplitude holding algorithm may be an AGW mixing algorithm. Before the multiple audio signals are mixed, the mixed signal X of all channel signals other than a given audio signal is calculated for each audio signal; the maximum value D of the absolute values of the mixed signal is then calculated, followed by the maximum value K of the absolute values of the sampling points of all channel signals. D/K multiplied by a fixed empirical value mu serves as a weighting factor, and each signal X to be mixed is weighted by this factor to give the final result. This preserves the amplitude of the signal and retains the speech information reasonably and effectively, but it does not attenuate the noise data of non-speech frames. The multi-channel signal summation and averaging algorithm may be an AAW mixing algorithm, which sums the multiple channel signals and takes their average. It is simple to compute and has a suppressing effect on noise data, but simple summation and averaging reduces the amplitude of frames that carry speech, and in the extreme case where the voices of the channels are in opposite phase the voice signals cancel each other, so that the voice data cannot be heard at all.
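The trade-off between the two algorithms can be illustrated with a minimal sketch. `mix_average` follows the summation and averaging description directly. `mix_amplitude_hold` is only one possible reading of the amplitude holding idea: it rescales the summed signal so that its peak D matches the largest single-channel peak K, using a gain of mu * K / D. That gain, the default `mu = 1.0`, and the omission of the per-listener "all channels except one's own" mix are assumptions of this sketch, not the formula stated in the text.

```python
from typing import List, Sequence


def mix_average(channels: Sequence[Sequence[float]]) -> List[float]:
    """Summation and averaging: sample-wise sum divided by the channel count."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def mix_amplitude_hold(channels: Sequence[Sequence[float]],
                       mu: float = 1.0) -> List[float]:
    """Peak-preserving mix (assumed reading, see the paragraph above)."""
    summed = [sum(samples) for samples in zip(*channels)]
    d = max((abs(x) for x in summed), default=0.0)                  # peak of the mix
    k = max((abs(x) for ch in channels for x in ch), default=0.0)   # peak over all channels
    if d == 0.0:
        return summed                # nothing to rescale
    gain = mu * k / d                # assumption: bring the mix peak back to K
    return [gain * x for x in summed]


# One talker among three channels: averaging divides the speech by 3.
one_talker = [[0.8, 0.0, -0.4], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]]
print(max(abs(x) for x in mix_average(one_talker)))
print(max(abs(x) for x in mix_amplitude_hold(one_talker)))

# Extreme case from the text: opposite-phase voices cancel under averaging.
print(mix_average([[0.5, -0.5], [-0.5, 0.5]]))
```

With a single active talker the averaged mix peaks at roughly a third of the original level, while the peak-preserving mix keeps it at 0.8.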
According to the scheme, in the process of audio signal mixing processing, the audio mixing processing is carried out on each path of audio signals corresponding to the voice scene through the signal amplitude maintaining algorithm, the audio mixing processing is carried out on each path of audio signals corresponding to the non-voice scene through the multi-path signal summation averaging algorithm, the noise part in the mixed audio signals is effectively reduced, unnecessary attenuation is not generated on the voice part, the finally obtained mixed audio signals are good in listening experience, the voice part can be clearly heard, and the noise part can be reasonably restrained.
On the basis of the above technical solution, when performing audio mixing processing on the audio signals corresponding to the voice scene and the non-voice scene through different audio mixing processing algorithms, the method further includes: smoothing each audio signal when the mixing processing mode is switched. A switch means that the mixed audio signal changes from voice to non-voice or from non-voice to voice. When such a change occurs, a smoothing technique is applied to the mixed audio signal: for a transition from voice to non-voice, the voice signal part is faded out so that the voice is not cut off suddenly; for a transition from non-voice to voice, the signal is faded in so that the voice does not start abruptly.
On the basis of the technical scheme, switching between different mixing processing algorithms includes determining whether to switch: if the mixed audio signal is detected to change from a voice scene to a non-voice scene, or from a non-voice scene to a voice scene, the switching condition of the mixing processing algorithm is determined to be met.
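A fade-out or fade-in can be sketched as a gain ramp applied across the frame at the transition. The linear ramp shape and the name `apply_fade` are assumptions; the text does not specify the smoothing curve or its length.

```python
from typing import List, Sequence


def apply_fade(samples: Sequence[float],
               start_gain: float,
               end_gain: float) -> List[float]:
    """Scale samples by a gain that moves linearly from start_gain to end_gain."""
    n = len(samples)
    if n == 0:
        return []
    if n == 1:
        return [samples[0] * end_gain]
    step = (end_gain - start_gain) / (n - 1)
    return [s * (start_gain + step * i) for i, s in enumerate(samples)]


frame = [1.0, 1.0, 1.0, 1.0, 1.0]
fade_out = apply_fade(frame, 1.0, 0.0)   # voice -> non-voice: avoid a sudden cut
fade_in = apply_fade(frame, 0.0, 1.0)    # non-voice -> voice: avoid an abrupt onset
print(fade_out)
print(fade_in)
```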
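The switching condition amounts to detecting a change between consecutive scene labels. A minimal sketch, assuming scenes are encoded as 1 (voice) and 0 (non-voice) and using the hypothetical name `find_switches`:

```python
from typing import List, Sequence, Tuple


def find_switches(scenes: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (index, direction) for every position where the scene changes."""
    switches = []
    for t in range(1, len(scenes)):
        if scenes[t] != scenes[t - 1]:
            direction = ("non-voice->voice" if scenes[t] == 1
                         else "voice->non-voice")
            switches.append((t, direction))
    return switches


print(find_switches([0, 0, 0, 1, 1, 0]))
```

Each reported index marks a frame at which the mixing algorithm changes and smoothing would be applied.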
Fig. 4 is a flowchart of another audio signal mixing processing method according to an embodiment of the present invention, and shows a method for selecting a channel signal to be mixed when mixing each channel of audio signal, as shown in fig. 4, the method specifically includes:
step S301, receiving each audio signal including the voice endpoint detection result.
Step S302, determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal.
Step S303, in the audio signal mixing process, determining the channel signal to be mixed according to the voice endpoint detection result of each channel of audio signal.
In one embodiment, when mixing processing is performed on multiple audio signals, not every audio signal is mixed; instead, the channel signals to be mixed are selected. The channel signals to be mixed can be determined according to the voice endpoint detection result of each channel of audio signal.
Specifically, fig. 5 is a flowchart of a method for determining a path signal to be mixed according to a voice endpoint detection result of each path of audio signal according to an embodiment of the present invention, as shown in fig. 5, specifically including:
step S3031, determining a path signal including a voice according to a voice endpoint detection result of each path of audio signal.
Step S3032, determining the path signal containing the voice as the path signal to be mixed.
For example, assuming that 8 channels of audio signals need to be mixed, determining whether each channel of audio signal is a channel signal containing voice during the mixing process, and if 3 channels of audio signals are determined to be channel signals containing voice, determining the 3 channels of channel signals containing voice as channel signals to be mixed.
In another embodiment, if the number of channel signals that can be mixed is limited, the audio signals are screened with that limit as the upper bound on the number of channel signals to be mixed. Optionally, the channel signals containing voice are screened preferentially, and if the number of channel signals containing voice exceeds the upper limit, those with the highest signal strength among them are determined as the channel signals to be mixed.
Step S304, performing sound mixing processing on the audio signals corresponding to the to-be-mixed channel signals corresponding to the voice scene and the non-voice scene through different sound mixing processing algorithms.
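The selection rule can be sketched as follows, using the 8-channel example in which 3 channels contain voice. The name `select_channels`, the `strengths` values and the way signal strength is measured are hypothetical; how channels are chosen when no channel contains voice is not modelled here.

```python
from typing import List, Optional, Sequence


def select_channels(vad_flags: Sequence[int],
                    strengths: Sequence[float],
                    max_channels: Optional[int] = None) -> List[int]:
    """Return the indices of the channels to mix.

    Channels containing voice are selected; if more of them exist than
    max_channels allows, the strongest ones are kept.
    """
    voiced = [i for i, flag in enumerate(vad_flags) if flag]
    if max_channels is not None and len(voiced) > max_channels:
        strongest = sorted(voiced, key=lambda i: strengths[i], reverse=True)
        voiced = sorted(strongest[:max_channels])   # restore channel order
    return voiced


flags = [0, 1, 0, 0, 1, 0, 1, 0]                         # 8 channels, 3 with voice
strengths = [0.1, 0.6, 0.2, 0.1, 0.3, 0.1, 0.9, 0.2]     # made-up strength values
print(select_channels(flags, strengths))                 # no limit
print(select_channels(flags, strengths, max_channels=2)) # limit of 2 channels
```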
Therefore, when the audio signal is subjected to audio mixing processing, the step of selecting the channel signal is included, so that the efficiency and the listening effect of the audio signal audio mixing processing are improved, and the audio mixing processing mechanism is further optimized.
Fig. 6 is a flowchart of another audio signal mixing processing method according to an embodiment of the present invention, where the audio signal mixing processing method may be executed by an audio sending end device, such as a mobile phone, a notebook, a tablet computer, and a conference system device, and specifically includes:
step S401, performing framing processing on the audio signal to be mixed.
The framing processing refers to slicing the non-stationary audio data stream of the audio signal into frames; within each frame the signal can be treated as approximately stationary, which facilitates time-frequency analysis.
Step S402, performing feature extraction on the audio signal subjected to the framing processing, performing voice endpoint detection according to a feature extraction result, and generating a voice endpoint detection result containing identification information.
In one embodiment, performing feature extraction on the framed audio signal includes: extracting normalized Mel spectral features from the framed audio signal. The Mel spectral feature maps frequency onto a scale on which the frequency perception of the human ear is approximately linear, in line with human auditory perception, so that more attention is given to low-frequency signals.
Correspondingly, performing voice endpoint detection according to the feature extraction result includes: inputting the Mel spectral feature extraction result into a pre-trained scene discrimination model, which outputs the voice endpoint detection result. This exploits the strong discriminative ability of neural networks for voice endpoint detection. Optionally, a neural network built from gated dilated convolutions, combined with a deeper recurrent network, predicts whether the current frame contains a speech signal.
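Framing can be sketched as cutting the sample stream into fixed-length, possibly overlapping windows. The frame length, the hop size and the choice to drop a trailing partial frame are assumptions of this sketch; the text does not specify them.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float],
                 frame_len: int,
                 hop: int) -> List[List[float]]:
    """Split samples into frames of frame_len, advancing hop samples each time.

    A trailing partial frame is dropped (assumption).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[start:start + frame_len])
            for start in range(0, last_start + 1, hop)]


signal = list(range(10))
print(frame_signal(signal, frame_len=4, hop=2))
```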
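The emphasis on low frequencies can be seen from the Mel scale itself. The sketch below uses the widely used conversion mel = 2595 * log10(1 + f / 700); whether this exact variant is the one used here is an assumption, and the normalization step and the full filter bank are not shown. Bands that are equally wide in Mel are narrow at low frequencies and wide at high frequencies.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Common Hz-to-Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> List[float]:
    """Band edges in Hz that are equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


edges = mel_band_edges(0.0, 8000.0, 8)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(e) for e in edges])
print([round(w) for w in widths])
```

The band widths grow with frequency, so a fixed number of Mel bands spends more of its resolution on the low-frequency range.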
Step S403, sending the audio signal containing the voice endpoint detection result for audio signal mixing processing.
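Steps S402 and S403 can be illustrated by attaching a per-frame voice flag to each frame before it is sent. In this sketch the pre-trained scene distinguishing model is replaced by a simple energy threshold purely so that the example is runnable; `TaggedFrame`, `energy_detector`, and `tag_frames` are hypothetical names, and the threshold detector is not the neural network described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TaggedFrame:
    samples: List[float]
    is_voice: bool  # the identification information attached at the sender


def energy_detector(threshold: float) -> Callable[[Sequence[float]], bool]:
    """Stand-in detector (assumption): a frame is 'voice' if its mean energy exceeds threshold."""
    def detect(frame: Sequence[float]) -> bool:
        if not frame:
            return False
        energy = sum(x * x for x in frame) / len(frame)
        return energy > threshold
    return detect


def tag_frames(frames: Sequence[Sequence[float]],
               detect: Callable[[Sequence[float]], bool]) -> List[TaggedFrame]:
    """Run the detector on every frame and bundle the result with the samples."""
    return [TaggedFrame(list(frame), detect(frame)) for frame in frames]
```

Any callable that maps a frame (or its features) to a boolean could be passed as `detect`, so a trained model could be substituted without changing the packaging logic.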
Therefore, at the audio signal sending end, the audio signal to be mixed is framed, features are extracted from the framed audio signal, voice endpoint detection is performed according to the feature extraction result, a voice endpoint detection result containing identification information is generated, and the audio signal containing the voice endpoint detection result is sent for audio signal mixing processing. Performing voice endpoint detection at the audio sending end avoids the delayed generation of the mixed audio signal and the large amount of computation that arise when the audio receiving end performs voice endpoint detection on multiple channels of audio signals. In addition, because the sending end sends the audio signal together with the voice endpoint detection result, the overall audio signal mixing mechanism is optimized: the audio receiving end can distinguish scenes conveniently, the complexity of subsequent audio signal processing is reduced, and the user's listening experience is ultimately improved.
Fig. 7 is a block diagram of an audio signal mixing processing apparatus according to an embodiment of the present invention. The apparatus is used to execute part of the audio signal mixing processing method provided in the foregoing embodiments, and has the functional modules and beneficial effects corresponding to that method. As shown in Fig. 7, the apparatus specifically includes an audio signal receiving module 101, a scene judgment module 102, and a mixing processing module 103, wherein:
the audio signal receiving module 101 is configured to receive each channel of audio signals containing a voice endpoint detection result;
the scene judgment module 102 is configured to determine a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal;
and the mixing processing module 103 is configured to mix the audio signals corresponding to the voice scene and the non-voice scene through different mixing processing algorithms during audio signal mixing processing.
According to this scheme, each channel of audio signals containing a voice endpoint detection result is received, and the voice scene and the non-voice scene in the mixed audio signal are determined according to the voice endpoint detection results of the audio signals. During audio signal mixing processing, the audio signals corresponding to the voice scene and to the non-voice scene are mixed through different mixing processing algorithms. This avoids the problem that a single mixing processing algorithm cannot adapt to multiple complex scenes, improves the flexibility of the mixing processing, and can significantly improve the user's listening experience.
In a possible embodiment, the voice endpoint detection result includes a voice identifier, which identifies the voice part and the non-voice part of an audio signal, and the scene judgment module 102 is specifically configured to:
determine a voice scene and a non-voice scene in the mixed audio signal according to the voice identifiers of the audio signals at the same moment.
In a possible embodiment, the scene judgment module 102 is specifically configured to:
determine the part in which the audio signals contain voice at the same moment as the voice scene in the mixed audio signal, and determine the part in which the audio signals contain non-voice at the same moment as the non-voice scene in the mixed audio signal.
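The scene judgment can be sketched as a per-frame comparison of the voice identifiers across channels. The text leaves open whether a voice scene requires voice on every channel or on at least one; the sketch below assumes that a frame is a voice scene if at least one channel is flagged as voice, and a non-voice scene only if all channels are flagged as non-voice, so that every frame receives a label. The function name and the string labels are illustrative.

```python
from typing import List, Sequence


def classify_scenes(voice_flags: Sequence[Sequence[bool]]) -> List[str]:
    """Label each frame index as 'voice' or 'non_voice'.

    voice_flags[c][t] is the voice identifier of channel c at frame t.
    Assumption: any flagged channel makes the frame a voice scene.
    """
    if not voice_flags:
        return []
    # Only frames present on every channel can be compared at the same moment.
    num_frames = min(len(channel) for channel in voice_flags)
    scenes: List[str] = []
    for t in range(num_frames):
        has_voice = any(channel[t] for channel in voice_flags)
        scenes.append("voice" if has_voice else "non_voice")
    return scenes
```

Replacing `any` with `all` would give the stricter reading in which every channel must contain voice at that moment.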
In a possible embodiment, the mixing processing module 103 is specifically configured to:
mix each channel of audio signals corresponding to the voice scene through a signal amplitude holding algorithm, and mix each channel of audio signals corresponding to the non-voice scene through a multi-channel signal summation-and-averaging algorithm.
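The two mixing branches can be sketched as follows. Summation-and-averaging is implemented directly. The signal amplitude holding algorithm is not detailed in this passage, so the sketch substitutes a plain sum clamped to a full-scale limit, which keeps each talker at its original level instead of dividing it by the channel count; this substitution, the `limit` parameter, and all function names are assumptions.

```python
from typing import List, Sequence


def mix_sum_average(frames: Sequence[Sequence[float]]) -> List[float]:
    """Non-voice scene: sample-wise sum of all channels divided by the channel count."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def mix_amplitude_hold(frames: Sequence[Sequence[float]], limit: float = 1.0) -> List[float]:
    """Voice scene stand-in (assumption): sample-wise sum, clamped to [-limit, limit]."""
    return [max(-limit, min(limit, sum(column))) for column in zip(*frames)]


def mix_frame(frames: Sequence[Sequence[float]], scene: str) -> List[float]:
    """Dispatch to the mixing algorithm that matches the scene label."""
    if scene == "voice":
        return mix_amplitude_hold(frames)
    return mix_sum_average(frames)
```

Averaging keeps background noise from building up when no one is speaking, while the voice branch avoids attenuating a single active talker just because many channels are connected.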
In a possible embodiment, the mixing processing module 103 is specifically configured to:
perform smoothing processing on each audio signal when the mixing processing mode is switched.
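One possible form of smoothing at a mode switch is a linear crossfade over the transition frame, shown below. The embodiment does not specify the smoothing method, so the crossfade, the one-frame transition length, and the function name are assumptions.

```python
from typing import List, Sequence


def crossfade(old_mix: Sequence[float], new_mix: Sequence[float]) -> List[float]:
    """Blend the previous algorithm's output into the new algorithm's output.

    The weight of new_mix rises linearly and reaches 1 on the last sample,
    which avoids an abrupt level jump at the switching point.
    """
    n = min(len(old_mix), len(new_mix))
    out: List[float] = []
    for i in range(n):
        w = (i + 1) / n  # weight of the new algorithm's output
        out.append((1.0 - w) * old_mix[i] + w * new_mix[i])
    return out
```

In use, the transition frame would be mixed once with each algorithm and the two results passed to `crossfade`; later frames use only the new algorithm.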
In a possible embodiment, the mixing processing module 103 is specifically configured to:
determine the channel signals to be mixed according to the voice endpoint detection result of each channel of audio signals, and mix the audio signals of the channel signals to be mixed that correspond to the voice scene and the non-voice scene through different mixing processing algorithms.
In a possible embodiment, the mixing processing module 103 is specifically configured to:
determine the channel signals containing voice according to the voice endpoint detection result of each channel of audio signals; and
determine the channel signals containing voice as the channel signals to be mixed.
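Channel selection can be sketched as a filter over the per-channel voice identifiers for the current frame. The function name is hypothetical, and the handling of a frame in which no channel contains voice (here an empty selection, left to the caller) is an assumption, since this passage does not address that case.

```python
from typing import List, Sequence


def select_channels_to_mix(voice_flags_at_frame: Sequence[bool]) -> List[int]:
    """Return the indices of the channels whose current frame is flagged as voice."""
    return [index for index, is_voice in enumerate(voice_flags_at_frame) if is_voice]
```

Mixing only the selected channels reduces the number of signals summed per frame, which is where the efficiency gain of the selection step comes from.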
Fig. 8 is a block diagram of another audio signal mixing processing apparatus according to an embodiment of the present invention. The apparatus is used to execute part of the audio signal mixing processing method provided in the foregoing embodiments, and has the functional modules and beneficial effects corresponding to that method. As shown in Fig. 8, the apparatus specifically includes an audio framing module 201, a voice endpoint detection module 202, and an audio signal sending module 203, wherein:
the audio framing module 201 is configured to perform framing processing on the audio signal to be mixed;
the voice endpoint detection module 202 is configured to perform feature extraction on the framed audio signal, perform voice endpoint detection according to the feature extraction result, and generate a voice endpoint detection result containing identification information;
and the audio signal sending module 203 is configured to send the audio signal containing the voice endpoint detection result for audio signal mixing processing.
According to this scheme, at the audio signal sending end, the audio signal to be mixed is framed, features are extracted from the framed audio signal, voice endpoint detection is performed according to the feature extraction result, a voice endpoint detection result containing identification information is generated, and the audio signal containing the voice endpoint detection result is sent for audio signal mixing processing. Performing voice endpoint detection at the audio sending end avoids the delayed generation of the mixed audio signal and the large amount of computation that arise when the audio receiving end performs voice endpoint detection on multiple channels of audio signals. In addition, because the sending end sends the audio signal together with the voice endpoint detection result, the overall audio signal mixing mechanism is optimized: the audio receiving end can distinguish scenes conveniently, the complexity of subsequent audio signal processing is reduced, and the user's listening experience is ultimately improved.
In a possible embodiment, the voice endpoint detection module 202 is specifically configured to perform normalized Mel spectral feature extraction on the framed audio signal, input the Mel spectral feature extraction result to a pre-trained scene distinguishing model, and output the voice endpoint detection result.
Fig. 9 is a schematic structural diagram of an audio signal mixing processing device according to an embodiment of the present invention. As shown in Fig. 9, the device includes a processor 301, a memory 302, an input device 303, and an output device 304. The device may contain one or more processors 301; one processor 301 is taken as an example in Fig. 9. The processor 301, the memory 302, the input device 303, and the output device 304 may be connected by a bus or by other means; a bus connection is taken as an example in Fig. 9. The memory 302 is a computer-readable storage medium that can store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the audio signal mixing processing method in the embodiments of the present invention. By running the software programs, instructions, and modules stored in the memory 302, the processor 301 executes the various functional applications and data processing of the device, that is, it implements the audio signal mixing processing method described above. The input device 303 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 304 may include a display device such as a display screen.
An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the audio signal mixing processing method described in the foregoing embodiments, the method specifically including:
receiving each path of audio signals containing voice endpoint detection results;
determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal;
and in the process of mixing the audio signals, carrying out sound mixing processing on the audio signals corresponding to the voice scene and the non-voice scene through different sound mixing processing algorithms.
The computer-executable instructions may also be used to perform the other audio signal mixing processing method described in the above embodiments, which specifically includes:
performing framing processing on the audio signals to be mixed;
carrying out feature extraction on the audio signal subjected to framing processing, carrying out voice endpoint detection according to a feature extraction result, and generating a voice endpoint detection result containing identification information;
and sending the audio signal containing the voice endpoint detection result for audio signal mixing processing.
It should be noted that, in the embodiment of the audio signal mixing processing apparatus, the units and modules included in the embodiment are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.
It should be noted that the foregoing is only a preferred embodiment of the present invention and the technical principles applied. Those skilled in the art will appreciate that the embodiments of the present invention are not limited to the specific embodiments described herein, and that various obvious changes, adaptations, and substitutions are possible, without departing from the scope of the embodiments of the present invention. Therefore, although the embodiments of the present invention have been described in more detail through the above embodiments, the embodiments of the present invention are not limited to the above embodiments, and many other equivalent embodiments may be included without departing from the concept of the embodiments of the present invention, and the scope of the embodiments of the present invention is determined by the scope of the appended claims.

Claims (13)

1. An audio signal mixing processing method, comprising:
receiving each path of audio signals containing voice endpoint detection results;
determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal;
and in the process of mixing the audio signals, carrying out sound mixing processing on the audio signals corresponding to the voice scene and the non-voice scene through different sound mixing processing algorithms.
2. The audio signal mixing processing method of claim 1, wherein the voice endpoint detection result comprises a voice identifier, the voice identifier is used to identify a voice portion and a non-voice portion in the audio signal, and the determining the voice scene and the non-voice scene in the mixed audio signal according to the voice endpoint detection result of each audio signal comprises:
and determining a voice scene and a non-voice scene in the mixed audio signal according to the voice identifiers in the audio signals at the same moment.
3. The audio signal mixing processing method according to claim 2, wherein the determining the voice scene and the non-voice scene in the mixed audio signal according to the voice identifiers in the audio signals at the same time comprises:
and determining the part of each audio signal with voice at the same time as the voice scene in the mixed audio signal, and determining the part of each audio signal with non-voice at the same time as the non-voice scene in the mixed audio signal.
4. The audio signal mixing processing method of claim 1, wherein the mixing processing of the audio signals corresponding to the speech scene and the non-speech scene by different mixing processing algorithms comprises:
performing audio mixing processing on each path of audio signal corresponding to the voice scene through a signal amplitude holding algorithm; and performing sound mixing processing on each path of audio signal corresponding to the non-voice scene through a multi-path signal summation average algorithm.
5. The audio signal mixing processing method of claim 1, wherein when performing mixing processing on the audio signals corresponding to the speech scene and the non-speech scene through different mixing processing algorithms, the method further comprises:
and when the sound mixing processing mode is switched, smoothing processing is carried out on each audio signal.
6. The audio signal mixing processing method of claim 1, wherein the mixing processing of the audio signals corresponding to the speech scene and the non-speech scene by different mixing processing algorithms comprises:
and determining a path signal to be mixed according to the voice endpoint detection result of each path of audio signal, and performing audio mixing processing on the audio signals corresponding to the path signal to be mixed corresponding to the voice scene and the non-voice scene through different audio mixing processing algorithms.
7. The audio signal mixing processing method according to claim 6, wherein the determining the path signals to be mixed according to the voice endpoint detection result of each path of audio signals comprises:
determining a channel signal containing voice according to the voice endpoint detection result of each channel of audio signal;
the path signal containing the speech is determined as the path signal to be mixed.
8. An audio signal mixing processing method, comprising:
performing framing processing on the audio signals to be mixed;
carrying out feature extraction on the audio signal subjected to framing processing, carrying out voice endpoint detection according to a feature extraction result, and generating a voice endpoint detection result containing identification information;
and sending the audio signal containing the voice endpoint detection result for audio signal mixing processing.
9. The audio signal mixing processing method according to claim 8, wherein the performing feature extraction on the audio signal after the framing processing comprises:
carrying out normalized Mel spectral feature extraction on the audio signal subjected to framing processing;
the voice endpoint detection according to the feature extraction result comprises the following steps:
and inputting the Mel spectral feature extraction result to a scene distinguishing model obtained by pre-training and outputting a voice endpoint detection result.
10. An audio signal mixing processing apparatus, comprising:
the audio signal receiving module is used for receiving each path of audio signals containing voice endpoint detection results;
the scene judgment module is used for determining a voice scene and a non-voice scene in the mixed audio signal according to the voice endpoint detection result of each path of audio signal;
and the audio mixing processing module is used for carrying out audio mixing processing on each path of audio signals corresponding to the voice scene and the non-voice scene through different audio mixing processing algorithms in the audio signal mixing processing process.
11. An audio signal mixing processing apparatus, comprising:
the audio framing module is used for framing the audio signals to be mixed;
the voice endpoint detection module is used for extracting the characteristics of the audio signal after the framing processing, detecting the voice endpoint according to the characteristic extraction result and generating a voice endpoint detection result containing identification information;
and the audio signal sending module is used for sending the audio signal containing the voice endpoint detection result for audio signal mixing processing.
12. An audio signal mixing processing apparatus, the apparatus comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the audio signal mixing processing method of any one of claims 1-9.
13. A storage medium storing computer-executable instructions for performing the audio signal mixing processing method of any one of claims 1-9 when executed by a computer processor.
CN202210039333.XA 2022-01-13 2022-01-13 Audio signal mixing processing method, device, equipment and storage medium Pending CN114550748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210039333.XA CN114550748A (en) 2022-01-13 2022-01-13 Audio signal mixing processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114550748A true CN114550748A (en) 2022-05-27

Family

ID=81670753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210039333.XA Pending CN114550748A (en) 2022-01-13 2022-01-13 Audio signal mixing processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114550748A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination