CN109389989B - Sound mixing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN109389989B
CN109389989B (application CN201710665368.3A)
Authority
CN
China
Prior art keywords
mixing
audio stream
stream data
data
sound
Prior art date
Legal status
Active
Application number
CN201710665368.3A
Other languages
Chinese (zh)
Other versions
CN109389989A (en)
Inventor
吴威麒
张凯磊
Current Assignee
Suzhou Qianwen Wandaba Education Technology Co., Ltd
Original Assignee
Suzhou Qianwen Wandaba Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Qianwen Wandaba Education Technology Co Ltd filed Critical Suzhou Qianwen Wandaba Education Technology Co Ltd
Priority: CN201710665368.3A
Publication of application CN109389989A
Application granted
Publication of grant CN109389989B

Classifications

    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L21/0202
    • G10L21/0272: Voice signal separating
    • G10L21/0316: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • H04S3/008: Systems employing more than two channels in which the audio signals are in digital form, i.e. employing more than two discrete digital channels

Abstract

The invention discloses a sound mixing method, apparatus, device and storage medium. The method comprises: receiving audio stream data of at least two channels; detecting the type of each channel's audio stream data with a pre-trained human voice detection model to identify voice-channel audio stream data and noise-channel audio stream data; mixing the voice-channel audio stream data to generate voice mixing data; mixing the noise-channel audio stream data to generate noise mixing data; and mixing the voice mixing data with the noise mixing data to generate the resulting mixing data. By distinguishing voice channels from noise channels with the pre-trained detection model, mixing each group separately, and superimposing the two partial mixes, the embodiments keep the noise channels from attenuating the voice channels in the final mix.

Description

Sound mixing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of audio mixing technologies, and in particular, to an audio mixing method, apparatus, device, and storage medium.
Background
In a VoIP conference call several people talk at once, and for any one receiving party to hear all the others, the audio streams of all the other participants must be mixed. Placing the mixing function on the server saves bandwidth and relieves the clients of computation, at the cost of higher server load, which suits calls with many simultaneous participants; placing it on the clients puts no load on the server and suits calls with few participants.
Wherever the mixing is performed, the listener must be able to hear the speakers clearly. The classical prior-art mixing algorithm is linear superposition, as follows:
assume M people are talking and the audio data has length N; the audio stream of the i-th person is denoted x_i(n), where i = 1, ..., M and n = 1, ..., N.
Denoting the mixing result by mix(n), linear mixing computes:

mix(n) = (1/M) * Σ_{i=1}^{M} x_i(n)
The algorithm applies a single linear operation to the audio stream data of all channels; it is simple, effective and introduces no obvious distortion, but as the number of channels M grows large the human voice is noticeably attenuated and the user experience suffers.
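The attenuation effect can be reproduced with a short sketch. This is a minimal illustration, assuming the averaging form of the linear mix (inferred from the statement that the voice weakens as M grows); the sample values are invented:

```python
def linear_mix(channels):
    """Classical prior-art linear mix: average the M channel buffers
    sample by sample, mix(n) = (1/M) * sum_i x_i(n)."""
    m = len(channels)
    length = len(channels[0])
    return [sum(ch[k] for ch in channels) / m for k in range(length)]

# One speaker among seven silent channels: the speaker's amplitude is
# divided by M = 8, so the voice gets quieter as the channel count grows.
voice = [0.8, 0.8, 0.8, 0.8]
silent = [[0.0] * 4 for _ in range(7)]
mixed = linear_mix([voice] + silent)
print(mixed[0])   # 0.1
```

With M = 8 the speaker's 0.8 amplitude collapses to 0.1, which is exactly the weakening the patent sets out to avoid.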
Disclosure of Invention
The invention provides a sound mixing method, apparatus, device and storage medium that effectively solve the problem of reduced volume after multi-channel audio stream data is mixed, highlighting the speaker's voice while reducing the noise volume.
In a first aspect, an embodiment of the present invention provides a sound mixing method, including:
receiving audio stream data of at least two channels;
detecting the types of audio stream data of all sound channels through a pre-trained human voice detection model so as to identify human voice channel audio stream data and noise sound channel audio stream data;
mixing the voice channel audio stream data to generate voice mixing data;
mixing the audio stream data of the noise channel to generate noise mixing data;
and mixing the voice mixing data and the noise mixing data to generate result mixing data.
Further, the detecting the types of the audio stream data of all the channels through the pre-trained human voice detection model includes:
training the human voice detection model with one of the following: a GMM model based on a Gaussian probability density function, a support vector machine (SVM) model, a deep neural network (DNN) model, or a convolutional neural network (CNN) model.
Further, the mixing the audio stream data of the human sound channel to generate human sound mixing data includes:
and carrying out sound mixing on the audio stream data of the human sound channel through linear sound mixing to generate human sound mixing data.
Further, the mixing the audio stream data of the noise channel to generate noise mixing data includes:
and carrying out sound mixing on the noise channel audio stream data through linear sound mixing to generate noise sound mixing data.
Further, before the mixing the audio stream data of the human sound channel and generating the human sound mixing data, the method further includes:
judging whether the audio stream data of the human voice channel is smaller than a preset adjusting amplitude value or not;
if so, normalizing the voice channel audio stream data to a first preset amplitude range, and generating the normalized voice channel audio stream data.
Further, before the mixing the audio stream data of the human sound channel and generating the human sound mixing data, the method further includes:
judging whether the difference between the amplitude of the human sound channel audio stream data and the amplitude of the noise channel audio stream data is within a preset amplitude difference range;
if so, normalizing the voice channel audio stream data to a second preset amplitude range to update the voice channel audio stream data; normalizing the noise channel audio stream data to a third preset amplitude range to update the noise channel audio stream, wherein the second preset amplitude range is larger than the third preset amplitude range.
Further, the mixing the audio stream data of the human sound channel to generate human sound mixing data includes:
obtaining the amplitude or the waveform envelope intensity of the energy of the audio stream data of each human voice channel;
distributing the mixing weight of each voice-channel audio stream according to the proportion of its waveform envelope intensity in the sum of the waveform envelope intensities of all voice-channel audio streams, the mixing weight being positively correlated with that proportion within a preset range;
and performing sound mixing according to the audio stream data of each human sound channel and the sound mixing weight corresponding to the audio stream to generate the human sound mixing data after sound mixing.
In a second aspect, an embodiment of the present invention further provides a sound mixing apparatus, including:
the audio stream receiving module is used for receiving audio stream data of at least two channels;
the voice detection module is used for detecting the types of audio stream data of all sound channels through a pre-trained voice detection model so as to identify the voice channel audio stream data and the noise channel audio stream data;
the voice sound mixing data generation module is used for mixing the voice channel audio stream data;
the noise mixing data generation module is used for mixing the noise channel audio stream data;
and the result mixed sound data generation module is used for mixing the human sound mixed sound data and the noise mixed sound data to generate result mixed sound data.
In a third aspect, an embodiment of the present invention further provides a mixing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the mixing method according to the first aspect when executing the program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the mixing method according to the first aspect.
The embodiments of the invention provide a sound mixing method, apparatus, device and storage medium. A pre-trained human voice detection model distinguishes voice-channel audio stream data from noise-channel audio stream data; the two groups are then mixed separately to generate voice mixing data and noise mixing data, which are finally mixed together to generate the resulting mixing data. Compared with the prior art, in which the audio streams of all channels are linearly mixed at once, no noise channel participates in the voice mix, so the amplitude of the voice mixing data depends only on the voice channels; likewise, no voice channel participates in the noise mix, so the amplitude of the noise mixing data depends only on the noise channels. The voice amplitude in the resulting mix is therefore far greater than the noise amplitude, the mixed voice is clearer, and practicality and user experience are improved.
Drawings
FIG. 1 is a flowchart illustrating a mixing method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a mixing method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a mixing method according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a mixing apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a mixing apparatus according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of the mixing method provided in the first embodiment. The method applies to online-voice scenarios such as multi-party teleconferences or network conferences and can be executed by software/hardware deployed in either a server or a client. As shown in Fig. 1, the method includes:
s102, receiving audio stream data of at least two channels.
And receiving audio stream data of all the channels in the working state in the telephone conference or the network conference.
S104, detecting the types of audio stream data of all the channels through a pre-trained voice detection model so as to identify the voice channel audio stream data and the noise channel audio stream data.
In this embodiment the pre-trained human voice detection model is preferably, but not limited to, a GMM model based on a Gaussian probability density function, a support vector machine (SVM) model, a deep neural network (DNN) model or a convolutional neural network (CNN) model. In practice, one of these algorithms is selected according to the usage scenario or device parameters to train the detection model, which then identifies the voice-channel and noise-channel audio stream data.
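The patent does not fix a specific detector, so as a stand-in for the trained model here is a toy heuristic sketch: voiced speech tends to combine high short-time energy with a low zero-crossing rate, while broadband noise shows the opposite. Both thresholds are invented for illustration and are not from the patent:

```python
import math

def classify_channel(frame, energy_thresh=0.01, zcr_thresh=0.3):
    """Toy stand-in for the pre-trained voice detection model: label a frame
    "voice" when its mean energy is high and its zero-crossing rate is low.
    The thresholds are illustrative, not tuned values."""
    n = len(frame)
    energy = sum(s * s for s in frame) / n
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (n - 1)
    return "voice" if energy > energy_thresh and zcr < zcr_thresh else "noise"

# A 120 Hz tone at 8 kHz (speech-like pitch) vs. tiny rapidly alternating noise.
voiced = [0.5 * math.sin(2 * math.pi * 120 * t / 8000) for t in range(160)]
print(classify_channel(voiced))                  # "voice"
print(classify_channel([0.001, -0.001] * 80))    # "noise"
```

A real deployment would replace this heuristic with the trained GMM/SVM/DNN/CNN the embodiment describes; the interface (per-channel frame in, label out) stays the same.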
And S106, mixing the audio stream data of the human sound channel to generate human sound mixing data.
The voice-channel audio stream data are mixed; the specific mixing method can be chosen to suit the actual usage scenario and device parameters, for example linear mixing:

mix_voice(n) = (1/K) * Σ_{i=1}^{K} x_i(n)

where x_i(n) is the i-th voice-channel audio stream at the current time, K is the number of voice channels, and mix_voice(n) is the voice mixing data at the current time.
And S108, mixing the audio stream data of the noise channel to generate noise mixing data.
The noise-channel audio stream data are mixed; again the method can be chosen to suit the usage scenario and device parameters, for example linear mixing:

mix_noise(n) = (1/K) * Σ_{i=1}^{K} x_i(n)

where x_i(n) is the i-th noise-channel audio stream at the current time, K is the number of noise channels, and mix_noise(n) is the noise mixing data at the current time.
In the embodiment, the human sound channel audio stream data and the noise channel audio stream data can be mixed by the same mixing method, and can also be mixed by different mixing methods; in actual use, the mixing method can be selected according to specific use scenes, the number of channels or equipment parameters.
It should be noted that the order in which the voice mixing data and the noise mixing data are obtained above is only an example; this embodiment does not restrict the order, and the two may be obtained simultaneously, or the noise mixing data first, as needed.
And S110, mixing the human voice mixing data and the noise mixing data to generate result mixing data.
In this embodiment, it is preferable that the human voice mixing data and the noise mixing data are directly superimposed and mixed to generate resultant mixing data.
mix(n)=mix_voice(n)+mix_noise(n)
Where mix (n) is the resulting mixed sound data at the current time.
For example, suppose n channels are active in a conference call. The audio stream data of the n active channels are obtained; a GMM model based on a Gaussian probability density function detects which channels currently carry voice and which carry noise; the voice channels are linearly mixed to obtain the voice mixing data, and the noise channels are linearly mixed to obtain the noise mixing data; finally the two are superimposed to obtain the resulting mixing data. Because the voice channels and the noise channels are mixed separately, they do not affect each other: the noise amplitudes cannot drag down the voice amplitudes, the voice mixing data retains a high amplitude, and the voice in the final mix keeps a high volume and sound quality.
In summary, this embodiment uses a pre-trained human voice detection model to distinguish voice-channel audio stream data from noise-channel audio stream data, mixes the two groups separately to generate voice mixing data and noise mixing data, and then mixes those to generate the resulting mixing data. Compared with linearly mixing all channels at once, no noise channel participates in the voice mix, so the amplitude of the voice mixing data depends only on the voice channels; likewise, no voice channel participates in the noise mix, so the amplitude of the noise mixing data depends only on the noise channels. The voice amplitude in the resulting mix is therefore far greater than the noise amplitude, the mixed voice is clearer, and practicality and user experience are improved.
Example two
Fig. 2 is a flowchart of the mixing method according to the second embodiment. As shown in Fig. 2, compared with the previous embodiment, the method further includes, before mixing the voice-channel audio stream data to generate the voice mixing data:
s1051, judging whether the audio stream data of the vocal tract is smaller than a preset adjusting amplitude value.
And S1052, if so, normalizing the voice channel audio stream data to a first preset amplitude range to update the voice channel audio stream data.
Within a preset time interval, when the maximum amplitude of the voice-channel audio stream data falls below the preset adjustment value, the data are normalized into the first preset amplitude range, raising their amplitude before mixing. This enlarges the amplitude gap between the voice-channel and noise-channel audio streams, and hence between the voice mixing data and the noise mixing data, so the voice in the resulting mix ends up with a higher amplitude, giving it greater volume and better sound quality and improving user experience.
The first preset amplitude range in this embodiment is related to a specific use environment, and may be set according to the quality of the device and a specific use scenario, such as whether the environment where the conference participant is located is quiet.
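The normalization step S1051/S1052 can be sketched as a peak check followed by a rescale. The adjustment threshold and the target peak of the first preset amplitude range are invented placeholders, since the patent leaves them scenario-dependent:

```python
def normalize_quiet_voice(frame, adjust_thresh=0.2, target_peak=0.8):
    """Sketch of S1051/S1052: if the voice channel's peak amplitude over the
    interval is below the preset adjustment value, scale the frame so its
    peak reaches the first preset amplitude range. Both constants are
    illustrative, not values from the patent."""
    peak = max(abs(s) for s in frame)
    if peak == 0.0 or peak >= adjust_thresh:
        return frame                    # already loud enough: leave untouched
    gain = target_peak / peak
    return [s * gain for s in frame]

quiet = [0.05, -0.1, 0.08]
boosted = normalize_quiet_voice(quiet)
print(max(abs(s) for s in boosted))     # 0.8
```

Frames already above the threshold pass through unchanged, so the adjustment only ever raises quiet voices and never clips loud ones.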
Preferably, this embodiment may mix the voice-channel and noise-channel audio stream data with the same mixing method or with different ones. The specific method can be chosen to suit the actual usage scenario and device parameters: linear mixing, mixing with weights assigned by the amplitude or energy of the voice-channel audio streams, or mixing with weights assigned by the envelope intensity of their amplitude or energy.
When weights are assigned by amplitude or energy, the mixing weight of each voice channel is preferably positively correlated with the proportion of that channel's amplitude or energy in the sum over all channels; mixing then proceeds with each voice channel's data and its corresponding weight to generate the voice mixing data. When weights are assigned by the envelope intensity of the amplitude or energy, the weight of each voice channel is preferably proportional to that channel's envelope intensity as a share of the sum of envelope intensities over all channels, and the voice mixing data are generated in the same way.
Preferably, when there is only one voice channel, the noise channels are linearly mixed and the single voice stream is then superimposed on the noise mix. Unlike prior-art linear mixing of all voice and noise channels together, which lowers the voice amplitude in the result, the voice stream here is only combined with the noise mixing data, so it retains a high amplitude and the voice in the resulting mix keeps a high volume and sound quality.
When there are at least two voice channels, this embodiment preferably mixes the voice channels with weights based on the envelope intensity of their amplitude or energy, and the noise channels with linear mixing. The steps are: first obtain the waveform envelope intensity of the amplitude or energy of each voice-channel audio stream; then assign each voice channel a mixing weight according to the proportion of its envelope intensity in the sum of the envelope intensities of all voice channels, the weight being positively correlated with that proportion within a preset range; mix the voice channels with their corresponding weights to generate the voice mixing data; mix the noise channels linearly to generate the noise mixing data; and finally superimpose the two to generate the resulting mixing data.
The envelope intensity reflects the trend of the audio stream, so assigning mixing weights from the envelope intensities makes the mixed voice follow the natural rise and fall of speech and avoids sudden drops in the voice after mixing. In addition, linear mixing of the noise channels is computationally fast, so this embodiment both computes quickly and improves the volume and quality of the mixed voice.
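The envelope-weighted voice mix above can be sketched as follows. The mean-|x| envelope estimator is a deliberate simplification of the waveform envelope the text describes, and the channel values are invented:

```python
def envelope_weighted_mix(voice_channels):
    """Envelope-weighted mixing sketch: estimate each voice channel's waveform
    envelope intensity with a mean-|x| proxy, give each channel a weight equal
    to its share of the summed intensities, and mix with those weights."""
    length = len(voice_channels[0])
    strengths = [sum(abs(s) for s in ch) / len(ch) for ch in voice_channels]
    total = sum(strengths) or 1.0          # avoid division by zero on silence
    weights = [st / total for st in strengths]   # weight proportional to share
    return [sum(w * ch[k] for w, ch in zip(weights, voice_channels))
            for k in range(length)]

loud = [0.6, 0.6, 0.6, 0.6]
soft = [0.2, 0.2, 0.2, 0.2]
print(envelope_weighted_mix([loud, soft])[0])   # about 0.5: the louder voice dominates
```

With shares of 0.75 and 0.25 the active speaker dominates the mix instead of being averaged down, matching the stated goal of following the rise and fall of the voice.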
EXAMPLE III
Fig. 3 is a flowchart of the mixing method according to the third embodiment. As shown in Fig. 3, to further improve the voice in the resulting mixing data, this embodiment preferably also includes, before mixing the voice-channel audio stream data to generate the voice mixing data:
s1053, judging whether the difference between the amplitude of the audio stream data of the human voice channel and the amplitude of the audio stream data of the noise channel is within the preset amplitude difference range.
S1054, if yes, normalizing the voice channel audio stream data to a second preset amplitude range to update the voice channel audio stream data; and normalizing the noise channel audio stream data to a third preset amplitude range to update the noise channel audio stream data, wherein the second preset amplitude range is larger than the third preset amplitude range.
In this embodiment, after the pre-trained voice detection model has separated the voice-channel and noise-channel audio streams, the amplitude difference between them is monitored in real time. When that difference falls within the preset amplitude-difference range, the voice-channel data are normalized into a second preset amplitude range and the noise-channel data into a third preset amplitude range, the second range being larger than the third. This widens the amplitude gap between the voice and noise channels, and hence between the voice mixing data and the noise mixing data, ultimately raising the volume and sound quality of the voice in the resulting mix and improving user experience. In this embodiment the second preset amplitude range is preferably -0.8 to 0.8 and the third preset amplitude range is preferably -0.5 to 0.5.
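Steps S1053/S1054 can be sketched with the preferred ranges stated above. The 0.1 amplitude-difference threshold is an invented illustration; only the -0.8..0.8 and -0.5..0.5 ranges come from the text:

```python
def separate_amplitude_ranges(voice, noise, diff_range=0.1,
                              voice_peak=0.8, noise_peak=0.5):
    """Sketch of S1053/S1054: when the voice and noise peak amplitudes are
    within the preset difference, renormalize voice into [-0.8, 0.8] and
    noise into [-0.5, 0.5] so the voice clearly dominates after mixing.
    diff_range is an illustrative placeholder."""
    def peak(ch):
        return max(abs(s) for s in ch)

    def rescale(ch, target):
        p = peak(ch)
        return ch if p == 0 else [s * target / p for s in ch]

    if abs(peak(voice) - peak(noise)) <= diff_range:
        voice = rescale(voice, voice_peak)
        noise = rescale(noise, noise_peak)
    return voice, noise

v, w = separate_amplitude_ranges([0.4, -0.4], [0.35, 0.3])
print(max(abs(s) for s in v), max(abs(s) for s in w))   # peaks pushed to 0.8 and 0.5
```

When the peaks already differ by more than the threshold, both streams pass through untouched, so the adjustment only fires in the ambiguous case the embodiment targets.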
In this embodiment, the human channel audio stream data and the noise channel audio stream data may be mixed by the same mixing method, or the human channel audio stream data and the noise channel audio stream data may be mixed by different mixing methods, and may be selected according to a use scene and device parameters in actual use, which is specifically referred to as embodiment two, and this embodiment is not described herein again.
To improve the quality of the audio streams and of the sound after mixing, it is preferable to preprocess all audio streams, for example with noise reduction, before the voice and noise mixing data are computed. Applying a Gaussian or wavelet denoising algorithm to each stream increases the difference between the voice and noise audio streams, hence the difference between the amplitudes of the voice and noise mixing data, and ultimately the gap between voice and noise amplitudes in the resulting mix, helping the voice stand out.
The noise-reduction step may be placed either before or after detection by the pre-trained voice detection model. When placed after detection, the voice-channel and noise-channel streams can be denoised with different, targeted methods; such targeted processing widens the amplitude gap between the voice and noise streams, hence between the voice and noise mixing data, and ultimately between the voice and noise amplitudes in the resulting mix, again helping the voice stand out.
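As a stand-in for the Gaussian or wavelet denoising mentioned above, here is a plain moving-average smoother: it damps sample-to-sample jitter while preserving the slower envelope of speech. Both the technique (a simplification, not the patent's algorithms) and the window size are illustrative:

```python
def moving_average_denoise(frame, window=3):
    """Illustrative stand-in for the denoising preprocessing: a centered
    moving average over `window` samples, with shrunken windows at the
    frame edges so the output has the same length as the input."""
    half = window // 2
    out = []
    for i in range(len(frame)):
        lo, hi = max(0, i - half), min(len(frame), i + half + 1)
        out.append(sum(frame[lo:hi]) / (hi - lo))
    return out

noisy = [0.5, 0.52, 0.48, 0.51, 0.49]
print(moving_average_denoise(noisy))   # jitter around 0.5 is flattened
```

A production system would use the Gaussian or wavelet methods the text names; the point here is only the placement of the step in the pipeline.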
Example four
Fig. 4 is a schematic structural diagram of a mixing apparatus in a fourth embodiment of the present invention. The mixing apparatus is suitable for mixing processing in a teleconference or network conference, may be implemented in software and/or hardware, and may be deployed in a server or a client application. As shown in fig. 4, the mixing apparatus provided in this embodiment includes: an audio stream receiving module 101, a human voice detection module 102, a human voice mixing data generating module 104, a noise mixing data generating module 105, and a resulting mixing data generating module 106. The audio stream receiving module 101 is configured to receive audio stream data of at least two channels; the human voice detection module 102 is configured to detect the types of the audio stream data of all channels through a pre-trained human voice detection model to identify human voice channel audio stream data and noise channel audio stream data; the human voice mixing data generating module 104 is configured to mix the human voice channel audio stream data; the noise mixing data generating module 105 is configured to mix the noise channel audio stream data; and the resulting mixing data generating module 106 is configured to mix the human voice mixing data with the noise mixing data.
This embodiment further includes an adjusting module 103, configured to judge whether the amplitude of the human voice channel audio stream data is smaller than a preset adjusting amplitude and, if so, to normalize the human voice channel audio stream data to a first preset amplitude range, so that the normalized human voice channel audio stream data is mixed to generate the human voice mixing data. Raising the amplitude of the human voice audio streams before mixing increases the difference between the human voice channel audio stream data and the noise channel audio stream data, which in turn increases the amplitude difference between the human voice mixing data and the noise mixing data, and finally raises the human voice volume in the resulting mixing data, so that the mixed human voice has higher volume and better sound quality, improving the user experience.
The adjusting module 103 in this embodiment is further configured to judge whether the difference between the amplitude of the human voice channel audio stream data and the amplitude of the noise channel audio stream data is within a preset amplitude difference range and, if so, to normalize the human voice channel audio stream data to a second preset amplitude range to update the human voice channel audio stream data, and to normalize the noise channel audio stream data to a third preset amplitude range to update the noise channel audio stream data, the second preset amplitude range being larger than the third preset amplitude range. This increases the amplitude difference between the human voice channel audio stream data and the noise channel audio stream data, which in turn increases the amplitude difference between the human voice mixing data and the noise mixing data, and finally improves the volume and sound quality of the human voice in the resulting mixing data as well as the user experience.
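A minimal sketch of the amplitude adjustment performed by the adjusting module 103, under the assumption that each preset amplitude range is represented by a target peak value; the concrete values 0.9, 0.3, and the difference threshold 0.5 below are illustrative, not prescribed by the patent:

```python
import numpy as np

def normalize_to_peak(samples: np.ndarray, target_peak: float) -> np.ndarray:
    """Scale a channel so its peak amplitude equals target_peak (illustrative)."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples  # silent channel: nothing to scale
    return samples * (target_peak / peak)

# Assumed preset values, for illustration only.
SECOND_RANGE_PEAK = 0.9   # human voice channels: pushed toward full scale
THIRD_RANGE_PEAK = 0.3    # noise channels: kept noticeably quieter

def adjust(voice: np.ndarray, noise: np.ndarray, preset_diff: float = 0.5):
    """If the voice and noise peaks are too close, push them apart."""
    if abs(np.max(np.abs(voice)) - np.max(np.abs(noise))) < preset_diff:
        voice = normalize_to_peak(voice, SECOND_RANGE_PEAK)
        noise = normalize_to_peak(noise, THIRD_RANGE_PEAK)
    return voice, noise
```

Because the second range sits above the third, the adjusted voice channel always ends up louder than the adjusted noise channel before the two groups are mixed.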
The mixing apparatus in this embodiment preferably further includes a mixing output device, configured to output the mixed audio. The mixed audio may be output through the original channel to a receiving end of a client, where the receiving end may be an earphone or a headset; alternatively, the mixed audio may be played through an external playback device, in which case the mixing output device may be a speaker.
To sum up, the mixing method and apparatus provided in the embodiments of the present invention distinguish human voice channel audio stream data from noise channel audio stream data through a pre-trained human voice detection model, mix each group separately according to the classification result to generate human voice mixing data and noise mixing data, and then mix the human voice mixing data with the noise mixing data to generate the resulting mixing data. Compared with the prior art, in which the audio stream data of all channels are mixed linearly at the same time, no noise channel audio stream data participates in generating the human voice mixing data, so the amplitude of the human voice in the resulting mixing data is far greater than that of the noise. This improves the mixing effect, makes the mixed human voice clearer, and offers better practicability and user experience.
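Claim 5 assigns each human voice channel a mixing weight proportional to the share of its waveform envelope intensity in the total. A minimal sketch of that weighting, using the mean absolute amplitude as a stand-in envelope estimator (an assumption, since the patent does not fix how the envelope intensity is computed):

```python
import numpy as np

def envelope_intensity(samples: np.ndarray) -> float:
    """Stand-in envelope intensity: mean absolute amplitude of the frame."""
    return float(np.mean(np.abs(samples)))

def weighted_voice_mix(channels: list) -> np.ndarray:
    """Mix human voice channels with weights proportional to envelope intensity,
    as described in claim 5; louder speakers contribute proportionally more."""
    intensities = np.array([envelope_intensity(c) for c in channels])
    weights = intensities / intensities.sum()  # each channel's share of the total
    return np.sum([w * c for w, c in zip(weights, channels)], axis=0)
```

With two channels of intensities 1 and 3, the weights become 0.25 and 0.75, so the stronger speaker dominates the human voice mixing data without being clipped by plain summation.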
The mixing apparatus can execute the mixing method provided by any embodiment of the present invention and has the corresponding functional modules and beneficial effects. For brevity, for details not covered in the apparatus embodiments, refer to the corresponding content in the preceding method embodiments.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a mixing device according to a fifth embodiment. As shown in fig. 5, the mixing device in this embodiment includes a processor 201, a memory 202, an input device 203, and an output device 204. The number of processors 201 in the device may be one or more; one processor 201 is taken as an example in fig. 5. The processor 201, the memory 202, the input device 203, and the output device 204 may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 5.
The memory 202, as a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the mixing method in the embodiments of the present invention (e.g., the audio stream receiving module 101, the human voice detection module 102, the human voice mixing data generating module 104, the noise mixing data generating module 105, and the resulting mixing data generating module 106). The processor 201 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 202, that is, implements the mixing method described above.
The memory 202 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 202 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 202 may further include memory located remotely from the processor 201, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 203 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the apparatus.
The output device 204 may include a display device, such as the display screen of a user terminal.
EXAMPLE six
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the mixing method of any of the above method embodiments, the method comprising:
receiving audio stream data of at least two channels;
detecting the types of the audio stream data of all channels through a pre-trained human voice detection model to identify human voice channel audio stream data and noise channel audio stream data;
mixing the voice channel audio stream data to generate voice mixing data;
mixing the audio stream data of the noise channel to generate noise mixing data;
and mixing the voice mixing data and the noise mixing data to generate result mixing data.
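The five steps above can be sketched end to end as follows. The energy-threshold classifier here is only a placeholder for the pre-trained human voice detection model (GMM, SVM, DNN, or CNN per claim 2), and sample-wise averaging stands in for linear mixing; both are assumptions for illustration:

```python
import numpy as np

def is_voice(channel: np.ndarray, threshold: float = 0.1) -> bool:
    """Placeholder classifier: the patent uses a trained model (GMM/SVM/DNN/CNN)."""
    return float(np.mean(np.abs(channel))) > threshold

def linear_mix(channels: list) -> np.ndarray:
    """Linear mixing: average the channels sample by sample."""
    return np.mean(channels, axis=0)

def mix(channels: list) -> np.ndarray:
    """Classify channels, mix each group separately, then mix the two results."""
    voice = [c for c in channels if is_voice(c)]
    noise = [c for c in channels if not is_voice(c)]
    parts = []
    if voice:
        parts.append(linear_mix(voice))   # human voice mixing data
    if noise:
        parts.append(linear_mix(noise))   # noise mixing data
    return linear_mix(parts)              # resulting mixing data
```

Because the noise channels are averaged among themselves rather than into the voice sum, the voice contribution to the result is not diluted by every noisy channel, which is the core effect the method claims.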
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the above-described method operations, and may also perform related operations in the mixing method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus the necessary general-purpose hardware, or entirely by hardware, the former being preferable in most cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the mixing method according to the embodiments of the present invention.
It should be noted that, in the above apparatus embodiment, the included units and modules are divided only according to functional logic, and the division is not limited thereto as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and do not limit the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (8)

1. A mixing method, comprising:
receiving audio stream data of at least two channels;
detecting the types of the audio stream data of all channels through a pre-trained human voice detection model to identify human voice channel audio stream data and noise channel audio stream data;
mixing the voice channel audio stream data to generate voice mixing data;
mixing the audio stream data of the noise channel to generate noise mixing data;
mixing the voice mixing data and the noise mixing data to generate result mixing data;
before the mixing of the human voice channel audio stream data to generate the human voice mixing data, the method further comprises:
judging whether the amplitude of the audio stream data of the human sound channel is smaller than a preset adjusting amplitude or not;
if so, normalizing the amplitude of the voice channel audio stream data to a first preset amplitude range to update the voice channel audio stream data;
or, judging whether the difference between the amplitude of the human sound channel audio stream data and the amplitude of the noise channel audio stream data is within a preset amplitude difference range;
if so, normalizing the voice channel audio stream data to a second preset amplitude range to update the voice channel audio stream data; normalizing the noise channel audio stream data to a third preset amplitude range to update the noise channel audio stream data, wherein the second preset amplitude range is larger than the third preset amplitude range.
2. The method of claim 1, wherein the detecting the types of the audio stream data of all the channels through the pre-trained human voice detection model comprises:
training the human voice detection model with one of: a GMM model based on a Gaussian probability density function, an SVM model based on a support vector machine, a DNN model based on a deep neural network, or a CNN model based on a convolutional neural network.
3. The method of claim 1, wherein the mixing the human channel audio stream data to generate human mixing data comprises:
and carrying out sound mixing on the audio stream data of the human sound channel through linear sound mixing to generate human sound mixing data.
4. The method of claim 1, wherein the mixing the noise channel audio stream data to generate noise mixing data comprises:
and carrying out sound mixing on the noise channel audio stream data through linear sound mixing to generate noise sound mixing data.
5. The method of claim 1, wherein the mixing the human channel audio stream data to generate human mixing data comprises:
obtaining the waveform envelope intensity of the amplitude or energy of each human voice channel audio stream data;
assigning a mixing weight to each human voice channel audio stream data according to the proportion of its waveform envelope intensity in the sum of the waveform envelope intensities of all the human voice channel audio stream data, wherein the mixing weight is positively correlated with the proportion within a preset range;
and carrying out sound mixing according to the audio stream data of each human sound channel and the sound mixing weight corresponding to the audio stream data to generate human sound mixing data.
6. An audio mixing apparatus, comprising:
the audio stream receiving module is used for receiving audio stream data of at least two channels;
the voice detection module is used for detecting the types of audio stream data of all sound channels through a pre-trained voice detection model so as to identify the voice channel audio stream data and the noise channel audio stream data;
the voice sound mixing data generation module is used for mixing the voice channel audio stream data;
the noise mixing data generation module is used for mixing the noise channel audio stream data;
a result sound mixing data generation module, configured to mix the human voice sound mixing data with the noise sound mixing data to generate result sound mixing data;
an adjusting module, used for judging whether the amplitude of the human voice channel audio stream data is smaller than a preset adjusting amplitude;
if so, normalizing the human voice channel audio stream data to a first preset amplitude range to generate normalized human voice channel audio stream data;
or, an amplitude difference judging module, used for judging whether the difference between the amplitude of the human voice channel audio stream data and the amplitude of the noise channel audio stream data is within a preset amplitude difference range; if so, normalizing the human voice channel audio stream data to a second preset amplitude range to update the human voice channel audio stream data, and normalizing the noise channel audio stream data to a third preset amplitude range to update the noise channel audio stream data, wherein the second preset amplitude range is larger than the third preset amplitude range.
7. A mixing device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the mixing method according to any one of claims 1 to 5.
8. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a mixing method according to any one of claims 1 to 5.
CN201710665368.3A 2017-08-07 2017-08-07 Sound mixing method, device, equipment and storage medium Active CN109389989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710665368.3A CN109389989B (en) 2017-08-07 2017-08-07 Sound mixing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109389989A CN109389989A (en) 2019-02-26
CN109389989B true CN109389989B (en) 2021-11-30

Family

ID=65413698

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110671794A (en) * 2019-05-08 2020-01-10 青岛海尔空调器有限总公司 Method and equipment for controlling air outlet of air outlet device and air conditioner indoor unit
CN110677716B (en) * 2019-08-20 2022-02-01 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110808060A (en) * 2019-10-15 2020-02-18 广州国音智能科技有限公司 Audio processing method, device, equipment and computer readable storage medium
CN111405122B (en) * 2020-03-18 2021-09-24 苏州科达科技股份有限公司 Audio call testing method, device and storage medium
CN111491176B (en) * 2020-04-27 2022-10-14 百度在线网络技术(北京)有限公司 Video processing method, device, equipment and storage medium
US20210344798A1 (en) * 2020-05-01 2021-11-04 Walla Technologies Llc Insurance information systems
CN112599150A (en) * 2020-12-14 2021-04-02 广州智讯通信系统有限公司 Audio mixing method, device and storage medium based on command scheduling system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7149320B2 (en) * 2003-09-23 2006-12-12 Mcmaster University Binaural adaptive hearing aid
CN101546556A (en) * 2008-03-28 2009-09-30 展讯通信(上海)有限公司 Classification system for identifying audio content
CN101656072A (en) * 2009-09-08 2010-02-24 北京飞利信科技股份有限公司 Mixer, mixing method and session system using the mixer
CN101740038A (en) * 2008-11-04 2010-06-16 索尼株式会社 Sound processing apparatus, sound processing method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200820

Address after: No.259 Nanjing West Road, Tangqiao town, Zhangjiagang City, Suzhou City, Jiangsu Province

Applicant after: Suzhou Qianwen wandaba Education Technology Co., Ltd

Address before: Yangpu District State Road 200433 Shanghai City No. 200 Building 5 room 2002

Applicant before: SHANGHAI QIANWENWANDABA CLOUD TECH. Co.,Ltd.

GR01 Patent grant