CN109510905B - Multi-channel voice mixing method and system - Google Patents

Multi-channel voice mixing method and system

Publication number: CN109510905B (granted publication of CN109510905A)
Application number: CN201811488996.XA (China)
Inventor: 韩冰
Assignee: Icsoc Beijing Communication Technology Co., Ltd.
Legal status: Active
Classification: H04M 3/568 — arrangements for connecting several subscribers to a common circuit (conference facilities); audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
Abstract

The invention provides a multi-channel voice mixing method and a corresponding system. First, according to the persistence and voice energy of each audio data stream, the voice channels carrying valid audio data are screened by form and set as active channels; then, according to how each active channel's audio trend is staggered with those of the other active channels, the active channels providing valuable voice data are further screened by content and selected as the voice channels participating in the current mix; finally, a mixing operation is performed on the audio data streams of the selected voice channels to obtain the mixed data. This two-stage screening improves the accuracy with which the participating channels are chosen: the selected channels provide valuable voice data for the conference and no valuable voice data is omitted, so the valuable speakers among the many participants can be identified accurately and mixed audio can be delivered to the listeners accurately and efficiently.

Description

Multi-channel voice mixing method and system
Technical Field
The invention belongs to the technical field of voice conferences, and particularly relates to a multi-channel voice mixing method and a multi-channel voice mixing system.
Background
In recent years, with the development of communication technology, real-time voice conversation has progressed from two-person calls to multi-person synchronous conversation. A typical application of multi-person conversation is the voice conference: a modern voice conference lets many people communicate simultaneously, overcoming the traditional teleconference's heavy dependence on equipment and one-way flow of information. Most existing voice conference systems rank the voice channels directly by parameters such as energy and persistence, select the top-ranked channels, and extract and mix their voice data; this eliminates most of the low-quality voice data. However, in a voice conference the status of a participant and the attention paid to them are not directly related to volume or speaking time, so the channels screened this way may fail to cover all the valid voice information and may still contain noise and other interference, seriously compromising the usefulness of the mixed output.
Disclosure of Invention
In order to solve the above technical problem, the present invention provides a method and a system for mixing multiple channels of speech.
The specific technical scheme of the invention is as follows:
the invention provides a multi-channel voice mixing method on one hand, which comprises the following steps:
s1: extracting and processing the audio data streams input by the voice channels, and selecting the voice channel with effective audio data as an active channel;
s2: searching for the position of each active channel which is staggered with the audio change trend of other active channels, counting the positions respectively, and selecting the voice channel participating in the sound mixing according to the staggered times and the quantity of the active channels staggered each time;
s3: and carrying out sound mixing operation on the audio data stream of the voice channel participating in sound mixing at this time based on the self-adaptive normalization algorithm to obtain sound mixing data.
Further, the specific method of step S1 is as follows:
s1.1: carrying out smoothing treatment on the audio data input by each voice channel to obtain a smooth audio data stream;
s1.2: detecting the persistence D of the audio data stream input by each voice channel, and calculating the voice energy E of the audio data stream input by each voice channel;
s1.3: setting a persistence threshold D0And an energy threshold E0D > D0And E > E0The voice channel of (a) is set as an active channel, and the remaining voice channels are set as inactive channels.
Further, the specific method of step S2 is as follows:
s2.1: making an audio frequency spectrum for each active channel according to a time axis;
s2.2: comparing all the audio frequency spectrums, finding out the staggered position of the peak value and the valley value in each audio frequency spectrum and other audio frequency spectrums, and counting the staggered times n;
s2.3: setting a threshold value n of the number of times of interleaving0N > n0The active channel corresponding to the audio frequency spectrum is selected as a main channel and used as a voice channel participating in the sound mixing, and the number of the main channels is n1
Further, step S2 further includes the following steps:
s2.4: setting a threshold N of the number of mixing channels according to requirements, wherein N is1N is less than or equal to the number of active channels, N is less than or equal to N0Audio frequency spectrum ofIs n2
When n is1+n2When the N is less than or equal to N, all N is less than or equal to N0Selecting an active channel corresponding to the audio frequency spectrum as an auxiliary channel to be used as a voice channel participating in the sound mixing;
when n is1+n2When N is greater than N, N is not greater than N0The audio frequency spectrum is counted, and a value P is calculated according to the interleaving times N, wherein the value P is N/N; calculating R value according to the maximum time number m of the interlaced audio frequency spectrum, wherein R is m/N, and calculating PR value of each audio frequency spectrum according to the R value, PR is P, R is mn/N2(ii) a Selecting the top N-N1The voice channel corresponding to the audio frequency spectrum of the bit is an auxiliary channel and is used as a voice channel participating in the sound mixing.
Further, the specific method of step S3 is as follows:
s3.1: selecting a sound mixing strategy, wherein the sound mixing strategy comprises centralized sound mixing performed by a server and distributed sound mixing performed by a terminal;
s3.2: and according to the sound mixing strategy, performing sound mixing operation on the audio data stream of the voice channel which is selected in the step S3 and participates in the sound mixing at this time by the corresponding equipment based on an adaptive normalization algorithm to obtain sound mixing data.
Further, the specific method of step S3.2 is as follows:
when the selected sound mixing strategy is centralized sound mixing, the server directly performs sound mixing operation on the selected audio data stream of the voice channel participating in the sound mixing;
and when the selected mixing strategy is distributed mixing, each terminal further screens, according to IP address or signal strength, the voice channels that participate in the mix at that terminal, and performs the mixing operation.
Another aspect of the present invention provides a multi-channel audio mixing system, where the audio mixing system is disposed on a server or all terminals communicatively connected to the server, and the audio mixing system includes the following components:
the audio smoothing module is used for smoothing the audio data input by each voice channel to obtain a smooth audio data stream;
the active channel setting module is used for carrying out continuity detection and voice energy calculation on the audio data streams input by the voice channels, selecting a voice channel with effective audio data according to the results of continuity and voice energy, setting the voice channel as an active channel, and setting other voice channels as inactive channels;
the audio analysis and screening module is used for searching, for each active channel, the positions where its audio trend is staggered with those of the other active channels, counting these separately, and selecting the voice channels participating in the current mix according to the interleave count and the number of channels staggered each time;
and the audio synthesis module is used for carrying out sound mixing operation on the audio data stream of the voice channel participating in the sound mixing based on the self-adaptive normalization algorithm to obtain sound mixing data.
Further, the active channel setting module includes the following parts:
the parameter statistical unit is used for detecting the continuity D of the audio data streams input by the voice channels and calculating the voice energy E of the audio data streams input by the voice channels;
a determination unit for setting a persistence threshold D0 and an energy threshold E0; the voice channels with D > D0 and E > E0 are set as active channels, and the remaining voice channels are set as inactive channels.
Further, the audio analysis screening module comprises the following parts:
the audio frequency spectrum generating unit is used for making an audio frequency spectrum for each active channel according to a time axis;
a main channel screening unit for setting a time length threshold t0: when the minimum distance between adjacent peak or valley regions in an audio spectrum exceeds t0, the highest or lowest point within each of the two regions is taken as that region's peak or valley value, the minimum distance being the distance between the last frame of the preceding region and the starting frame of the following region; comparing all audio spectra, finding the positions where the peak and valley times in each spectrum are staggered with those of the other spectra, and counting the interleave count n; setting an interleave-count threshold n0; the active channels whose audio spectra have n > n0 are selected as main channels and serve as the voice channels participating in the current mix, the number of main channels being n1;
a screening strategy judgment unit for setting a threshold N for the number of mixing channels according to requirements, where n1 ≤ N ≤ the number of active channels, and the number of audio spectra with n ≤ n0 is n2; when n1 + n2 ≤ N, screening strategy I is adopted; when n1 + n2 > N, screening strategy II is adopted;
and the auxiliary channel screening unit is used for screening the auxiliary channels according to the selected screening strategy:
when n is1+n2When the N is less than or equal to N, adopting a screening strategy I to enable all N to be less than or equal to N0Selecting an active channel corresponding to the audio frequency spectrum as an auxiliary channel to be used as a voice channel participating in the sound mixing;
when n is1+n2When N is greater than N, adopting a second screening strategy, and N is less than or equal to N0The audio frequency spectrum is counted, and a value P is calculated according to the interleaving times N, wherein the value P is N/N; calculating R value according to the maximum time number m of the interlaced audio frequency spectrum, wherein R is m/N, and calculating PR value of each audio frequency spectrum according to the R value, PR is P, R is mn/N2(ii) a Selecting the top N-N1The voice channel corresponding to the audio frequency spectrum of the bit is an auxiliary channel and is used as a voice channel participating in the sound mixing.
Further, when the mixing system is provided on a server, the audio synthesis module includes the following parts:
the centralized sound mixing unit is used for directly carrying out sound mixing operation on the selected audio data stream of the voice channel participating in sound mixing at this time based on a self-adaptive normalization algorithm;
when the audio mixing system is arranged on a terminal, the audio synthesis module comprises the following parts:
the autonomous screening unit is used for further screening out a voice channel participating in the sound mixing at the terminal according to the IP address or the signal strength;
and the distributed sound mixing unit is used for carrying out sound mixing operation on the selected audio data stream of the voice channel participating in the sound mixing at the terminal based on the self-adaptive normalization algorithm.
The invention has the following beneficial effects. The invention provides a multi-channel voice mixing method and a corresponding system. First, according to the persistence and voice energy of each audio data stream, the voice channels carrying valid audio data are screened by form and set as active channels; then, according to how each active channel's audio trend is staggered with those of the other active channels, the active channels providing valuable voice data are further screened by content and selected as the voice channels participating in the current mix; finally, a mixing operation is performed on the audio data streams of the selected voice channels to obtain the mixed data. This two-stage screening improves the accuracy with which the participating channels are chosen: the selected channels provide valuable voice data for the conference and no valuable voice data is omitted, so the valuable speakers among the many participants can be identified accurately and mixed audio can be delivered to the listeners accurately and efficiently.
Drawings
Fig. 1 is a flowchart of a multi-channel speech mixing method according to embodiment 1;
fig. 2 is a flowchart of step S1 in the method for mixing multiple paths of speech according to embodiment 1;
fig. 3 is a flowchart of step S2 in the method for mixing multiple paths of speech according to embodiment 1;
fig. 4 is a schematic structural diagram of a multi-channel speech mixing system according to embodiment 2;
fig. 5 is a schematic structural diagram of a multi-channel speech mixing system according to embodiment 3;
fig. 6 is a schematic structural diagram of a multi-channel speech mixing system according to embodiment 4.
Detailed Description
The present invention will be described in further detail with reference to the following examples and drawings.
Example 1
As shown in fig. 1, this embodiment 1 discloses a multi-channel speech mixing method, which includes the following steps:
s1: extracting and processing the audio data streams input by the voice channels, and selecting the voice channel with effective audio data as an active channel;
s2: searching for the position of each active channel which is staggered with the audio change trend of other active channels, counting the positions respectively, and selecting the voice channel participating in the sound mixing according to the staggered times and the quantity of the active channels staggered each time;
in voice conferencing, different participants often have different division, and the talkback, the talk spot, the audience, and the comment or comment making are performed by different participants, so that the time and energy for transmitting voice data in different voice channels are different, therefore, the audio frequency change conditions of each channel are compared, the importance of each voice channel can be judged according to the condition that different voice channels are mutually staggered (for example, other participants are all kept quiet when the main lecture and the cross talk are spoken, other participants begin to make comments or opinions after the main lecture and the cross talk are finished, and the speaking time of the main lecture is obviously longer than the cross talk), for voice channels with less interleaving times or fewer interleaving channels (participants keep silent or have short speaking time), if the voice channels do not provide valuable voice data, the voice channels are excluded, and the remaining effective channels are used for sound mixing;
s3: and carrying out sound mixing operation on the audio data stream of the voice channel participating in sound mixing at this time based on the self-adaptive normalization algorithm to obtain sound mixing data.
This embodiment provides a multi-channel voice mixing method. First, according to the persistence and voice energy of each audio data stream, the voice channels carrying valid audio data are screened by form and set as active channels; then, according to how each active channel's audio trend is staggered with those of the other active channels, the active channels providing valuable voice data are further screened by content and selected as the voice channels participating in the current mix; finally, a mixing operation is performed on the audio data streams of the selected voice channels to obtain the mixed data. This two-stage screening improves the accuracy with which the participating channels are chosen: the selected channels provide valuable voice data for the conference and no valuable voice data is omitted, so the valuable speakers among the many participants can be identified accurately and mixed audio can be delivered to the listeners accurately and efficiently.
In specific implementation, as shown in fig. 2, the specific method of step S1 is as follows:
s1.1: carrying out smoothing treatment on the audio data input by each voice channel to obtain a smooth audio data stream;
the smoothing process comprises decoding, packet loss concealment and jitter control, wherein the packet loss concealment and the jitter control are mainly based on an out-of-band FEC technique and a dynamic adaptive control technique of in-band FEC;
s1.2: detecting the persistence D of the audio data stream input by each voice channel, and calculating the voice energy E of the audio data stream input by each voice channel;
s1.3: setting a persistence threshold D0And an energy threshold E0D > D0And E > E0The voice channel of (a) is set as an active channel, and the remaining voice channels are set as inactive channels.
The persistence detection can be implemented with a first-order lag filter, and the voice energy can be obtained by integrating the energy over 300 ms. Some voice channels may output no voice (nobody talking) or no valid voice (only noise such as doors opening and closing, telephone rings, or unrelated speech). To screen the valid information out, thresholds on persistence and voice energy are set: the top-ranking voice channels (strong persistence, large voice energy, hence highly valid voice information) are selected as active channels, while voice with weak persistence (short duration) or low energy (low volume, poor stability) is treated as inactive audio, and its channel is set as an inactive channel and excluded;
since each voice channel may be mixed with noise (such as door opening and closing sound, telephone ring or other speaking sound of irrelevant personnel) in the process of transmitting voice information, by setting the thresholds of continuity and voice energy, the corresponding voice channel is set as an inactive channel; only the voice information with persistence and energy exceeding the threshold can be considered as effective audio which can participate in the audio mixing, and the corresponding voice channel is the active channel.
In specific implementation, as shown in fig. 3, the specific method of step S2 is as follows:
s2.1: making an audio frequency spectrum (using eight frequency-division fft) for each active channel according to a time axis;
s2.2: setting a time length threshold t0When the minimum distance between adjacent peak or valley regions in an audio spectrum (i.e. the distance between the last frame of a preceding region and the start frame of a succeeding region) exceeds t0The highest point or the lowest point in the two areas is considered as the peak value or the valley value of the two areas; comparing all the audio frequency spectrums, finding out the staggered position of the peak value and valley value appearing time in each audio frequency spectrum and other audio frequency spectrums, and counting the staggered times n;
s2.3: setting a threshold value n of the number of times of interleaving0N > n0The active channel corresponding to the audio frequency spectrum is selected as a main channel and used as a voice channel participating in the sound mixing, and the number of the main channels is n1
A peak in the audio spectrum is where the channel's voice output reaches its maximum (i.e. the volume or the speaking rate is greatest), and a valley is where it reaches its minimum (the volume or the speaking rate is smallest). Several sounds may be mixed in one audio spectrum, i.e. several sound-wave curves may be present, but their overall fluctuation trends agree, so the highest point is taken from the region where the peaks of the curves cluster (the peak region) as the peak value; likewise the lowest point is taken from the region where their valleys cluster (the valley region) as the valley value. In addition, to avoid segmentation and counting errors in the statistics, a time threshold t0 is set: only when the distance (time interval) between adjacent peak regions or valley regions exceeds this threshold is the sound transmission considered genuinely interrupted; otherwise it is treated as a normal pause between sentences.
When the peaks and valleys of one audio spectrum are staggered with those of the other spectra, the corresponding voice channel outputs voice at times different from the rest (the other users are silent while this channel's user speaks, and begin speaking after this user finishes). From this, the role each active channel's user plays in the voice conference can be judged and the importance of each active channel determined. An audio spectrum whose interleave count exceeds the preset threshold is considered important and worth mixing and distributing, so its channel is selected as a main channel and participates in the mix.
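The peak detection and interleave counting described above can be sketched as follows. For simplicity this operates on a one-dimensional energy envelope per channel rather than a full eight-band FFT spectrum, and it handles peaks only (valleys are symmetric); the merge rule using `t0` and the match tolerance `tol` are illustrative assumptions.

```python
def peak_times(envelope, t0=3):
    """Frame indices of local maxima; maxima closer than t0 frames are
    treated as one peak region and only the highest point is kept."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        if envelope[i] >= envelope[i - 1] and envelope[i] > envelope[i + 1]:
            if peaks and i - peaks[-1] < t0:
                # within t0 of the previous region: same region, keep the higher point
                if envelope[i] > envelope[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks

def interleave_count(my_peaks, other_peaks_lists, tol=1):
    """Count this channel's peaks that coincide with no other channel's
    peaks, i.e. positions staggered with every other audio spectrum."""
    n = 0
    for p in my_peaks:
        if all(all(abs(p - q) > tol for q in peaks) for peaks in other_peaks_lists):
            n += 1
    return n
```

A channel whose `interleave_count` exceeds the threshold n0 would then be selected as a main channel, as in step S2.3.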
Step S2 further includes the steps of:
s2.4: setting a threshold N of the number of mixing channels according to requirements, wherein N is1N is less than or equal to the number of active channels, N is less than or equal to N0Of the audio frequency spectrum of n2
When n is1+n2When the N is less than or equal to N, all N is less than or equal to N0Selecting an active channel corresponding to the audio frequency spectrum as an auxiliary channel to be used as a voice channel participating in the sound mixing;
when n is1+n2When N is greater than N, N is not greater than N0The audio frequency spectrum is counted, and a value P is calculated according to the interleaving times N, wherein the value P is N/N; calculating R value according to the maximum time number m of the interlaced audio frequency spectrum, wherein R is m/N, and calculating PR value of each audio frequency spectrum according to the R value, PR is P, R is mn/N2(ii) a Selecting the top N-N1The voice channel corresponding to the audio frequency spectrum of the bit is an auxiliary channel and is used as a voice channel participating in the sound mixing.
To collect valuable voice information fully, keep the mixed information accurate, and reduce noise, a threshold on the number of voice channels participating in the mix is imposed, and within that limit the number of participating channels is brought as close to the threshold as possible. When the number of spectra with n > n0 plus the number with n ≤ n0, i.e. n1 + n2 (the number of active channels), does not exceed the threshold, the voice channels corresponding to all spectra with n ≤ n0 are set as auxiliary channels and participate in the mix. When n1 + n2 exceeds the threshold, PR values are calculated by the algorithm above and ranked, and the voice channels corresponding to the top-ranked audio spectra are selected as auxiliary channels so that the total of main and auxiliary channels equals the threshold.
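The two screening strategies of step S2.4 can be sketched as below. The original formulas are garbled in this copy; the sketch assumes the reading P = n/n0, R = m/n0, PR = P·R = mn/n0², where n is a channel's interleave count, m the largest number of other spectra it is staggered with at one time, and n0 the interleave threshold. The function name and data layout are illustrative.

```python
def select_channels(stats, n0, N):
    """stats: {channel_id: (n, m)}. Returns main + auxiliary channel ids,
    assuming the number of main channels n1 does not exceed N."""
    main = [c for c, (n, _) in stats.items() if n > n0]      # n > n0: main channels
    rest = [c for c in stats if c not in main]               # n <= n0: candidates
    if len(stats) <= N:                       # strategy I: n1 + n2 <= N, take all
        return main + rest
    # strategy II: rank the candidates by PR = m*n / n0**2, keep the top N - n1
    pr = lambda c: stats[c][0] * stats[c][1] / (n0 ** 2)
    rest.sort(key=pr, reverse=True)
    return main + rest[: N - len(main)]
```

Since every candidate's PR value shares the same denominator n0², the ranking reduces to ordering by the product m·n; the normalization matters only if scores are compared across different thresholds.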
In specific implementation, the specific method of step S3 is as follows:
s3.1: selecting a sound mixing strategy, wherein the sound mixing strategy comprises centralized sound mixing performed by a server and distributed sound mixing performed by a terminal;
s3.2: and according to the sound mixing strategy, performing sound mixing operation on the audio data stream of the voice channel which is selected in the step S3 and participates in the sound mixing at this time by the corresponding equipment based on an adaptive normalization algorithm to obtain sound mixing data.
The specific method of step S3.2 is as follows:
when the selected sound mixing strategy is centralized sound mixing, the server directly carries out sound mixing operation on the selected audio data stream of the voice channel participating in the sound mixing;
and when the selected sound mixing strategy is distributed sound mixing, each terminal further screens out the voice channel participating in the sound mixing at the terminal according to the IP address or the signal intensity, and performs sound mixing operation.
With centralized mixing, the server performs the screening and mixing itself and then distributes the result to every terminal, which is simpler and more centralized. With distributed mixing, each terminal can further screen, according to its subjective needs or objective conditions, the voice channels that participate in the mix at that terminal; the mixing results may differ between terminals, which better satisfies the individual needs of different participants.
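The patent names an adaptive normalization algorithm for the mixing step without giving its exact form. The sketch below uses one common variant as an assumed stand-in: channels are summed, and a global attenuation factor tightens whenever the sum would clip 16-bit PCM, then decays back toward unity; the factor-recovery rate and the PCM limit are illustrative choices, not from the patent.

```python
def adaptive_normalize_mix(channels, limit=32767):
    """Mix equal-length 16-bit PCM channels with an adaptive gain factor.

    channels: list of sample lists, one per selected voice channel.
    """
    factor = 1.0
    mixed = []
    for frame in zip(*channels):
        total = sum(frame)
        s = total * factor
        if abs(s) > limit:
            # would clip: tighten the factor so this sample just fits
            factor = limit / abs(total)
            s = limit if s > 0 else -limit
        else:
            # no clipping: relax the factor slowly back toward 1.0
            factor = min(1.0, factor + (1.0 - factor) / 32)
        mixed.append(int(s))
    return mixed
```

Under centralized mixing the server would run this once over the selected channels and distribute the result; under distributed mixing each terminal would run it over its own further-screened subset.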
Example 2
As shown in fig. 4, embodiment 2 of the present invention provides a multi-channel speech mixing system, where the mixing system is disposed on a server or all terminals communicatively connected to the server, and the mixing system includes the following components:
the audio smoothing module 1 is used for smoothing audio data input by each voice channel to obtain a smooth audio data stream;
the active channel setting module 2 is used for carrying out continuity detection and voice energy calculation on the audio data streams input by the voice channels, selecting a voice channel with effective audio data according to the results of continuity and voice energy, setting the voice channel as an active channel, and setting other voice channels as inactive channels;
the audio analyzing and screening module 3 is used for searching for the positions of the active channels, which are staggered with the audio change trend of other active channels, for each active channel, counting the positions respectively, and selecting the voice channels participating in the sound mixing according to the staggered times and the number of the active channels staggered each time;
and the audio synthesis module 4 is configured to perform audio mixing operation on the audio data stream of the voice channel participating in audio mixing based on an adaptive normalization algorithm to obtain audio mixing data.
This embodiment provides a multi-channel voice mixing system. First, according to the persistence and voice energy of each audio data stream, the active channel setting module 2 screens the voice channels carrying valid audio data by form and sets them as active channels; then, according to how each active channel's audio trend is staggered with those of the other active channels, the audio analysis and screening module 3 further screens, by content, the active channels providing valuable voice data and selects them as the voice channels participating in the current mix; finally, the audio synthesis module 4 performs the mixing operation on the audio data streams of the selected voice channels to obtain the mixed data. This two-stage screening improves the accuracy with which the participating channels are chosen: the selected channels provide valuable voice data for the conference and no valuable voice data is omitted, so the valuable speakers among the many participants can be identified accurately and mixed audio can be delivered to the listeners accurately and efficiently.
Example 3
As shown in fig. 5, embodiment 3 discloses a mixing system for multi-channel speech based on embodiment 2, where embodiment 3 further defines that the mixing system is disposed on a server, and the audio synthesis module 4 includes the following parts:
the centralized sound mixing unit 41 is configured to perform a sound mixing operation on the selected audio data stream of the voice channel participating in the sound mixing this time directly based on an adaptive normalization algorithm;
the active channel setting module 2 includes the following parts:
the parameter statistical unit 21 is configured to detect a persistence D of an audio data stream input by each voice channel, and calculate a voice energy E of the audio data stream input by each voice channel;
a determination unit 22 for setting a persistence threshold D0 and an energy threshold E0; the voice channels with D > D0 and E > E0 are set as active channels, and the remaining voice channels are set as inactive channels.
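The parameter statistical unit 21 and determination unit 22 above can be sketched as follows. The patent does not define how persistence D and voice energy E are computed, so this sketch assumes D is the fraction of samples above a silence floor and E is the mean squared amplitude; the thresholds and field names are placeholders:

```python
def select_active_channels(channels, d0=0.5, e0=1e4, floor=500):
    """Split voice channels into active/inactive by persistence and energy.

    channels: dict mapping channel id -> list of PCM samples.
    Persistence D is taken here as the fraction of samples above a
    silence floor, and energy E as the mean squared amplitude; both
    definitions are assumptions, since the patent leaves them open.
    """
    active, inactive = [], []
    for cid, samples in channels.items():
        d = sum(1 for s in samples if abs(s) > floor) / len(samples)
        e = sum(s * s for s in samples) / len(samples)
        # determination unit: D > D0 and E > E0 -> active channel
        (active if d > d0 and e > e0 else inactive).append(cid)
    return active, inactive
```

A channel of steady speech-level samples lands in the active list; a silent channel is set inactive.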
The audio analysis screening module 3 includes the following parts:
an audio spectrum generation unit 31 for making an audio spectrum for each active channel according to a time axis;
a main channel screening unit 32 for setting a duration threshold t0; when the minimum distance between adjacent peak regions or valley regions in an audio spectrum (i.e., the distance between the last frame of the preceding region and the starting frame of the succeeding region) exceeds t0, the highest or lowest points of the two regions are taken as their peak or valley values; all audio spectra are compared to find the positions where the occurrence times of the peaks and valleys in each audio spectrum are interleaved with those of the other audio spectra, and the number of interleavings n is counted; an interleaving-count threshold n0 is set, and the active channels whose audio spectra have n > n0 are selected as main channels, i.e., voice channels participating in the current sound mixing; the number of main channels is n1.
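The peak extraction and interleaving count performed by the main channel screening unit can be sketched as below. This is a simplification under stated assumptions: the "audio spectrum" is treated as a one-dimensional energy-over-time curve, and a peak of one channel counts as interleaved with another channel when the other channel has no peak within a small tolerance of that time; the patent's full criterion (peaks and valleys, region-based merging) is reduced accordingly:

```python
def peak_times(spectrum, t0=3):
    """Indices of local maxima separated by more than t0 frames."""
    peaks, last = [], -t0 - 1
    for i in range(1, len(spectrum) - 1):
        if spectrum[i] > spectrum[i - 1] and spectrum[i] >= spectrum[i + 1]:
            if i - last > t0:      # merge maxima closer than t0 frames
                peaks.append(i)
                last = i
    return peaks

def interleave_count(spectra, t0=3, tol=1):
    """Count, per channel, peaks occurring where no other channel peaks.

    spectra: dict mapping channel id -> energy-over-time curve.
    Returns the interleaving count n for each channel (a simplified
    stand-in for the patent's peak/valley interleaving statistic).
    """
    times = {cid: peak_times(s, t0) for cid, s in spectra.items()}
    counts = {}
    for cid, ts in times.items():
        n = 0
        for other, ots in times.items():
            if other == cid:
                continue
            n += sum(1 for t in ts if all(abs(t - o) > tol for o in ots))
        counts[cid] = n
    return counts
```

Two channels whose energy peaks fall at different times each accumulate interleavings against the other, which is exactly the "different people speaking at different moments" signal the screening relies on.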
a screening strategy judgment unit 33 for setting a threshold N of the number of mixing channels as required, where n1 ≤ N ≤ the number of active channels, and counting the number n2 of audio spectra with n ≤ n0; when n1 + n2 ≤ N, screening strategy I is adopted; when n1 + n2 > N, screening strategy II is adopted;
an auxiliary channel screening unit 34, configured to perform screening of auxiliary channels according to the selected screening policy:
when n1 + n2 ≤ N, the active channels corresponding to all audio spectra with n ≤ n0 are selected as auxiliary channels, i.e., voice channels participating in the current sound mixing;
when n1 + n2 > N, the audio spectra with n ≤ n0 are counted: a value P is calculated from the interleaving count n as P = n/N; a value R is calculated from the maximum number m of audio spectra interleaved at one position as R = m/N; the PR value of each audio spectrum is then PR = P·R = mn/N²; the voice channels corresponding to the top N − n1 audio spectra ranked by PR value are selected as auxiliary channels, i.e., voice channels participating in the current sound mixing.
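Screening strategy II above reduces to a ranking: since P and R share the denominator N, ordering by PR = mn/N² is the same as ordering by m·n. A minimal sketch (the per-channel statistics are assumed to have been gathered beforehand; names are illustrative):

```python
def pick_auxiliary(stats, N, n1):
    """Rank candidates by PR = (n/N) * (m/N) and keep the top N - n1.

    stats: dict mapping channel id -> (n, m), where n is the channel's
    interleaving count and m the largest number of spectra interleaved
    at a single position.  N is the mixing-channel threshold and n1 the
    number of main channels already selected.
    """
    pr = {cid: (n * m) / (N * N) for cid, (n, m) in stats.items()}
    ranked = sorted(pr, key=pr.get, reverse=True)
    return ranked[: N - n1]
```

With N = 3 main-plus-auxiliary slots and n1 = 1 main channel already chosen, only the two highest-PR candidates survive.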
Example 4
As shown in fig. 6, embodiment 4 discloses a mixing system for multi-channel speech based on embodiment 2, where embodiment 4 further defines that the mixing system is disposed on a terminal, and the audio synthesis module 4 includes the following parts:
an autonomous screening unit 42, configured to further screen out, at this terminal, the voice channels participating in the current sound mixing according to IP address or signal strength;
and the distributed audio mixing unit 43 is configured to perform audio mixing operation on the audio data stream of the selected voice channel participating in the audio mixing at the terminal based on the adaptive normalization algorithm.
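The autonomous screening step in the distributed case can be sketched as a simple filter. The patent only says IP address or signal strength is used; the concrete rule below (drop the terminal's own stream by IP, drop streams below an RSSI floor) and the field names are assumptions:

```python
def local_mix_channels(channels, own_ip, min_rssi=-75):
    """Terminal-side screening before distributed mixing.

    channels: list of dicts with 'ip' and 'rssi' keys (illustrative
    fields, not defined by the patent).  Drops the terminal's own
    stream by IP and any stream whose signal strength is too weak.
    """
    return [c for c in channels
            if c["ip"] != own_ip and c["rssi"] >= min_rssi]
```

The surviving channels are then handed to the distributed mixing unit for the adaptive-normalization mix.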
At this time, the specific structures of the active channel setting module 2 and the audio analysis and screening module 3 are the same as those of embodiment 3.
The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flow charts, in some cases the steps may be performed in an order different from that shown here.
The method described in the invention is implemented in a computer system, which may be provided, for example, in a processor of a server or a client. For example, the methods described herein may be implemented as software with control logic that is executed by a CPU in a server. The functionality described herein may be implemented as a set of program instructions stored in a non-transitory tangible computer-readable medium. When implemented in this manner, the computer program comprises a set of instructions which, when executed by a computer, cause the computer to perform a method capable of carrying out the functions described above. Programmable logic may be installed temporarily or permanently in a non-transitory tangible computer-readable medium, such as a read-only memory chip, computer memory, disk, or other storage medium. In addition to being implemented in software, the logic described herein may be embodied using discrete components, an integrated circuit, programmable logic used in conjunction with a programmable logic device such as a field-programmable gate array (FPGA) or microprocessor, or any other device, including any combination thereof. All such implementations are intended to fall within the scope of the present invention.

Claims (9)

1. A multi-channel voice mixing method is characterized by comprising the following steps:
s1: extracting and processing the audio data streams input by the voice channels, and selecting the voice channel with effective audio data as an active channel;
s2: searching, for each active channel, for the positions where its audio change trend is interleaved with that of the other active channels, counting them respectively, and selecting the voice channels participating in the current sound mixing according to the number of interleavings and the number of active channels interleaved each time;
s3: performing sound mixing operation on the audio data stream of the voice channel participating in sound mixing at this time based on an adaptive normalization algorithm to obtain sound mixing data;
the specific method of step S2 is as follows:
s2.1: making an audio frequency spectrum for each active channel according to a time axis;
s2.2: setting a duration threshold t0; when the minimum distance between adjacent peak regions or valley regions in a single audio spectrum exceeds t0, the highest or lowest points of the two regions are taken as their peak or valley values, the minimum distance being the distance between the last frame of the preceding region and the starting frame of the succeeding region; comparing all audio spectra, finding the positions where the occurrence times of the peaks and valleys in each audio spectrum are interleaved with those of the other audio spectra, and counting the number of interleavings n;
s2.3: setting an interleaving-count threshold n0, and selecting the active channels whose audio spectra have n > n0 as main channels, i.e., voice channels participating in the current sound mixing, the number of main channels being n1.
2. The method for mixing multi-channel speech as claimed in claim 1, wherein the specific method of step S1 is as follows:
s1.1: carrying out smoothing treatment on the audio data input by each voice channel to obtain a smooth audio data stream;
s1.2: detecting the persistence D of the audio data stream input by each voice channel, and calculating the voice energy E of the audio data stream input by each voice channel;
s1.3: setting a persistence threshold D0 and an energy threshold E0, setting the voice channels with D > D0 and E > E0 as active channels, and setting the remaining voice channels as inactive channels.
3. The multi-channel speech mixing method of claim 1, wherein the step S2 further comprises the steps of:
s2.4: setting a threshold N of the number of mixing channels according to requirements, wherein N is1N is less than or equal to the number of active channels, N is less than or equal to N0Of the audio frequency spectrum of n2
When n is1+n2When the N is less than or equal to N, all N is less than or equal to N0Selecting an active channel corresponding to the audio frequency spectrum as an auxiliary channel to be used as a voice channel participating in the sound mixing;
when n is1+n2When N is greater than N, N is not greater than N0The audio frequency spectrum is counted, and a value P is calculated according to the interleaving times N, wherein the value P is N/N; calculating R value according to the maximum time number m of the interlaced audio frequency spectrum, wherein R is m/N, and calculating PR value of each audio frequency spectrum according to the R value, PR is P, R is mn/N2(ii) a Selecting the top N-N1The voice channel corresponding to the audio frequency spectrum of the bit is an auxiliary channel and is used as a voice channel participating in the sound mixing.
4. The method for mixing multi-channel speech as claimed in claim 1, wherein the specific method of step S3 is as follows:
s3.1: selecting a sound mixing strategy, wherein the sound mixing strategy comprises centralized sound mixing performed by a server and distributed sound mixing performed by a terminal;
s3.2: according to the sound mixing strategy, the corresponding device performs, based on an adaptive normalization algorithm, the sound mixing operation on the audio data streams of the voice channels selected in step S2 as participating in the current sound mixing, to obtain the sound mixing data.
5. The method for mixing multi-channel speech as claimed in claim 4, wherein the specific method of step S3.2 is as follows:
when the selected sound mixing strategy is centralized sound mixing, the server directly performs sound mixing operation on the selected audio data stream of the voice channel participating in the sound mixing;
and when the selected sound mixing strategy is distributed sound mixing, each terminal further screens out the voice channel participating in the sound mixing at the terminal according to the IP address or the signal intensity, and performs sound mixing operation.
6. A multi-channel voice mixing system is characterized in that the mixing system is arranged on a server or all terminals which are in communication connection with the server, and the mixing system comprises the following parts:
the audio smoothing module (1) is used for smoothing audio data input by each voice channel to obtain a smooth audio data stream;
the active channel setting module (2) is used for carrying out persistence detection and voice energy calculation on the audio data streams input by the voice channels, selecting a voice channel with effective audio data according to the results of persistence and voice energy, setting the voice channel as an active channel, and setting other voice channels as inactive channels;
the audio analysis and screening module (3) is used for searching, for each active channel, for the positions where its audio change trend is interleaved with that of the other active channels, counting them respectively, and selecting the voice channels participating in the current sound mixing according to the number of interleavings and the number of active channels interleaved each time;
the audio synthesis module (4) is used for carrying out sound mixing operation on the audio data stream of the voice channel participating in sound mixing based on a self-adaptive normalization algorithm to obtain sound mixing data;
the audio analysis screening module (3) comprises the following parts:
an audio spectrum generation unit (31) for making an audio spectrum for each active channel according to a time axis;
a main channel screening unit (32) for setting a duration threshold t0; when the minimum distance between adjacent peak regions or valley regions in an audio spectrum exceeds t0, the highest or lowest points of the two regions are taken as their peak or valley values, the minimum distance being the distance between the last frame of the preceding region and the starting frame of the succeeding region; comparing all audio spectra, finding the positions where the occurrence times of the peaks and valleys in each audio spectrum are interleaved with those of the other audio spectra, and counting the number of interleavings n; setting an interleaving-count threshold n0 and selecting the active channels whose audio spectra have n > n0 as main channels, i.e., voice channels participating in the current sound mixing, the number of main channels being n1.
7. The mixing system of multi-channel speech according to claim 6, characterized in that the active channel setting module (2) comprises the following parts:
the parameter statistical unit (21) is used for detecting the persistence D of the audio data streams input by the voice channels and calculating the voice energy E of the audio data streams input by the voice channels;
a determination unit (22) for setting a persistence threshold D0 and an energy threshold E0, setting the voice channels with D > D0 and E > E0 as active channels, and setting the remaining voice channels as inactive channels.
8. The mixing system of multi-channel speech according to claim 6, characterized in that the audio analysis filtering module (3) further comprises the following parts:
a screening strategy judgment unit (33) for setting a threshold N of the number of mixing channels as required, where n1 ≤ N ≤ the number of active channels, and counting the number n2 of audio spectra with n ≤ n0; when n1 + n2 ≤ N, adopting screening strategy I; when n1 + n2 > N, adopting screening strategy II;
a secondary channel screening unit (34) for screening secondary channels according to the selected screening strategy:
when n1 + n2 ≤ N, adopting screening strategy I: selecting the active channels corresponding to all audio spectra with n ≤ n0 as auxiliary channels, i.e., voice channels participating in the current sound mixing;
when n1 + n2 > N, adopting screening strategy II: counting the audio spectra with n ≤ n0, calculating a value P from the interleaving count n as P = n/N, calculating a value R from the maximum number m of audio spectra interleaved at one position as R = m/N, calculating the PR value of each audio spectrum as PR = P·R = mn/N², and selecting the voice channels corresponding to the top N − n1 audio spectra ranked by PR value as auxiliary channels, i.e., voice channels participating in the current sound mixing.
9. The mixing system of multi-channel speech according to claim 6, wherein when the mixing system is provided on a server, the audio synthesis module (4) comprises:
the centralized sound mixing unit (41) is used for directly carrying out sound mixing operation on the audio data stream of the selected voice channel participating in sound mixing at this time based on an adaptive normalization algorithm;
when the mixing system is arranged on a terminal, the audio synthesis module (4) comprises the following parts:
an autonomous screening unit (42) for further screening out voice channels participating in the current sound mixing at the terminal according to the IP address or the signal strength;
and the distributed sound mixing unit (43) is used for carrying out sound mixing operation on the audio data stream of the selected voice channel participating in the sound mixing at the terminal based on the self-adaptive normalization algorithm.