CN109451194B - Conference sound mixing method and device

Info

Publication number
CN109451194B
Authority
CN
China
Prior art keywords
mixing
data
conference
period
voice data
Legal status
Active
Application number
CN201811139853.8A
Other languages
Chinese (zh)
Other versions
CN109451194A (en)
Inventor
周愚
张一都
马晓莹
胡家军
蔡立安
淳增辉
邓玉华
徐鹏飞
徐翔
白小平
李超然
杜振业
邓松
付建强
李三
Current Assignee
722th Research Institute of CSIC
Original Assignee
722th Research Institute of CSIC
Application filed by 722th Research Institute of CSIC
Priority to CN201811139853.8A
Publication of CN109451194A
Application granted
Publication of CN109451194B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/561 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, by multiplexing
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities, audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a conference sound mixing method and device, belonging to the technical field of information processing. The method comprises: acquiring the voice data of each participating user of a conference in the current mixing period, wherein the voice data of each participating user in the current mixing period comprises a plurality of sample data obtained by continuously sampling the voice signal of the corresponding participating user during the current mixing period; determining a conference mixing mode based on the voice data of each participating user in the current mixing period; and mixing the voice data of each participating user in the current mixing period according to the determined conference mixing mode and outputting the mixed data to each participating user. The mixing method provided by the invention can restore the real situation of the conference to the greatest extent, adapt to complex and changeable conference environments, improve the conference mixing effect, and improve the user experience.

Description

Conference sound mixing method and device
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a conference sound mixing method and apparatus.
Background
A digital voice conference allows multiple conference users in different places to hold a telephone voice conference remotely. Conference mixing is a key technology of digital voice conferencing: it enables each participant to hear the speech of participants in other places while users in different places carry out voice discussion and exchange.
The core idea of conference mixing is to superimpose the voice data of all users in a conference in the time domain. However, because the precision of the analog-to-digital converter is limited, the superimposed digital voice signal may overflow. Experts in digital speech processing have proposed many algorithms to handle mixing overflow, such as the clamping (clipping) method, which forcibly truncates the overflowing speech; the averaging method, which attenuates each user's voice data according to the number of users; and the competition method, which selects a small number of high-energy speakers for mixing according to the speech energy of the participating users. Each of these algorithms has its advantages and disadvantages in particular use environments, and each can satisfy the conference mixing needs of a limited number of users in certain fixed use environments.
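For orientation, the following illustrative sketch (not part of the patent text) shows the three classic strategies just described, applied to one frame of samples; the frame contents, the user count, and the ±4032 clipping bound (the 13-bit linear range used later in this description) are assumptions made for the example.

    MAX_13BIT = 4032

    def clamp_mix(frames):
        """Clamping/clipping: sum all users and truncate values that overflow."""
        return [max(-MAX_13BIT, min(MAX_13BIT, sum(samples)))
                for samples in zip(*frames)]

    def average_mix(frames):
        """Averaging: attenuate according to the number of users before summing."""
        n = len(frames)
        return [sum(samples) // n for samples in zip(*frames)]

    def compete_mix(frames, keep=3):
        """Competition: keep only the `keep` highest-energy users, then clamp-mix."""
        energies = [sum(abs(s) for s in f) for f in frames]
        winners = sorted(range(len(frames)), key=lambda i: energies[i], reverse=True)[:keep]
        return clamp_mix([frames[i] for i in winners])

    frames = [[1000, -2000, 3000], [2500, 2500, 2500], [-500, 400, 300]]
    print(clamp_mix(frames), average_mix(frames), compete_mix(frames, keep=2))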
In the process of implementing the invention, the inventor finds that the prior art has at least the following problems:
Each of these conference mixing methods is suitable only for a single use environment. In actual use, however, the users' environments and their voice behavior in a conference are not single and fixed but complex and changeable. The above methods can perform conference mixing well only within a limited range of application environments, and when the application environment is complex, adopting any one of these fixed algorithms yields a poor conference mixing effect.
Disclosure of Invention
Embodiments of the present invention provide a conference mixing method and device that can dynamically select the conference mixing algorithm according to the conference application environment, thereby improving the mixing effect. The technical solution is as follows:
in one aspect, a conference mixing method is provided, and the method includes:
acquiring voice data of each participating user of a conference in a current sound mixing period, wherein the voice data of each participating user in the current sound mixing period comprises a plurality of sampling data obtained by continuously sampling voice signals of the corresponding participating user in the current sound mixing period;
determining a conference sound mixing mode based on the voice data of each participating user in the current sound mixing period;
and mixing the voice data of each participating user in the current mixing period according to the determined conference mixing mode, and outputting the mixing data to each participating user.
Optionally, the determining a conference mixing mode based on the voice data of each participating user in the current mixing cycle includes:
determining the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum based on the voice data of each participating user in the current mixing period, wherein a conference voice data overflow means that the bit width of the digital signal formed by the sum of the sample data of all participating users in the same sampling period is greater than the bit width of an encoder, the encoder being used for encoding the mixed data; the maximum overflow value of the conference voice data is the largest of the sums of sample data for which overflow occurs; and the accumulated value of the conference voice data sum is the accumulated value of the absolute values of the sums of the sample data of the sampling periods;
and determining the conference mixing mode of the current mixing period based on the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum.
Optionally, the determining the number of the speaking users of the conference based on the voice data of the respective participating users in the current mixing cycle includes:
determining, based on the voice data of each participating user in the current mixing period, whether each participating user speaks in each sub-period of the current mixing period, wherein the mixing period is divided into a plurality of sub-periods;
and determining the maximum number of users speaking in the same sub-period, to obtain the number of speaking users of the conference in the current mixing period.
Optionally, the determining, based on the voice data of each participating user in the current mixing period, whether each participating user speaks in each sub-period of the current mixing period includes:
determining the absolute value of each sample datum of the ith user in the jth sub-period of the current mixing period and the accumulated value of those absolute values;
determining the number of short-time impulses of the ith user in the jth sub-period based on the absolute values of the sample data of the ith user in the jth sub-period, wherein a short-time impulse means that, within b consecutive sample data, the absolute values of the k consecutive sample data in the middle are all greater than a set energy value while the absolute values of the remaining sample data are all less than the set energy value, with 1 ≤ k ≤ 3, b ≤ m, and m being the number of sample data of each participating user in the jth sub-period;
and determining whether the ith user speaks in the jth sub-period based on the number of short-time impulses of the ith user in the jth sub-period and the accumulated value of the absolute values of the sample data.
Optionally, the mixing the voice data of each participating user in the current mixing period according to the determined conference mixing mode, and outputting the mixed data to each participating user includes:
processing the voice data of each participating user in the current mixing period based on the determined conference mixing mode;
taking the processed voice data of each participating user in the current sound mixing period as the data to be mixed of each participating user in the current sound mixing period, and mixing the data to be mixed of each participating user in the current sound mixing period;
and outputting mixed sound data to each participating user.
Optionally, the mixing the data to be mixed of the participating users in the current mixing cycle includes:
determining a data group to be mixed of each participating user in the current mixing period based on the data to be mixed of each participating user in the current mixing period, wherein the data group to be mixed of the ith participating user in the current mixing period comprises the data to be mixed of each participating user except the ith participating user in the conference in the current mixing period;
and mixing the to-be-mixed data groups of the participating users in the current mixing period respectively, to obtain the mixed data of each participating user in the current mixing period, wherein the s-th mixed datum in the mixed data of the ith participating user is the sum of the s-th sample data of the to-be-mixed data in the to-be-mixed data group of the ith participating user.
Optionally, the determining a conference mixing mode of the current mixing period based on the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum includes:
acquiring the number of mixing overflows that occurred when the voice data of each participating user in the previous mixing period was mixed in the previous mixing period, and the accumulated value of the mixed data sum, wherein a mixing overflow means that the bit width of the digital signal formed by the sum of the sample data of the same sampling period included in the data to be mixed of the participating users is greater than the bit width of the encoder, and the accumulated value of the mixed data sum is the accumulated value of the absolute values of the sums of the sample data of the same sampling period included in the data to be mixed of the participating users;
and determining the conference mixing mode of the current mixing period based on the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, the accumulated value of the conference voice data sum, the number of mixing overflows that occurred in the previous mixing period, and the accumulated value of the absolute values of the mixed data sum in the previous mixing period.
In another aspect, there is provided a conference mixing apparatus, the apparatus including:
The apparatus comprises an acquisition module, a determining module, a mixing module, and an output module. The acquisition module is used for acquiring voice data of each participating user of a conference in the current mixing period, wherein the voice data of each participating user in the current mixing period comprises a plurality of sample data obtained by continuously sampling the voice signal of the corresponding participating user during the current mixing period;
the determining module is used for determining a conference mixing mode based on the voice data of each participating user in the current mixing period;
the audio mixing module is used for mixing audio of the voice data of each participating user in the current audio mixing period according to the determined conference audio mixing mode;
and the output module is used for outputting the mixed sound data to each participating user.
Optionally, the determining module is configured to:
determining the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum based on the voice data of each participating user in the current mixing period, wherein a conference voice data overflow means that the bit width of the digital signal formed by the sum of the sample data of all participating users in the same sampling period is greater than the bit width of an encoder, the encoder being used for encoding the mixed data; the maximum overflow value of the conference voice data is the largest of the sums of sample data for which overflow occurs; and the accumulated value of the conference voice data sum is the accumulated value of the absolute values of the sums of the sample data of the sampling periods;
and determining the conference mixing mode of the current mixing period based on the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum.
Optionally, the determining module is configured to:
determining, based on the voice data of each participating user in the current mixing period, whether each participating user speaks in each sub-period of the current mixing period, wherein the mixing period is divided into a plurality of sub-periods;
and determining the maximum number of users speaking in the same sub-period, to obtain the number of speaking users of the conference in the current mixing period.
The technical solutions provided by the embodiments of the present invention have the following beneficial effects:
The voice data of each participating user of the conference in the current mixing period is acquired, and the voice data of each participating user in the current mixing period comprises a plurality of sample data obtained by continuously sampling the voice signal of the corresponding participating user during the current mixing period, so the voice data of one mixing period comes from a short stretch of the user's voice signal. A conference mixing mode is determined based on the voice data of each participating user in the current mixing period; because a short stretch of the users' voice signals reflects the user behavior and the conference application environment during that stretch, the mixing mode for that stretch is determined according to the user behavior and the conference application environment, and the conference mixing mode is therefore dynamically updated as the user behavior and the application environment of the conference change. The voice data of each participating user in the current mixing period is then mixed according to the determined conference mixing mode, and the mixed data is output to each participating user. Because the mixed data is obtained with a dynamically updated mixing mode that follows the changes in user behavior and in the conference application environment, the mixing can restore the real situation of the conference to the greatest extent, adapt to complex and changeable conference environments, improve the conference mixing effect, and improve the user experience.
Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a conference mixing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a conference mixing method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a voice collecting module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a voice detection module according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a part of an adaptive conference mix selection module according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a mixing module according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a conference mixing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 illustrates a conference mixing method according to an embodiment of the present invention. Referring to fig. 1, the process flow includes the following steps.
Step 101, acquiring voice data of each participating user of the conference in the current mixing period.
The voice data of each participating user in the current mixing cycle comprises a plurality of sampling data obtained by continuously sampling the voice signal of the corresponding participating user in the current mixing cycle. Illustratively, the voice data of the ith participant user of the conference in the current mixing cycle comprises a plurality of sampling data obtained by continuously sampling the voice signal of the ith participant user in the current mixing cycle. i is a positive integer.
And 102, determining a conference mixing mode based on the voice data of each participating user in the current mixing period.
Illustratively, the conference mixing mode is one of a first mixing mode, a second mixing mode, and a third mixing mode. The first mixing mode comprises the averaging method; the second mixing mode comprises the clamping method; and the third mixing mode comprises the combination of the clamping and averaging methods, the combination of the competition and clamping methods, and the combination of the competition, averaging, and clamping methods.
The clamping method retains the participation information of every user during the conference and does not attenuate the voice signals as the application environment and the user behavior change; it is advantageous in application environments where mixing overflow caused by concurrent users is infrequent. The averaging method also retains the participation information of every user but attenuates the voice signals as the application environment and the user behavior change; it is advantageous when the number of concurrent users is small and mixing overflow is frequent. The competition method is advantageous in environments with a large number of concurrent users and competitive speaking behavior among the users.
And 103, mixing the voice data of each participating user in the current mixing period according to the determined conference mixing mode, and outputting the mixing data to each participating user.
In the embodiment of the present invention, the voice data of each participating user of the conference in the current mixing period is acquired, and the voice data of each participating user in the current mixing period comprises a plurality of sample data obtained by continuously sampling the voice signal of the corresponding participating user during the current mixing period, so the voice data of one mixing period comes from a short stretch of the user's voice signal. A conference mixing mode is determined based on the voice data of each participating user in the current mixing period; because a short stretch of the users' voice signals reflects the user behavior and the conference application environment during that stretch, the mixing mode for that stretch is determined according to the user behavior and the conference application environment, and the conference mixing mode is therefore dynamically updated as the user behavior and the application environment of the conference change. The voice data of each participating user in the current mixing period is then mixed according to the determined conference mixing mode, and the mixed data is output to each participating user. Because the mixed data is obtained with a dynamically updated mixing mode that follows the changes in user behavior and in the conference application environment, the mixing can restore the real situation of the conference to the greatest extent, adapt to complex and changeable conference environments, improve the conference mixing effect, and improve the user experience.
Fig. 2 illustrates a conference mixing method according to an embodiment of the present invention. Referring to fig. 2, the process flow includes the following steps.
Step 201, acquiring voice data of each participating user of the conference in the current mixing period.
The mixing period is the time period over which the conference mixing mode is determined, and may be a set value. Illustratively, the mixing period T may be 1 minute. One mixing period comprises a plurality of sampling periods, and the voice signal of each participating user of the conference is sampled once in every sampling period. If f is the voice sampling frequency, the number of sample data in one mixing period is T × f. Illustratively, T × f may equal p × 256 sampling periods, where p is a positive integer. The voice signals of all participating users are sampled in every sampling period to obtain the voice data of all participating users. The voice data of each participating user in the current mixing period comprises a plurality of sample data obtained by continuously sampling the voice signal of the corresponding participating user during the current mixing period.
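As a purely numerical illustration (not part of the patent text; the sampling frequency f = 8000 Hz below is an assumed value), the relation between T, f, and the 256-sample sub-periods can be checked as follows.

    T = 60                                # assumed mixing period, seconds
    f = 8000                              # assumed narrowband voice sampling frequency, Hz
    samples_per_period = T * f            # 480000 sample data per mixing period
    p = samples_per_period // 256         # 1875 sub-periods of 256 samples each
    assert p * 256 == samples_per_period  # T x f = p x 256 as required
    print(samples_per_period, p)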
Illustratively, the execution subject of the method provided by this embodiment may be a server providing a teleconference service. The server can acquire, through a voice acquisition module, the voice data of the n participating users of L conferences in the mixing period, where L and n are both positive integers. The voice acquisition module may be connected to a CODEC (Coder-Decoder) voice chip or to an ADC (Analog-to-Digital Converter); both are used to transmit the digital voice signals of the participating users. The data output by the CODEC voice chip is serial, so the voice acquisition module must perform serial-to-parallel conversion and coding after sampling; the data output by the ADC may be parallel, so the voice acquisition module samples it directly without coding.
The structure of the voice acquisition module is described below, taking acquisition of voice data from a CODEC as an example. Fig. 3 is a schematic structural diagram of the voice acquisition module. Referring to fig. 3, the voice acquisition module includes a digital signal acquisition unit 11, an A-law-to-linear coding unit 12, a buffer unit 13, and a data synchronization unit 14. The digital signal acquisition unit 11 provides a corresponding sampling clock to sample the digital voice signal input by the CODEC according to the corresponding timing diagram; the input digital voice signal may be a PCM (Pulse Code Modulation) signal of the CODEC voice chip. Illustratively, the digital signal acquisition unit 11 acquires the digital voice data according to the interface timing of the CODEC voice chip and then passes it to the A-law-to-linear coding unit 12, which converts the 8-bit A-law PCM coded voice data into 13-bit linear coded voice data in accordance with the CCITT G.712 standard of the ITU-T. The buffer unit 13 includes a user extraction unit 130 and a plurality of user buffer units, the number of which is greater than or equal to the number of participating users of all conferences. The user extraction unit 130 extracts and classifies the linear voice data according to the different participating users and inputs the linear voice data of each participating user into the corresponding user buffer unit. Referring to fig. 3, the buffer units are divided by user into a user 1 buffer unit 131, a user 2 buffer unit 132, …, and a user n buffer unit 13n. Each buffer unit outputs the linear voice data of its participating user in parallel. The data synchronization unit 14 delays the data output in parallel so that the linear voice data (i.e., the sample data) of all participating users is output at the same sampling frequency.
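For illustration only (not part of the patent text), a minimal sketch of the 8-bit A-law to 13-bit linear expansion performed by unit 12; the segment/mantissa layout follows the common G.711 reference implementation, and the ±4032 output range matches the 13-bit linear range quoted later in the description.

    def alaw_to_linear13(a_val: int) -> int:
        """Expand one 8-bit A-law code word to a 13-bit linear sample in [-4032, 4032]."""
        a_val ^= 0x55                  # undo the even-bit inversion applied on the line
        sign = a_val & 0x80            # after the XOR, a set bit marks a positive sample
        seg = (a_val >> 4) & 0x07      # 3-bit segment (exponent)
        mant = a_val & 0x0F            # 4-bit mantissa
        if seg == 0:
            mag = (mant << 1) + 1                    # 1, 3, ..., 31
        else:
            mag = ((mant << 1) + 33) << (seg - 1)    # up to 4032 for seg = 7, mant = 15
        return mag if sign else -mag

    # Example: a few A-law bytes for one user.
    print([alaw_to_linear13(b) for b in (0x55, 0xD5, 0x2A, 0xAA)])   # [-1, 1, -4032, 4032]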
Step 202, determining a conference mixing mode based on the voice data of each participating user in the current mixing period.
This step 202 may include the following steps 2021 to 2025.
Step 2021, determining the number of the speaking users of the conference based on the voice data of each participating user in the current mixing cycle.
The speaking users of the conference may refer to the users who speak at the same time in one conference.
Step 2021 may include step 2021a and step 2021b as follows.
Step 2021a, determining whether each participating user speaks in each sub-period of the current mixing period based on the voice data of each participating user in the current mixing period, where the mixing period is divided into a plurality of sub-periods.
Step 2021a may include the following steps.
First, the absolute value of each sample datum of the ith user in the jth sub-period of the current mixing period and the accumulated value of those absolute values are determined.
Wherein j is a positive integer.
The sample data are linear voice data, i.e., specific numerical values. The bit width of the linear coding is 13 bits and the range of a sample datum is -4032 to +4032, so the absolute value of a single sample datum does not exceed 4032.
Second, the number of short-time impulses of the ith user in the jth sub-period is determined based on the absolute values of the sample data of the ith user in the jth sub-period.
A short-time impulse means that, within b consecutive sample data, the absolute values of the k consecutive sample data in the middle are all greater than a set energy value while the absolute values of the remaining sample data are all less than the set energy value; 1 ≤ k ≤ 3, b ≤ m, and m is the number of sample data of each participating user in the jth sub-period. Illustratively, b may be 43, k may be 3, and the set energy value may be 110. In this case, a short-time impulse means that, among 43 consecutive sample data, the absolute values of the 3 consecutive sample data k1, k2, and k3 in the middle are all greater than 110, the absolute values of the 20 consecutive sample data before k1 are all less than 110, and the absolute values of the 20 consecutive sample data after k3 are all less than 110.
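The following illustrative sketch (not part of the patent text) counts short-time impulses in one sub-period using the example values quoted above: b = 43, k = 3, threshold 110, i.e. 20 low samples, 3 high samples, 20 low samples.

    def count_short_impulses(samples, low=20, k=3, threshold=110):
        """Count isolated bursts: k consecutive |x| > threshold flanked on both
        sides by `low` consecutive |x| < threshold (b = low + k + low = 43)."""
        b = low + k + low
        count = 0
        for start in range(len(samples) - b + 1):
            window = [abs(s) for s in samples[start:start + b]]
            head, mid, tail = window[:low], window[low:low + k], window[low + k:]
            if all(v < threshold for v in head + tail) and all(v > threshold for v in mid):
                count += 1
        return count

    # Example: a 256-sample sub-period that is silent except for one isolated spike.
    sub_period = [0] * 256
    sub_period[100:103] = [300, -250, 400]
    print(count_short_impulses(sub_period))   # -> 1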
Third, whether the ith user speaks in the jth sub-period is determined based on the number of short-time impulses of the ith user in the jth sub-period and the accumulated value of the absolute values of the sample data.
The third step includes: when the accumulated value of the absolute values of the sample data of the ith participating user in the jth sub-period is greater than a set accumulated value and the number of short-time impulses of the ith participating user in the jth sub-period is greater than a set number, determining that the ith user speaks in the jth sub-period; and when the accumulated value of the absolute values of the sample data of the ith participating user in the jth sub-period is less than or equal to the set accumulated value and/or the number of short-time impulses of the ith participating user in the jth sub-period is less than or equal to the set number, determining that the ith user does not speak in the jth sub-period.
The larger the accumulated value of the absolute values of the sample data, the larger the energy and the higher the possibility that the signal is speech; the greater the number of short-time impulses, the greater the likelihood that the signal is noise. Illustratively, the set accumulated value may be 40960 and the set number may be 4. In this case, when the accumulated value of the absolute values of the sample data of the ith participating user in the jth sub-period is greater than 40960 and the number of short-time impulses of the ith participating user in the jth sub-period is greater than 4, the ith user is determined to speak in the jth sub-period; otherwise, when the accumulated value of the absolute values of the sample data of the ith participating user in the jth sub-period is less than or equal to 40960 and/or the number of short-time impulses of the ith participating user in the jth sub-period is less than or equal to 4, the ith user is determined not to speak in the jth sub-period.
The server can determine the speaking time of each participating user in the current mixing period through a voice detection module. Fig. 4 is a schematic structural diagram of the voice detection module. Referring to fig. 4, the voice detection module includes a plurality of voice detection sub-modules, each of which performs voice detection on the voice data of one participating user. Each voice detection sub-module includes a short-time impulse statistics unit, an accumulator unit, a voice data buffer fifo unit, a subtractor unit, a delay unit, a Ptt judgment unit, and a voice data unit. The structure of a voice detection sub-module is described below, taking the sub-module corresponding to the 1st user as an example.
The 1st user's voice data is input simultaneously to the short-time impulse statistics unit 311, the accumulator unit 312, and the voice data buffer fifo unit 313. The short-time impulse statistics unit 311 counts the number of short-time energy impulses in the voice once every 256 sample data (one sub-period). The bit width of a sample datum is 13 bits, so its dynamic range is -4032 to +4032. If the absolute values of three consecutive sample data are all greater than 110 and the absolute values of the 20 consecutive sample data before and the 20 consecutive sample data after those three are all less than 110, the three samples are counted as one short-time impulse.
The absolute value of the sample data serves as a measure of the user's speech energy, and the short-time impulse serves as a measure of noise. By counting the user's speech energy and the number of short-time energy impulses over a period of time (256 sampling periods), the voice detection module generates a Ptt signal, which weakens the influence of background noise on mixing overflow when a user is not speaking; and because a Ptt signal is generated for every period of user speech, choppy conference mixing is avoided when the number of users suddenly increases or decreases.
The accumulator unit 312 accumulates the absolute values of the sample data together with the output of the subtractor unit 314. The voice data buffer fifo unit 313 buffers the sample data; here a fifo with a 13-bit width and a depth of 512 is used. The subtractor unit 314 obtains the accumulated value output by the accumulator unit 312. When the number of sample data stored in the voice data buffer fifo unit 313 exceeds 256, the subtractor unit 314 reads out the earliest stored sample datum from the voice data buffer fifo unit 313 (which is correspondingly refilled with the latest sample datum), subtracts the absolute value of the read sample datum from the accumulated value output by the accumulator unit 312, and outputs the result to the accumulator unit 312 and the Ptt judgment unit 316, respectively. The Ptt judgment unit 316 receives the statistic of the short-time impulse statistics unit 311 and the value output by the subtractor unit 314, and outputs a Ptt judgment result once every 256 sampling periods (a Ptt signal indicates that the current sub-period is a speaking sub-period). If the value output by the subtractor unit 314 is greater than 40960 and the number of short-time impulses is less than 4, Ptt is judged valid and a Ptt signal is output; otherwise the energy requirement is not met, Ptt is invalid, and no Ptt signal is output. The delay unit 315 delays the sample data output by the voice data buffer fifo unit 313 so that it remains synchronized with the Ptt signal. The voice data unit 317 outputs the sample data delayed by the delay unit 315 in synchronization with the Ptt signal.
Specifically, the Ptt judgment unit 316 receives the outputs of the short-time impulse statistics unit 311 and the subtractor unit 314 once every sub-period (256 sampling periods). During the first 256 sampling periods of the mixing, the subtractor unit 314 reads 0 from the voice data buffer fifo unit 313, so the output of the subtractor unit 314 is also 0, and the accumulated value in the accumulator unit 312 is simply the accumulated value of the absolute values of the sample data. Beyond 256 sampling periods, for example at the 257th sampling period, the accumulator unit 312 outputs the accumulated value of the absolute values of 257 sample data; the subtractor unit 314 reads the 1st sample datum, subtracts its absolute value from that accumulated value to obtain the accumulated value of the absolute values of 256 sample data, and outputs it to the Ptt judgment unit 316. Meanwhile, the short-time impulse statistics unit 311 outputs the number of short-time impulses within those 256 sample data to the Ptt judgment unit 316. The Ptt judgment unit 316 then judges whether the user speaks during the sampling time (one sub-period) corresponding to those 256 sample data.
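The following illustrative sketch (not part of the patent text) condenses the per-sub-period decision described above into software, using the thresholds quoted for the Ptt judgment unit 316 (a 256-sample energy sum greater than 40960 and fewer than 4 short-time impulses); the impulse counter is passed in as a callable, for example the one sketched after the short-time impulse definition.

    def ptt_flag(energy_sum, impulse_count, energy_thresh=40960, max_impulses=4):
        """One Ptt decision for a 256-sample sub-period."""
        return energy_sum > energy_thresh and impulse_count < max_impulses

    def ptt_per_sub_period(user_samples, count_impulses, sub_len=256):
        """Split one user's samples into 256-sample sub-periods and decide Ptt for each."""
        flags = []
        for start in range(0, len(user_samples) - sub_len + 1, sub_len):
            block = user_samples[start:start + sub_len]
            energy = sum(abs(s) for s in block)        # the accumulator/subtractor sliding sum
            flags.append(ptt_flag(energy, count_impulses(block)))
        return flags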
Step 2021b, determining the maximum number of users speaking in the same sub-period to obtain the number of users speaking in the conference in the current mixing period.
Illustratively, the number of speaking users is counted for each sub-period of the current mixing period in turn, and the maximum of these counts is selected as the number of speaking users of the conference in the current mixing period.
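The following illustrative sketch (not part of the patent text) shows the counting: given one list of Ptt flags per user, the number of speaking users Mu is the largest number of users whose flags are set in the same sub-period.

    def speaking_users(ptt_flags_per_user):
        """Mu: maximum number of simultaneously set Ptt flags over the sub-periods."""
        return max(sum(flags) for flags in zip(*ptt_flags_per_user))

    # Example: 3 users, 4 sub-periods; at most two users speak in the same sub-period.
    print(speaking_users([[True, True, False, False],
                          [False, True, True, False],
                          [False, False, False, True]]))   # -> 2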
Step 2022, determining the overflow times of the conference voice data based on the voice data of each participating user in the current mixing cycle.
A conference voice data overflow means that the bit width of the digital signal formed by the sum of the sample data of all participating users in the same sampling period is greater than the bit width of the encoder; the encoder is used for encoding the mixed data. Because the encoder and the decoder are matched, the bit width of the A-law-to-linear coding unit shown in fig. 3 is consistent with the bit width of the encoder.
Step 2023, determining a conference voice data overflow maximum value based on the voice data of each participating user in the current mixing cycle.
The maximum overflow value of the conference voice data is the largest of the sums of sample data for which conference voice data overflow occurs.
Step 2024, determining the cumulative value of the conference voice data sum based on the voice data of each participating user in the current mixing cycle.
The accumulated value of the conference voice data sum is the accumulated value, over the sampling periods, of the absolute value of the sum of the sample data in each sampling period.
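The following illustrative sketch (not part of the patent text) shows how the overflow count Mo, the maximum overflow value Mm, and the accumulated value Ma defined in steps 2022 to 2024 could be computed from the per-sampling-period conference sums; the ±4032 encoder bound is taken from the 13-bit linear coding described above.

    MAX_13BIT = 4032   # assumed encoder range, matching the 13-bit linear coding above

    def conference_feature_values(per_user_samples):
        """Compute Mo (overflow count), Mm (largest overflowing sum magnitude) and
        Ma (accumulated absolute sum) for one conference over one mixing period."""
        mo, mm, ma = 0, 0, 0
        for samples in zip(*per_user_samples):    # one tuple of samples per sampling period
            total = sum(samples)
            ma += abs(total)
            if abs(total) > MAX_13BIT:            # the sum no longer fits the encoder bit width
                mo += 1
                mm = max(mm, abs(total))
        return mo, mm, ma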
The server may determine the number of speaking users of the conference in the mixing period (corresponding to step 2021b), the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum through a feature value operator sub-module included in the adaptive conference mix selection module. Fig. 5 is a schematic structural diagram of the feature value operator module 41. Referring to fig. 5, the feature value operator module 41 includes a conference configuration unit 400, a conference voice accumulation-and-sum unit 410, and conference 1 feature value calculation unit 411 through conference L feature value calculation unit 41L. The conference configuration unit 400 outputs the configuration information of the L conferences (including the correspondence between users and conferences). According to the configuration information of the L conferences, the conference voice accumulation-and-sum unit 410 accumulates, in the current sampling period, the sample data of the participating users of each conference to obtain the sum of the sample data of each conference in the current sampling period, and then sends each sum to the corresponding feature value calculation unit: the sum of the sample data of the 1st conference in the current sampling period is sent to the conference 1 feature value calculation unit 411, the sum of the sample data of the Lth conference in the current sampling period is sent to the conference L feature value calculation unit 41L, and so on. The conference 1 feature value calculation unit 411 through the conference L feature value calculation unit 41L respectively receive the sums from the conference voice accumulation-and-sum unit 410 and the Ptt signals of the corresponding users, and calculate the number of speaking users, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum for each conference.
The working principle of the feature value operator module is described taking the conference 1 feature value calculation unit 411 as an example. The conference 1 feature value calculation unit 411 includes an accumulated overflow count statistics unit 4111 (which determines the overflow count Mo of the conference voice data), an accumulated voice accumulated value unit 4112 (which determines the accumulated value Ma of the absolute values of the conference voice data sum), an accumulated voice extremum unit 4113 (which determines the maximum overflow value Mm of the conference voice data), and a Ptt simultaneously-valid maximum statistics unit 4114 (which determines the number of speaking users Mu). The Ptt simultaneously-valid maximum statistics unit 4114 receives the Ptt signals of the participating users in the 1st conference and, by judging whether the Ptt signals are present, counts the number of users whose Ptt signals are valid at the same time, thereby obtaining the number Mu of users with the maximum concurrent voice data in the conference. Meanwhile, the conference voice accumulation-and-sum unit 410 accumulates the sample data of the 1st conference in the current sampling period according to the configuration information of the L conferences and sends the accumulated sum to the accumulated overflow count statistics unit 4111, the accumulated voice accumulated value unit 4112, and the accumulated voice extremum unit 4113. Every T × f voice data (that is, every mixing period T), these three units recompute the overflow count Mo, the accumulated value Ma, and the maximum overflow value Mm for the current mixing period.
Step 2025, determining the conference mixing mode of the current mixing period based on the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum.
Embodiments of the present invention provide two ways of determining the conference mixing mode of the current mixing period.
In the first way, the conference mixing mode is determined from the values of four parameters: the number of speaking users Mu of the conference, the overflow count Mo of the conference voice data, the maximum overflow value Mm of the conference voice data, and the accumulated value Ma of the conference voice data sum.
In the second way, when the conference mixing mode of the current mixing period is determined, the effect of the mixing in the previous mixing period is considered in addition to the voice data of the current mixing period, which can improve the accuracy of the determined mixing mode. Based on the second way, step 2025 may include steps 2025a and 2025b.
Step 2025a, acquiring the number of mixing overflows that occurred when the voice data of each participating user in the previous mixing period was mixed, and the accumulated value of the mixed data sum.
A mixing overflow means that the bit width of the digital signal formed by the sum of the sample data of the same sampling period included in the data to be mixed of the participating users is greater than the bit width of the encoder. The accumulated value of the mixed data sum is the accumulated value of the absolute values of the sums of the sample data of the same sampling period included in the data to be mixed of the participating users.
Step 2025b, determining the conference mixing mode of the current mixing period based on the number of speaking users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, the accumulated value of the conference voice data sum, the number of mixing overflows that occurred in the previous mixing period, and the accumulated value of the mixed data sum in the previous mixing period.
Referring to fig. 5, the feature value operator module 41 further includes a feedback parameter unit 4115. The feedback parameter unit 4115 receives the data to be mixed of the previous mixing period (i.e., T × f sample data) of each participating user in the 1st conference, and calculates the feedback overflow count Q1 (the number of mixing overflows) and the accumulated value Q2 of the feedback voice sum (the accumulated value of the mixed data sum) in the previous mixing period. The feature value operator module 41 further includes a feature value calculation unit 4116. The feature value calculation unit 4116 collects the overflow count Mo, the maximum overflow value Mm, the accumulated value Ma of the voice sum, the number of concurrent speaking users Mu, the feedback overflow count Q1, and the accumulated value Q2 of the feedback voice sum, and looks up the mixing mode and its mixing parameters in a pre-established correspondence between feature values and mixing modes. Referring to fig. 6, the adaptive conference mix selection module further includes a selector unit 42. The selector unit 42 selects the averaging method, the clamping method, the competition method, or a combination algorithm for conference mixing (expressed as A, x, and vu) according to the conference feature values (Mo, Mm, Ma, and Mu). In this embodiment, A denotes the conference mixing algorithm, x denotes the mixing attenuation parameter, and vu denotes the number of effective mixing users. x and vu are mixing parameters: x is the mixing parameter of the averaging method or of any combination that includes the averaging method, and vu is the mixing parameter of the competition method or of any combination that includes the competition method. A = 1 denotes the clamping method, A = 2 the averaging method, A = 4 the combination of the averaging and clamping methods, A = 6 the combination of the competition and clamping methods, and A = 7 the combination of the competition, averaging, and clamping methods.
Illustratively, the correspondence relationship between the feature values and the mixing manner includes the following.
When the system is initialized, a large number of users will not join the conference voice immediately, so Q1, Q2, Mu, Mm, Mo, and Ma may all be 0 in the first few mixing periods; in this case A = 1, x = 1, and vu = Mu. It should be noted that, when the mixing mode of the 1st mixing period is determined, since the 1st mixing period is the initial mixing period and no mixing has taken place before it, Q1 and Q2 may both be set values, and the set value may be 0.
As users join the conference, Q1, Q2, Mu, Mm, Mo, and Ma are updated. Illustratively, when the conference mixing mode of the 2nd mixing period is determined, Q1 and Q2 are determined from the voice data and the mixing mode of the 1st mixing period, while Mu, Mm, Mo, and Ma are determined from the voice data of the 2nd mixing period. When the conference mixing mode of the 3rd mixing period is determined, Q1 and Q2 are determined from the voice data and the mixing mode of the 2nd mixing period, while Mu, Mm, Mo, and Ma are determined from the voice data of the 3rd mixing period. The conference mixing modes of the 4th and later mixing periods are determined analogously.
In the current mixing period, when Q1 < T × f/80 and Mo < T × f/80, the voice behavior of the users in the conference is considered to be within a controllable range most of the time and almost no voice signal overflows; in this case, regardless of the values of Q2, Mu, Mm, and Ma, A = 1, x = 1, and vu = Mu.
In the current mixing period, when T × f/80 < Mo: if the current mixing period is the 1st mixing period and |Mm| < 2 × 4032 and Mu > 4 are satisfied, then regardless of the values of Q1, Q2, and Ma, A = 6, x = 1, and vu = 3; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 and (Q2 × x') > (Ma × 0.9) are satisfied, then regardless of the values of Mm and Mu, A = 6, x = 1, and vu = 3; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 and (Q2 × x') > (Ma × 0.9) are not satisfied, then regardless of the values of Mm and Mu, A = 2, x = 2, and vu = Mu.
In the current mixing period, when T × f/80 < Mo and |Mm| < 2 × 4032 and Mu ≤ 4 are satisfied, then regardless of the values of Ma, Q1, and Q2, A = 2, x = 2, and vu = Mu.
In the current mixing period, when T × f/80 < Mo: if the current mixing period is the 1st mixing period and 4 × 4032 > |Mm| > 2 × 4032 and Mu ≤ 4 are satisfied, then regardless of the values of Ma, Q1, and Q2, A = 4, x = 2, and vu = Mu; if the current mixing period is not the 1st mixing period and T × f/80 < Q1 < T × f/20 is satisfied, then regardless of the values of Ma, Mm, Mu, and Q2, A = A', x = x' + 1, and vu = vu'; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 is satisfied, then regardless of the values of Ma, Mm, Mu, and Q2, A = A', x = x', and vu = vu'. Here A' is the mixing mode of the previous mixing period, x' is the mixing attenuation parameter of the mixing mode of the previous mixing period, and vu' is the number of effective users of the mixing mode of the previous mixing period.
In the current mixing period, when T × f/80 < Mo: if the current mixing period is the 1st mixing period and 4 × 4032 > |Mm| > 2 × 4032 and Mu > 4 are satisfied, then regardless of the values of Q1, Q2, and Ma, A = 6, x = 1, and vu = 3; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 and (Q2 × x') > (Ma × 0.9) are satisfied, then regardless of the values of Ma, Mu, and Mm, A = 6, x = 1, and vu = 3; if the current mixing period is not the 1st mixing period, Q1 ≤ T × f/80 is satisfied, and (Q2 × x') > (Ma × 0.9) is not satisfied, then regardless of the values of Mu and Mm, A = A', x = x', and vu = vu'; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 is not satisfied, then regardless of the values of Mu, Mm, Ma, and Q2, A = 4, x = x' + 1, and vu = Mu.
In the current mixing period, when T × f/80 < Mo: if the current mixing period is the 1st mixing period and |Mm| > 4 × 4032, Mu > 4, and |Ma| < 3 × 4032 are satisfied, then regardless of the values of Q1 and Q2, A = 7, x = 2, and vu = 3; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 and (Q2 × x') > (Ma × 0.9) are satisfied, then regardless of the values of Mm, Mu, and Ma, A = A', x = x', and vu = vu'; if the current mixing period is not the 1st mixing period, Q1 ≤ T × f/80 is not satisfied, and (Q2 × x') > (Ma × 0.9) is satisfied, then A = A', x = x' + 1, and vu = vu'; if the current mixing period is not the 1st mixing period, Q1 ≤ T × f/80 is satisfied, and (Q2 × x') > (Ma × 0.9) is not satisfied, then A = A', x = x', and vu = vu' + 1.
In the current mixing period, when T × f/80 < Mo: if the current mixing period is the 1st mixing period and |Mm| > 4 × 4032, Mu > 4, and |Ma| > 3 × 4032 are satisfied, then regardless of the values of Q1 and Q2, A = 4, x = 3, and vu = Mu; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 is satisfied, then regardless of the values of Q2, Mu, Ma, and Mm, A = A', x = x', and vu = vu'; if the current mixing period is not the 1st mixing period and Q1 ≤ T × f/80 is not satisfied, then regardless of the values of Q2, Mu, Ma, and Mm, A = A', x = x' + 1, and vu = vu'.
It should be noted that the mixing period T may be chosen based on the variation pattern of Ma. If T is too large, the design goal of updating the algorithm in real time is missed; if T is too small, the feature value calculation may be inaccurate. Illustratively, T may be set to 1 minute.
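The following partial sketch (not part of the patent text) shows how the first few rules of this correspondence could be written in software; only the no-overflow rule and the first overflow rule are included, and the final fallback of keeping the previous mode is a simplification of the sketch rather than a rule of the correspondence.

    def select_mode(T, f, first_period, Mo, Mm, Mu, Ma, Q1, Q2, prev):
        """Partial sketch of the mode-selection table above.
        A = 1 clamping, A = 2 averaging, A = 6 competition + clamping."""
        A_prev, x_prev, vu_prev = prev                     # (A', x', vu') of the previous period
        if Q1 < T * f / 80 and Mo < T * f / 80:
            return 1, 1, Mu                                # little overflow: plain clamping
        if Mo > T * f / 80:
            if first_period and abs(Mm) < 2 * 4032 and Mu > 4:
                return 6, 1, 3                             # many loud concurrent speakers
            if not first_period and Q1 <= T * f / 80 and Q2 * x_prev > Ma * 0.9:
                return 6, 1, 3
        # ... the remaining rules of the correspondence are omitted in this sketch ...
        return A_prev, x_prev, vu_prev                     # simplification: keep the previous mode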
And step 203, mixing the voice data of each participating user in the current mixing period according to the determined conference mixing mode.
This step 203 may include: step A, processing voice data of each participating user in the current sound mixing period based on the determined conference sound mixing mode; and step B, taking the processed voice data of each participating user in the current sound mixing period as the data to be mixed of each participating user in the current sound mixing period, and mixing the data to be mixed of each participating user in the current sound mixing period.
Step a may comprise: when the determined conference mixing mode is an averaging method or a combination method comprising the averaging method, averaging the sampling data included in the voice data of each participating user in the current mixing period; when the determined conference mixing mode is a competition method or a combination method comprising the competition method, carrying out zero clearing processing on sampling data included by the voice data of each participating user in the current mixing period; and when the determined conference mixing mode is a clamping method or a combination method comprising the clamping method, not processing the sampling data included in the voice data of each participating user in the current mixing period.
The averaging processing of the sampled data included in the speech data of each participating user in the current mixing cycle includes: each sample data is multiplied by the mix attenuation parameter x to reduce the value of each sample data.
The above zero clearing processing on the sampling data included in the speech data of each participating user in the current mixing cycle includes: calculating the absolute value of each sampling data, and determining vu sampling data with the maximum absolute value in each sampling period; the values of the sample data in each sample period except for the vu sample data are set to 0.
In short, when the mixing mode is the averaging method or a combination method including the averaging method, the sample data are averaged; when the mixing mode is the competition method or a combination method including the competition method, the sample data are zero-cleared.
The above mixing of the data to be mixed of each participating user in the current mixing period includes: first, determining the to-be-mixed data group of each participating user in the current mixing period based on the data to be mixed of each participating user in the current mixing period, where the to-be-mixed data group of the ith participating user in the current mixing period comprises the data to be mixed, in the current mixing period, of every participating user of the conference except the ith participating user; and second, mixing the to-be-mixed data group of each participating user in the current mixing period to obtain the mixed data of each participating user in the current mixing period, where the s-th mixed datum in the mixed data of the ith participating user is the sum of the s-th sample data of the to-be-mixed data in the to-be-mixed data group of the ith participating user.
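The following illustrative sketch (not part of the patent text) shows this per-user mixing: each participating user receives, sample by sample, the sum of the data to be mixed of all the other participating users.

    def mix_for_each_user(to_be_mixed):
        """to_be_mixed[i][s] is the s-th sample of user i's data to be mixed;
        returns out[i][s] = sum over all users j != i of to_be_mixed[j][s]."""
        n = len(to_be_mixed)
        out = []
        for i in range(n):
            others = [to_be_mixed[j] for j in range(n) if j != i]
            out.append([sum(samples) for samples in zip(*others)])
        return out

    print(mix_for_each_user([[100, 200], [10, 20], [1, 2]]))
    # -> [[11, 22], [101, 202], [110, 220]]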
Illustratively, when the determined conference mixing mode is the clamping method or a combination that includes the clamping method, step 203 may further include: step C, performing overflow processing on the mixed data of each participating user in the current mixing period.
The overflow processing of the mixed data of each participating user in the current mixing period includes: comparing the bit width of each mixed datum with the bit width of the encoder; when the mixed datum falls within the encoder's bit-width range, outputting the original datum; when the mixed datum falls outside the encoder's bit-width range and is larger than the value corresponding to the maximum bit width, outputting the value corresponding to the encoder's maximum bit width; and when the mixed datum falls outside the encoder's bit-width range and is smaller than the value corresponding to the minimum bit width, outputting the value corresponding to the encoder's minimum bit width.
In short, when the mixing mode is the clamping method or a combination that includes the clamping method, overflow processing is performed on the mixed data.
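As a sketch of this overflow processing, the following assumes a signed 13-bit encoder range, consistent with the 13-bit precision used by the data post-processing unit described below; the exact split of the range is an assumption.

```c
#include <stdint.h>

#define ENC_BITS 13
#define ENC_MAX  ((1 << (ENC_BITS - 1)) - 1)   /*  4095 for a signed 13-bit code */
#define ENC_MIN  (-(1 << (ENC_BITS - 1)))      /* -4096 */

/* Clamp one mixed sample to the encoder's representable range. */
static int16_t clamp_to_encoder(int32_t mixed_sample)
{
    if (mixed_sample > ENC_MAX)
        return (int16_t)ENC_MAX;   /* above range: output the encoder maximum */
    if (mixed_sample < ENC_MIN)
        return (int16_t)ENC_MIN;   /* below range: output the encoder minimum */
    return (int16_t)mixed_sample;  /* within range: output the original value */
}
```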
The server may implement mixing through the mixing module. Referring to fig. 6, the mixing module includes a data preprocessing unit 43, a mixing unit 44 and a data post-processing unit 45. The voice data of the n users output by the voice detection module, together with the corresponding voice recognition signals Ptt, are input to the feature value operator module 41 and the data preprocessing unit 43, respectively. The selector unit 42 controls the data preprocessing unit 43 and the data post-processing unit 45 so that they process the user voice data and the user mixed data, respectively. The data preprocessing unit 43 receives the voice data of the n users and calculates in real time, for each conference, the overflow count of the voice sum (Q1) and the accumulated value of the voice sum (Q2). When a conference has n users and the averaging method is adopted, the voice value of each corresponding user in the conference is divided by n-1; when a conference adopts the competition method, the selected users are retained under the control of the selector unit 42 and the voice data of the other users in the conference are forced to 0. In other words, the data preprocessing unit 43 performs averaging and/or zero-clearing on the sampled data included in the voice data of each participating user in the current mixing period; after this processing it, on the one hand, feeds the processed voice data of each participating user in the current mixing period to the mixing unit 44 as the data to be mixed and, on the other hand, calculates Q1 and Q2 from the processed data and outputs them to the feature value operator module 41. The mixing unit 44 receives the user voice data processed by the data preprocessing unit 43 and, according to the configuration information of the L conferences output by the conference configuration unit 400, uses a parallel adder to sum, for each user, the voice data of all participating users except that user, thereby obtaining the conference mixing result of each user, which is output to the data post-processing unit 45. When the clamping method is used, the data post-processing unit 45, under the control of the selector unit 42, checks the mixed datum output for each user in each sampling period; if it exceeds 13-bit precision, the datum is replaced by the maximum or minimum value of a 13-bit digital signal, and the processed mixed data are finally output.
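To make the feedback statistics concrete, the following sketch computes, over one mixing period, the mixing-overflow count Q1 and the accumulated absolute value Q2 of the sum of the data to be mixed. The names, types and software formulation are assumptions; the patented design computes these quantities in FPGA hardware.

```c
#include <stdint.h>

/* Compute the mixing-overflow count Q1 and the accumulated absolute sum Q2
 * of the data to be mixed over one mixing period of n_samples sampling periods. */
static void compute_q1_q2(const int16_t *const *to_mix, int n_users,
                          int n_samples, int32_t enc_max, int32_t enc_min,
                          uint32_t *q1, uint64_t *q2)
{
    *q1 = 0;
    *q2 = 0;
    for (int s = 0; s < n_samples; s++) {
        int64_t sum = 0;
        for (int u = 0; u < n_users; u++)
            sum += to_mix[u][s];           /* sum of one sampling period */
        if (sum > enc_max || sum < enc_min)
            (*q1)++;                       /* sum no longer fits the encoder */
        *q2 += (uint64_t)(sum < 0 ? -sum : sum);   /* accumulate |sum| */
    }
}
```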
And step 204, outputting the mixed voice data to each participating user.
Step 204 may include: and respectively outputting the mixed sound data of each participating user in the current mixed sound period to each participating user.
The server may output the mixed data of each participating user through the voice output module. The voice output module first converts the mixed data of each participating user from linear coding to A-law coding, then allocates the converted mixed data to time slots according to the time-slot requirements of the CODEC voice chip or the ADC, and finally outputs the allocated mixed data to the CODEC voice chip or the ADC. The voice acquisition module and the voice output module isolate the system's external interfaces from its internal processing interfaces, so that replacing the external codec or other AD converters does not affect the internal processing of the whole system, which enhances the adaptability of the system.
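The linear-to-A-law conversion is the standard ITU-T G.711 A-law companding step. As a hedged illustration, adapted from the widely used public-domain reference implementation (the patent itself gives no code), a 16-bit linear sample can be converted to an 8-bit A-law code as follows:

```c
#include <stdint.h>

/* Convert a 16-bit linear PCM sample to an 8-bit G.711 A-law code:
 * sign handling, segment search, 4-bit mantissa, even-bit inversion (0x55). */
static uint8_t linear_to_alaw(int16_t pcm)
{
    static const int16_t seg_end[8] = {
        0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF
    };
    uint8_t mask, aval;
    int16_t val = pcm >> 3;          /* A-law operates on 13-bit samples */
    int seg;

    if (val >= 0) {
        mask = 0xD5;                 /* sign bit = 1, with even-bit inversion */
    } else {
        mask = 0x55;                 /* sign bit = 0 */
        val = (int16_t)(-val - 1);
    }

    for (seg = 0; seg < 8; seg++)    /* find the compression segment */
        if (val <= seg_end[seg])
            break;

    if (seg >= 8)                    /* out of range: return the maximum code */
        return (uint8_t)(0x7F ^ mask);

    aval = (uint8_t)(seg << 4);
    aval |= (seg < 2) ? ((val >> 1) & 0x0F) : ((val >> seg) & 0x0F);
    return (uint8_t)(aval ^ mask);
}
```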
In the embodiment of the invention, Ptt signals and voice data are obtained through voice detection; for each conference, the maximum number Mu of users speaking simultaneously, the overflow count Mo, the maximum overflow value Mm and the accumulated value Ma of the sum of the conference voice data are calculated, together with the overflow count Q1 and the accumulated value Q2 fed back from the previous mixing. These values are used to index and look up the characteristic value A of the conference mixing algorithm, the attenuation-parameter characteristic value x and the effective-user characteristic value vu, and different conference mixing algorithms and their parameters are selected automatically and dynamically according to these characteristic values to mix the conference audio for each user, so that the most appropriate algorithm and parameters are invoked adaptively in different environments.
Fig. 7 shows a conference mixing apparatus provided in an embodiment of the present invention, and referring to fig. 7, the apparatus includes: an obtaining module 301, a determining module 302, a mixing module 303 and an output module 304.
The obtaining module 301 is configured to obtain voice data of each participating user in the conference in the current mixing cycle, where the voice data of each participating user in the current mixing cycle includes multiple sampling data obtained by continuously sampling voice signals of the corresponding participating user in the current mixing cycle.
The determining module 302 is configured to determine a conference mixing mode based on the voice data of each participating user in the current mixing cycle.
And the sound mixing module 303 is configured to mix the voice data of each participating user in the current sound mixing period according to the determined conference sound mixing manner.
An output module 304, configured to output the mixing data to each participating user.
Illustratively, the obtaining module 301 may be implemented by a voice collecting module shown in fig. 3; the determining module 302 can be implemented by using the voice detecting module and the adaptive conference mix selecting module shown in fig. 4 and 5; the mixing module 303 can be implemented by using the mixing module shown in fig. 6.
Optionally, the determining module 302 is configured to determine, based on the voice data of each participating user in the current mixing period, the number of sound users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the sum of the conference voice data, where an overflow of the conference voice data means that the bit width of the digital signal of the sum of the sampled data of all participating users in the same sampling period is greater than the bit width of the encoder, the encoder is configured to encode the mixed data, the maximum overflow value of the conference voice data is the largest of the overflowing sums of sampled data, and the accumulated value of the sum of the conference voice data is the accumulated value of the absolute value of the sum of the sampled data of each sampling period; and to determine the conference mixing mode of the current mixing period based on the number of sound users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the sum of the conference voice data.
Optionally, the determining module 302 is configured to determine, based on the voice data of each participating user in the current mixing cycle, whether each participating user pronounces in each sub-cycle of the current mixing cycle, where the mixing cycle is divided into several sub-cycles; and determining the maximum user number of the pronunciations in the same sub-period to obtain the pronunciation user number of each participating user in the conference of the current mixing period.
Optionally, the determining module 302 is configured to determine the absolute value of each sampled datum of the i-th user in the j-th sub-period of the current mixing period and the accumulated value of these absolute values; to determine the number of short-time impacts of the i-th user in the j-th sub-period based on the absolute values of the sampled data of the i-th user in the j-th sub-period, where a short-time impact means that, within b consecutive sampled data, the absolute values of the k consecutive sampled data in the middle are all larger than a set energy value while the absolute values of all other sampled data are smaller than the set energy value, with 1 ≤ k ≤ 3, b ≤ m, and m being the number of sampled data of each participating user in the j-th sub-period; and to determine whether the i-th user pronounces in the j-th sub-period based on the number of short-time impacts of the i-th user in the j-th sub-period and the accumulated value of the absolute values of the sampled data.
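The pronunciation decision for one sub-period can be sketched as follows. The window length b, the energy threshold and the final decision rule (sufficient accumulated energy and at most a few isolated short-time impacts) are assumptions for illustration, since the patent leaves the concrete thresholds to the implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

/* Decide whether user i pronounced in sub-period j, given its m samples.
 * A "short-time impact" is a window of b consecutive samples in which only
 * the k middle samples (1 <= k <= 3) exceed the energy threshold while every
 * other sample in the window stays below it; isolated impacts are treated as
 * clicks rather than speech. */
static bool pronounced_in_subperiod(const int16_t *samples, int m,
                                    int b, int energy_threshold,
                                    long min_energy, int max_impacts)
{
    long energy = 0;               /* accumulated absolute value of the samples */
    int impacts = 0;

    for (int s = 0; s < m; s++)
        energy += abs((int)samples[s]);

    for (int start = 0; start + b <= m; start++) {
        for (int k = 1; k <= 3; k++) {
            int mid = start + (b - k) / 2;      /* k middle samples of the window */
            bool is_impact = true;
            for (int s = start; s < start + b && is_impact; s++) {
                bool loud = abs((int)samples[s]) > energy_threshold;
                bool in_middle = (s >= mid && s < mid + k);
                if (loud != in_middle)
                    is_impact = false;          /* pattern broken */
            }
            if (is_impact) {
                impacts++;
                break;                          /* count each window at most once */
            }
        }
    }

    /* Assumed heuristic: speech has enough energy and is not made up of only
     * a few isolated impacts. */
    return energy >= min_energy && impacts <= max_impacts;
}
```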
Optionally, the mixing module 303 is configured to process, based on the determined conference mixing manner, the voice data of each participating user in the current mixing period; and taking the processed voice data of each participating user in the current mixing period as the data to be mixed of each participating user in the current mixing period, and mixing the data to be mixed of each participating user in the current mixing period.
Optionally, the mixing module 303 is configured to determine, based on data to be mixed of each participating user in the current mixing cycle, a data group to be mixed of each participating user in the current mixing cycle, where the data group to be mixed of the ith participating user in the current mixing cycle includes data to be mixed of each participating user except the ith participating user in the conference in the current mixing cycle; and respectively mixing the audio of each participating user in the current audio mixing period to obtain the audio mixing data of each participating user in the current audio mixing period, wherein the s-th audio mixing data in the audio mixing data of the i-th participating user is the sum of the s-th sampling data in each audio mixing data to be mixed in the audio mixing data group of the i-th participating user.
Optionally, the determining module 302 is configured to obtain the mixing-overflow count and the accumulated value of the sum of the mixed data from the previous mixing period, in which the voice data of each participating user of that period were mixed, where a mixing overflow means that the bit width of the digital signal of the sum, in the same sampling period, of the sampled data included in the data to be mixed of all participating users is greater than the bit width of the encoder, and the accumulated value of the sum of the mixed data is the accumulated value of the absolute value of that sum for each sampling period; and to determine the conference mixing mode of the current mixing period based on the number of sound users of the conference, the overflow count of the conference voice data, the maximum overflow value of the conference voice data, the accumulated value of the sum of the conference voice data, the mixing-overflow count of the previous mixing period, and the accumulated value of the absolute value of the sum of the mixed data in the previous mixing period.
Illustratively, the above apparatus may be implemented by an FPGA (Field-Programmable Gate Array) chip. Implementing the mixing system with an FPGA chip improves the parallelism of the whole mixing system and reduces system delay.
In the embodiment of the invention, realized by the above modules, the voice data of each participating user of the conference in the current mixing period is obtained; the voice data of each participating user in the current mixing period includes a plurality of sampled data obtained by continuously sampling the voice signal of the corresponding participating user in the current mixing period, so the voice data of one mixing period comes from a short stretch of the users' voice signals. The conference mixing mode is then determined based on the voice data of each participating user in the current mixing period; since a short stretch of the users' voice signals reflects the users' behavior and the conference application environment over that stretch, the mixing mode for that stretch is determined by the users' behavior and the application environment, and the conference mixing mode is therefore dynamically updated as they change. The voice data of each participating user in the current mixing period is mixed according to the determined conference mixing mode, and the mixed data is output to each participating user. Because the mixed data is obtained with a dynamically updated mixing mode that follows changes in user behavior and in the application environment of the conference, the mixed data restores the real situation of the conference to the greatest extent, adapts to complex and changeable conference environments, improves the conference mixing effect, and improves the user experience. In addition, the function-based modular design makes it easier to add or remove individual modules.
It should be noted that the conference mixing apparatus provided in the above embodiment is described with an exemplary division into functional modules; in practical applications, the functions may be distributed among different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the conference mixing apparatus and the conference mixing method provided by the above embodiments belong to the same concept; their specific implementation is detailed in the method embodiments and is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A conference mixing method, characterized in that the method comprises:
acquiring voice data of each participating user of a conference in a current sound mixing period, wherein the voice data of each participating user in the current sound mixing period comprises a plurality of sampling data obtained by continuously sampling voice signals of the corresponding participating user in the current sound mixing period;
determining the number of pronunciation users of the conference, the overflow times of the conference voice data, the maximum overflow value of the conference voice data and the accumulated value of the sum of the conference voice data based on the voice data of each participant user in the current mixing cycle, wherein the overflow of the conference voice data is that the bit width of a digital signal of the sum of the sampling data of each participant user in the same sampling cycle is greater than the bit width of an encoder, the encoder is used for encoding the mixing data, the maximum overflow value of the conference voice data is the sum of the largest sampling data in the sum of the sampling data of the conference voice data overflow, and the accumulated value of the sum of the conference voice data is the accumulated value of the absolute value of the sum of the sampling data of each sampling cycle;
determining a conference sound mixing mode of the current sound mixing period based on the number of the sound users of the conference, the overflow times of the conference voice data, the maximum overflow value of the conference voice data and the accumulated value of the conference voice data;
and mixing the voice data of each participating user in the current mixing period according to the determined conference mixing mode, and outputting the mixing data to each participating user.
2. The method of claim 1, wherein the determining the number of the speaking users of the conference based on the voice data of the respective participating users in the current mixing cycle comprises:
determining whether each participating user pronounces in each sub-period of the current mixing period based on the voice data of each participating user in the current mixing period, wherein the mixing period is divided into a plurality of sub-periods;
and determining the maximum user number of the pronunciations in the same sub-period to obtain the pronunciation user number of each participating user in the conference of the current mixing period.
3. The method according to claim 2, wherein the determining whether each of the participating users pronounces in each sub-period of the current mixing period based on the voice data of each of the participating users in the current mixing period comprises:
determining the absolute value of each sampling data of the ith user in the jth sub-period of the current mixing period and the cumulative value of the absolute values of the sampling data;
determining the number of short-time impacts of the ith user in the jth sub-period based on the absolute value of each sampled data of the ith user in the jth sub-period, wherein the short-time impacts are that, in b continuous sampled data, the absolute value of k continuous sampled data in the middle is greater than a set energy value, the absolute value of each sampled data except the k continuous sampled data in the middle is less than the set energy value, k is greater than or equal to 1 and less than or equal to 3, b is less than or equal to m, and m is the number of sampled data of each participating user in the jth sub-period;
and determining whether the ith user pronounces in the jth sub-period based on the short-time impact times of the ith user in the jth sub-period and the accumulated value of the absolute value of each sampling data.
4. The method according to claim 1, wherein the mixing the voice data of each participating user in the current mixing cycle according to the determined conference mixing mode, and outputting the mixed data to each participating user comprises:
processing the voice data of each participating user in the current mixing period based on the determined conference mixing mode;
taking the processed voice data of each participating user in the current sound mixing period as the data to be mixed of each participating user in the current sound mixing period, and mixing the data to be mixed of each participating user in the current sound mixing period;
and outputting mixed sound data to each participating user.
5. The method according to claim 4, wherein the mixing the data to be mixed of the participating users in the current mixing cycle comprises:
determining a data group to be mixed of each participating user in the current mixing period based on the data to be mixed of each participating user in the current mixing period, wherein the data group to be mixed of the ith participating user in the current mixing period comprises the data to be mixed of each participating user except the ith participating user in the conference in the current mixing period;
and mixing the data group to be mixed of each participating user in the current mixing period respectively to obtain the mixed data of each participating user in the current mixing period, wherein the s-th mixed datum in the mixed data of the i-th participating user is the sum of the s-th sampled data of the data to be mixed in the data group to be mixed of the i-th participating user.
6. The method of claim 5, wherein determining the conference mixing mode of the current mixing cycle based on the number of the sound users of the conference, the number of overflow times of the conference voice data, the maximum overflow value of the conference voice data, and the accumulated value of the conference voice data sum comprises:
acquiring the number of times of sound mixing overflow of each participating user during sound mixing of voice data of the previous sound mixing period in the previous sound mixing period and an accumulated value of the sum of sound mixing data, wherein the sound mixing overflow is that the bit width of a digital signal of the sum of sampling data of the same sampling period included by the data to be mixed of each participating user is larger than the bit width of the encoder, and the accumulated value of the sum of sound mixing data is the accumulated value of the absolute value of the sum of sampling data of the same sampling period included by the data to be mixed of each participating user;
and determining the conference sound mixing mode of the current sound mixing period based on the number of the sound users of the conference, the overflow frequency of the conference voice data, the overflow maximum value of the conference voice data, the accumulated value of the conference voice data sum, the overflow frequency of the sound mixing occurring in the previous sound mixing period and the accumulated value of the absolute value of the sound mixing data sum in the previous sound mixing period.
7. A conference mixing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a mixing module and a processing module, wherein the acquisition module is used for acquiring voice data of each participating user of a conference in a current mixing cycle, and the voice data of each participating user in the current mixing cycle comprises a plurality of sampling data obtained by continuously sampling voice signals of the corresponding participating user in the current mixing cycle;
the determining module is used for determining the number of the pronunciation users of the conference, the overflow times of the conference voice data, the maximum overflow value of the conference voice data and the accumulated value of the sum of the conference voice data based on the voice data of each participant user in the current mixing period, wherein the overflow of the conference voice data is that the bit width of a digital signal of the sum of the sampling data of each participant user in the same sampling period is greater than the bit width of an encoder, the encoder is used for encoding the mixing data, the maximum overflow value of the conference voice data is the sum of the largest sampling data in the sum of the sampling data of the conference voice data overflow, and the accumulated value of the sum of the conference voice data is the accumulated value of the absolute value of the sum of the sampling data of each sampling period;
determining a conference sound mixing mode of the current sound mixing period based on the number of the sound users of the conference, the overflow times of the conference voice data, the maximum overflow value of the conference voice data and the accumulated value of the conference voice data;
the audio mixing module is used for mixing audio of the voice data of each participating user in the current audio mixing period according to the determined conference audio mixing mode;
and the output module is used for outputting the mixed sound data to each participating user.
8. The apparatus of claim 7, wherein the determining module is configured to:
determining whether each participating user pronounces in each sub-period of the current mixing period based on the voice data of each participating user in the current mixing period, wherein the mixing period is divided into a plurality of sub-periods;
and determining the maximum user number of the pronunciations in the same sub-period to obtain the pronunciation user number of each participating user in the conference of the current mixing period.
CN201811139853.8A 2018-09-28 2018-09-28 Conference sound mixing method and device Active CN109451194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811139853.8A CN109451194B (en) 2018-09-28 2018-09-28 Conference sound mixing method and device


Publications (2)

Publication Number Publication Date
CN109451194A CN109451194A (en) 2019-03-08
CN109451194B (en) 2020-11-24

Family

ID=65544587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811139853.8A Active CN109451194B (en) 2018-09-28 2018-09-28 Conference sound mixing method and device

Country Status (1)

Country Link
CN (1) CN109451194B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114650388A (en) * 2022-03-22 2022-06-21 中国电子科技集团公司第五十四研究所 Multipath sound mixing method based on ARM platform


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2082396A1 (en) * 2007-10-17 2009-07-29 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio coding using downmix

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103259943A (en) * 2012-02-21 2013-08-21 深圳市东进软件开发有限公司 PSTN teleconference sound mixing method
CN102968995A (en) * 2012-11-16 2013-03-13 新奥特(北京)视频技术有限公司 Sound mixing method and device of audio signal
CN104219013A (en) * 2014-09-01 2014-12-17 厦门亿联网络技术股份有限公司 Method for multi-microphone sound mixing of video conference system
CN105704423A (en) * 2014-11-24 2016-06-22 中兴通讯股份有限公司 Voice output method and device
CN104539816A (en) * 2014-12-25 2015-04-22 广州华多网络科技有限公司 Intelligent voice mixing method and device for multi-party voice communication
CN105304079A (en) * 2015-09-14 2016-02-03 上海可言信息技术有限公司 Multi-party call multi-mode speech synthesis method and system

Also Published As

Publication number Publication date
CN109451194A (en) 2019-03-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant