WO2016082579A1 - Voice output method and apparatus - Google Patents

Voice output method and apparatus

Info

Publication number
WO2016082579A1
WO2016082579A1 PCT/CN2015/086171 CN2015086171W WO2016082579A1 WO 2016082579 A1 WO2016082579 A1 WO 2016082579A1 CN 2015086171 W CN2015086171 W CN 2015086171W WO 2016082579 A1 WO2016082579 A1 WO 2016082579A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
terminal
mixing
channel
predetermined format
Prior art date
Application number
PCT/CN2015/086171
Other languages
English (en)
French (fr)
Inventor
李星
黄家典
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016082579A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Definitions

  • The present invention relates to the field of communications, and in particular to a voice output method and apparatus.
  • Video conferencing is a common communication tool in all walks of life. With the rise of telepresence technology, multi-screen, multi-channel video conferencing places more emphasis on realistic face-to-face communication between the people in a conference, giving participants an immersive experience. At present, telepresence conference systems are used in administrative conferences, remote collaborative office work, aviation and navigation, and telemedicine.
  • With the application of telepresence technology in video conferencing, the multipoint controller (Micro Controller Unit, MCU for short) must not only support telepresence terminal video conferencing (three channels) but also remain fully compatible with ordinary video conferencing (mono and two-channel stereo). Therefore, the terminal audio data types that the MCU currently needs to support are mono, two-channel stereo and three-channel.
  • Conference voice mixing means that, when multiple terminals participate in a conference, multiple conference sites can speak at the same time, and the information from all speakers at the different sites is conveyed to all participants in the conference.
  • When terminals with different numbers of audio channels are present in a conference, the MCU must upmix or downmix the audio of each terminal before mixing so that the audio data of every terminal in the conference system has a uniform format; after the MCU has finished processing, the audio data must be upmixed or downmixed again for each terminal when it is returned.
  • Because of the needs of video communication, the MCU needs to support communication between mono, two-channel, and three-channel terminals.
  • When the MCU receives terminal data whose audio formats have different channel counts, it needs to convert the audio in the conference into a unified format. Converting from a lower channel count to a higher one is called upmixing, and converting from a higher channel count to a lower one is called downmixing.
  • The existing upmixing and downmixing techniques are:
  • Upmix mode mono->two-channel: 0.7071*M->L, 0.7071*M->R;
  • If the audio data of every terminal is upmixed or downmixed and the audio data of all terminals is then mixed, the MCU is heavily burdened and the voice quality is poor: mixing and superimposing too many input streams easily causes data overflow and excessive noise, and the superimposed sounds of the various terminals make speech hard to distinguish. Therefore, the related art suffers from a poor mixing effect.
  • The embodiments of the invention provide a voice output method and apparatus, so as to at least solve the problem of the poor mixing effect in the related art.
  • According to one aspect of the embodiments, a voice output method is provided, including: converting audio data input by a terminal into audio data of a predetermined format; performing mixing processing on the audio data of the predetermined format; and outputting the mixed audio data to the terminal.
  • Converting the audio data input by the terminal into audio data of a predetermined format comprises: when the audio data input by the terminal is mono audio data or two-channel audio data, converting the mono audio data or the two-channel audio data into three-channel audio data.
  • Converting the audio data input by the terminal into audio data of the predetermined format comprises: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting the audio data input by a predetermined number of terminals according to the sorting result; and converting the selected audio data into audio data of the predetermined format.
  • Performing mixing processing on the audio data of the predetermined format includes: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and mixing the selected audio data of the predetermined format.
  • The predetermined number is three.
  • According to another aspect of the embodiments, a voice output apparatus is provided, including: a conversion module configured to convert audio data input by a terminal into audio data of a predetermined format; a processing module configured to perform mixing processing on the audio data of the predetermined format; and an output module configured to output the mixed audio data to the terminal.
  • The conversion module is configured to: when the audio data input by the terminal is mono audio data or two-channel audio data, convert the mono audio data or the two-channel audio data into three-channel audio data.
  • The conversion module includes: a first determining unit configured to determine whether the terminal that inputs the audio data is a speaking terminal; a first sorting unit configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the first determining unit is yes; a first selecting unit configured to select the audio data input by a predetermined number of terminals according to the sorting result; and a converting unit configured to convert the selected audio data input by the predetermined number of terminals into audio data of a predetermined format.
  • The processing module includes: a second determining unit configured to determine whether the terminal that inputs the audio data is a speaking terminal; a second sorting unit configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the second determining unit is yes; a second selecting unit configured to select, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and a processing unit configured to mix the selected audio data of the predetermined format.
  • The predetermined number is three.
  • Through the embodiments of the invention, the audio data input by the terminal is converted into audio data of a predetermined format, the audio data of the predetermined format is mixed, and the mixed audio data is output to the terminal; this solves the problem of the poor mixing effect in the related art and thereby improves mixing quality.
  • FIG. 1 is a flow chart of a voice output method according to an embodiment of the present invention.
  • FIG. 2 is a block diagram showing the structure of a voice output device according to an embodiment of the present invention.
  • FIG. 3 is a structural block diagram of a conversion module 22 according to an embodiment of the present invention.
  • FIG. 4 is a block diagram showing the structure of a processing module 24 according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of mixing processing of an MCU multi-channel terminal according to an embodiment of the present invention.
  • FIG. 6 is a conversion diagram of a terminal audio data format according to an embodiment of the present invention.
  • FIG. 7 is a flowchart of conversion of a terminal audio data format according to an embodiment of the present invention.
  • FIG. 8 is a diagram of a video conference audio processing framework in accordance with an embodiment of the present invention.
  • FIG. 1 is a flowchart of a voice output method according to an embodiment of the present invention. As shown in FIG. 1 , the process includes the following steps:
  • Step S102: convert the audio data input by the terminal into audio data of a predetermined format;
  • Step S104: perform mixing processing on the audio data of the predetermined format;
  • Step S106: output the mixed audio data to the terminal.
  • Through the above steps, the audio data input by the terminal is converted into audio data of the predetermined format before mixing, which avoids overflow and noise in the mix, solves the problem of the poor mixing effect in the related art, and thereby improves mixing quality.
  • When converting audio data, the audio data input by the terminal can be converted into audio data of a predetermined format. For example, in an optional embodiment, when the audio data input by the terminal is mono audio data or two-channel audio data, all of the mono or two-channel audio data input by the terminal can be converted into three-channel audio data. This achieves compatibility between telepresence terminal video conferences and ordinary video conferences.
  • Before the audio data input by the terminals is mixed, the audio data input by a predetermined number of terminals may be selected for mixing. The selection may be made before conversion, or all input audio data may be converted first and the converted audio data of the predetermined number of terminals then mixed.
  • For example, in an optional embodiment, converting the audio data input by the terminal into audio data of the predetermined format may include: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting the audio data input by a predetermined number of terminals according to the sorting result; and converting the selected audio data into audio data of the predetermined format. This ensures both the quality of the mix and that the voices of speakers with good audio quality are received well.
  • In another optional embodiment, performing mixing processing on the audio data of the predetermined format may include: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and mixing the selected audio data of the predetermined format. This avoids mixing the audio of every speaker, which would give a poor mix in which nobody's speech is received well.
  • When mixing, the audio data input by a predetermined number of terminals may be selected. In a preferred embodiment, when there are multiple speaking terminals, the audio data input by the three terminals with the best audio quality may be selected and mixed. When the number of speaking terminals is less than or equal to three, the audio data input by all speaking terminals can be mixed.
  • This embodiment also provides a voice output device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated.
  • As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function.
  • Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 2 is a block diagram showing the structure of a voice output device according to an embodiment of the present invention. As shown in FIG. 2, the device includes a conversion module 22, a processing module 24, and an output module 26. The device is described below.
  • The conversion module 22 is configured to convert audio data input by the terminal into audio data of a predetermined format; the processing module 24 is connected to the conversion module 22 and configured to perform mixing processing on the audio data of the predetermined format; and the output module 26 is connected to the processing module 24 and configured to output the mixed audio data to the terminal.
  • When converting audio data, the audio data input by the terminal may be converted into audio data of a predetermined format. For example, in an optional embodiment, the conversion module may be configured to: when the audio data input by the terminal is mono audio data or two-channel audio data, convert the mono audio data or the two-channel audio data into three-channel audio data.
  • FIG. 3 is a structural block diagram of a conversion module 22 according to an embodiment of the present invention.
  • As shown in FIG. 3, the conversion module 22 includes a first determining unit 32, a first sorting unit 34, a first selecting unit 36, and a converting unit 38.
  • The conversion module 22 is described below.
  • The first determining unit 32 is configured to determine whether the terminal that inputs the audio data is a speaking terminal; the first sorting unit 34 is connected to the first determining unit 32 and configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the first determining unit 32 is yes; the first selecting unit 36 is connected to the first sorting unit 34 and configured to select the audio data input by a predetermined number of terminals according to the sorting result; and the converting unit 38 is connected to the first selecting unit 36 and configured to convert the selected audio data input by the predetermined number of terminals into audio data of a predetermined format.
  • FIG. 4 is a block diagram showing the structure of the processing module 24 according to an embodiment of the present invention. As shown in FIG. 4, the processing module 24 includes a second determining unit 42, a second sorting unit 44, a second selecting unit 46, and a processing unit 48.
  • The processing module 24 is described below.
  • The second determining unit 42 is configured to determine whether the terminal that inputs the audio data is a speaking terminal; the second sorting unit 44 is connected to the second determining unit 42 and configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the second determining unit 42 is yes; the second selecting unit 46 is connected to the second sorting unit 44 and configured to select, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and the processing unit 48 is connected to the second selecting unit 46 and configured to mix the selected audio data of the predetermined format.
  • The predetermined number may be three; that is, when more than three speaking terminals input audio data, the audio data input by the three speaking terminals ranked highest in audio quality is selected for mixing.
  • To address the poor mixing quality in the related art, the embodiments of the present invention further provide a multi-channel audio mixing method and system, which reduce the consumption of MCU resources, improve the operating efficiency of the MCU, improve the ability of multiple terminals to participate in a video conference, and improve the voice quality of the mix.
  • According to one aspect of the embodiments, a multi-channel audio mixing method is provided, including: sorting the voice quality input by the participating terminals in a video conference; selecting the terminals that are speaking in the conference for mixing (the mixing ends); upmixing so that the terminal audio data has a uniform format; and mixing the terminal audio data in the unified format.
  • The mixed audio data is then output according to two strategies, one for terminals that participate in the mix and one for terminals that do not actually participate in the mix.
  • According to another aspect of the embodiments, a mixing system is also provided, including: telepresence video conference terminals, ordinary video conference terminals, and an MCU.
  • The telepresence video conference terminals and the ordinary video conference terminals may comprise multiple participating terminals. A video conference terminal is configured to receive and output audio signals; the MCU is configured to receive audio data from the audio transmission terminals in the video conference system, mix the audio data, and send it to each video conference terminal.
  • With the above multi-channel voice processing method and system, interworking between mono, two-channel, and multi-channel video terminals can be achieved.
  • Optimizing the mixing strategy in the MCU reduces resource consumption and ensures that a video conference can support multiple video terminals at the same time.
  • FIG. 5 is a flowchart of MCU multi-channel terminal mixing processing according to an embodiment of the present invention. As shown in Figure 5, the process includes the following steps:
  • Step S502: receive a frame signal of the audio data input by the terminal.
  • Step S504: based on the audio data collected by the video terminals, calculate the current energy and/or signal-to-noise ratio of each terminal's audio data using a voice activity detection (VAD) method.
  • Step S506: the speaking terminals in the current frame of the conference can be determined by a signal-to-noise ratio threshold.
  • A high signal-to-noise ratio means that someone is currently speaking, and high energy indicates that the current speaker is relatively loud.
  • Step S508: sort the current terminals by audio quality from high to low, combining the energy and the signal-to-noise ratio of the audio data.
  • Step S510: if more than 3 terminals are currently speaking, the 3 highest-ranked terminals are selected for the actual audio mixing; if fewer than 3 terminals are speaking, the terminals that are actually speaking are selected to participate in the mix.
  • Step S512: for the selected mixing terminals, upmix the mono and two-channel stereo terminal audio data to three-channel data, so that the audio data of all mixing terminals is in the three-channel format.
  • Step S514: mix the upmixed terminal audio data in the three-channel audio format.
  • The mixing strategy is as follows: the output for a terminal that participates in the mix is the mix of the voices of the other participating terminals, excluding its own (a terminal cannot hear its own voice); the output for a terminal that does not participate in the mix is the mix of all participating terminals.
  • An example is given below:
  • Example 1: there are 5 terminals in the video conference: A, B, C, D and E. The speaking ends are A, C and D; A and B are mono terminals, C and E are two-channel terminals, and D is a three-channel terminal.
  • According to the above steps, the terminals in the video conference are sorted as A, C, D, B, E, and the terminals participating in the mix are A, C and D; the audio of terminals A and C is then upmixed to three-channel audio data and mixed.
  • The audio output of A is C+D.
  • The audio output of B is A+C+D.
  • The audio output of C is A+D.
  • The audio output of D is A+C.
  • The audio output of E is A+C+D.
  • Step S516: since the data after mixing is all in the three-channel audio format, it can be downmixed to the mono and two-channel stereo audio formats.
  • Step S518: the audio output for each terminal is chosen according to that terminal's audio format, which completes the mixing.
  • Mixing between mono, two-channel, and three-channel terminals can be realized simply by the above method.
  • Although the audio data is upmixed to three channels before the terminal audio data is mixed, the present invention preferably selects suitable terminals to participate in the mix, so the audio data of no more than 3 terminals is actually upmixed.
  • Thus, whether the video conference system is an ordinary video conference or a telepresence video conference, upmixing does not add much resource overhead, and when multiple video terminals take part in a conference the resources of the MCU are not consumed excessively, so the preferred mixing method provided by the present invention can solve the resource overhead problem of various video conferences.
  • FIG. 6 is a conversion diagram of a terminal audio data format according to an embodiment of the present invention.
  • FIG. 7 is a flowchart of converting a terminal audio data format according to an embodiment of the present invention. As shown in FIGS. 6 and 7, the conversion flowchart includes the following steps:
  • Step S702: before mixing, mono and two-channel stereo audio data can be upmixed to three-channel audio data.
  • The upmixing method is as follows, where:
  • L represents the left channel
  • C represents the center channel
  • R represents the right channel
  • M represents the mono channel
  • Mono upmixed to three channels: L = 0, M -> C, R = 0;
  • Two channels upmixed to three channels: L = L, C = 0, R = R.
  • Step S704: since the audio data after mixing is all in the three-channel audio format, the three-channel audio data is downmixed to the mono and two-channel stereo audio formats to meet the needs of the different terminal audio formats.
  • Three channels downmixed to two channels: L + 0.7071*C -> L, R + 0.7071*C -> R.
  • The mix obtained for an ordinary video conference by the above mixing method is consistent with the audio output by the mixing technique in the related art, so the upmixing and downmixing used in the mixing process do not change the mixed audio data between ordinary terminals.
  • FIG. 8 is a block diagram of a video conference audio processing system in accordance with an embodiment of the present invention.
  • the system includes a line interface module 82, a decoding module 84, a mixing module 86 (same as the processing module 24 described above), and an encoding module 88.
  • The following operations can be carried out with this framework:
  • Audio data from a terminal in the telepresence video conference system is received through the line interface module 82, where the audio data is input to the terminal in mono, two-channel stereo, or three-channel form.
  • The decoding module 84 decodes the audio data.
  • The codecs currently used in video conferences include G711, G722, and AACLD.
  • When more than two terminals participate in a video conference, the mixing module 86 needs to run to implement mixed transmission of the audio in the conference.
  • Generally, during a video conference, a participating video conference terminal needs to receive simultaneously the sound of all participating telepresence video conference terminals other than the local end (that is, its own video conference system). Therefore, when people are speaking at two or more video conference systems at the same time, the MCU needs to mix these sounds so that every telepresence video conference system can hear them simultaneously.
  • The mixed audio data is then encoded by the encoding module 88 and transmitted.
  • In summary, with the embodiments provided by the present invention, interworking between mono, two-channel and multi-channel audio terminals in a video conference system can be realized simply. While overflow and noise in the mix are avoided and the quality of the audio output is ensured, the smaller MCU resource footprint allows more terminals to take part in a video conference at the same time.
  • Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from the one given here, or they may be fabricated as separate integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module.
  • Thus, the invention is not limited to any specific combination of hardware and software.
  • The above technical solution of the present invention can be applied to the field of communications; it solves the problem of the poor mixing effect in the related art and thereby improves mixing quality.

Landscapes

  • Engineering & Computer Science
  • Signal Processing
  • Multimedia
  • Computational Linguistics
  • Quality & Reliability
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Telephonic Communication Services

Abstract

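The present invention provides a voice output method and apparatus. The method includes: converting audio data input by a terminal into audio data of a predetermined format; performing mixing processing on the audio data of the predetermined format; and outputting the mixed audio data to the terminal. This solves the problem of the poor mixing effect in the related art and thereby improves mixing quality.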

Description

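Voice output method and apparatus
Technical Field
The present invention relates to the field of communications, and in particular to a voice output method and apparatus.
Background
Video conferencing is currently a common communication tool in all walks of life. With the rise of telepresence technology, this kind of multi-screen, multi-channel video conferencing places more emphasis on realistic face-to-face communication between the people in a conference, giving participants an immersive experience. At present, telepresence conference systems are used in administrative conferences, remote collaborative office work, aviation and navigation, telemedicine, and other fields.
With the application of telepresence technology in video conferencing, the multipoint controller (Micro Controller Unit, MCU for short) must not only support telepresence terminal video conferencing (three channels) but also remain fully compatible with ordinary video conferencing (mono and two-channel stereo). Therefore, the terminal audio data types that the MCU currently needs to support are mono, two-channel stereo and three-channel. When multiple parties take part in a video conference, the MCU needs to support voice mixing of multiple audio streams. Conference voice mixing means that, when multiple terminals participate in a conference, multiple conference sites can speak at the same time, and the information from all speakers at the different sites is conveyed to all participants in the conference. When terminals with different numbers of audio channels are present in a conference, the MCU must upmix or downmix the audio of each terminal before mixing so that the audio data of every terminal in the conference system has a uniform format; after the MCU has finished processing, the audio data must be upmixed or downmixed again for each terminal when it is returned.
Because of the needs of video communication, the MCU needs to support communication between mono, two-channel, and three-channel terminals. When the MCU receives terminal data whose audio formats have different channel counts, it needs to convert the audio in the conference into a unified format. Converting from a lower channel count to a higher one is called upmixing, and converting from a higher channel count to a lower one is called downmixing. The existing upmixing and downmixing techniques are: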
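Upmixing: mono -> two-channel: 0.7071*M -> L, 0.7071*M -> R;
mono -> three-channel: M -> C;
two-channel -> three-channel: L -> L, C = 0, R -> R;
Downmixing: three-channel -> mono: 0.7071*L + 0.7071*R + C -> M;
three-channel -> two-channel: L + 0.7071*C -> L, R + 0.7071*C -> R;
two-channel -> mono: L + R -> M.
(where L denotes the left channel, C the center channel, R the right channel, and M the mono channel)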
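As an illustration only, the related-art conversions listed above can be written as a small table of coefficient matrices. This is a minimal Python sketch, not code from the patent: the names `G`, `PRIOR_ART` and `convert` are hypothetical, and because the text gives only M -> C for mono to three-channel, setting L and R to zero in that case is an assumption.

```python
G = 0.7071  # coefficient quoted in the text

# Channel order (assumed for this sketch):
# mono = [M], stereo = [L, R], three-channel = [L, C, R].
# Each matrix row produces one output channel from the input channels.
PRIOR_ART = {
    ("mono", "stereo"): [[G], [G]],
    # The text states only M -> C; zero L and R is an assumption.
    ("mono", "three"): [[0.0], [1.0], [0.0]],
    ("stereo", "three"): [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
    ("three", "mono"): [[G, 1.0, G]],
    ("three", "stereo"): [[1.0, G, 0.0], [0.0, G, 1.0]],
    ("stereo", "mono"): [[1.0, 1.0]],
}


def convert(sample, src, dst):
    """Apply the related-art matrix for src -> dst to one multi-channel sample."""
    matrix = PRIOR_ART[(src, dst)]
    return [sum(w * x for w, x in zip(row, sample)) for row in matrix]
```

With these definitions, `convert([1.0], "mono", "stereo")` applies the 0.7071 gain to both output channels, and `convert(s, "three", "stereo")` folds the center channel into left and right.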
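If the audio data of every terminal is upmixed or downmixed and the audio data of all terminals is then mixed, the MCU is heavily burdened and the voice quality is poor: mixing and superimposing too many input streams easily causes data overflow and excessive noise, and the superimposed sounds of the various terminals make speech hard to distinguish. Therefore, the related art suffers from a poor mixing effect.
No effective solution has yet been proposed for the problem of the poor mixing effect in the related art.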
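The overflow risk mentioned in the background can be illustrated with a short Python sketch. It assumes 16-bit signed PCM samples, which the source does not specify, and the function names are hypothetical; it only shows why summing several loud inputs needs some form of limiting.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed sample range (16-bit PCM)


def wrap_int16(value):
    """Emulate the value a 16-bit accumulator holds after overflow."""
    return (value + 32768) % 65536 - 32768


def saturate_int16(value):
    """Clamp a sum to the 16-bit range instead of letting it wrap."""
    return max(INT16_MIN, min(INT16_MAX, value))


def mix_sample(samples, limit=saturate_int16):
    """Add one sample from each input stream and apply a limiting rule."""
    return limit(sum(samples))
```

Two inputs of 20000 already exceed the assumed range: with wrapping the sum turns into a large negative value, which is heard as noise, while saturation clips it to 32767. Limiting the number of mixed inputs, as the embodiments below do, reduces how often either case occurs.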
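Summary of the Invention
The embodiments of the present invention provide a voice output method and apparatus, so as to at least solve the problem of the poor mixing effect in the related art.
According to one aspect of the embodiments of the present invention, a voice output method is provided, including: converting audio data input by a terminal into audio data of a predetermined format; performing mixing processing on the audio data of the predetermined format; and outputting the mixed audio data to the terminal.
Further, converting the audio data input by the terminal into audio data of a predetermined format includes: when the audio data input by the terminal is mono audio data or two-channel audio data, converting the mono audio data or the two-channel audio data into three-channel audio data.
Further, converting the audio data input by the terminal into audio data of a predetermined format includes: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting the audio data input by a predetermined number of terminals according to the sorting result; and converting the selected audio data input by the predetermined number of terminals into audio data of the predetermined format.
Further, performing mixing processing on the audio data of the predetermined format includes: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and mixing the selected audio data of the predetermined format.
Further, the predetermined number is three.
According to another aspect of the embodiments of the present invention, a voice output apparatus is provided, including: a conversion module configured to convert audio data input by a terminal into audio data of a predetermined format; a processing module configured to perform mixing processing on the audio data of the predetermined format; and an output module configured to output the mixed audio data to the terminal.
Further, the conversion module is configured to: when the audio data input by the terminal is mono audio data or two-channel audio data, convert the mono audio data or the two-channel audio data into three-channel audio data.
Further, the conversion module includes: a first determining unit configured to determine whether the terminal inputting the audio data is a speaking terminal; a first sorting unit configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the first determining unit is yes; a first selecting unit configured to select the audio data input by a predetermined number of terminals according to the sorting result; and a converting unit configured to convert the selected audio data input by the predetermined number of terminals into audio data of the predetermined format.
Further, the processing module includes: a second determining unit configured to determine whether the terminal inputting the audio data is a speaking terminal; a second sorting unit configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the second determining unit is yes; a second selecting unit configured to select, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and a processing unit configured to mix the selected audio data of the predetermined format.
Further, the predetermined number is three.
Through the embodiments of the present invention, the audio data input by the terminal is converted into audio data of a predetermined format, the audio data of the predetermined format is mixed, and the mixed audio data is output to the terminal; this solves the problem of the poor mixing effect in the related art and thereby improves mixing quality.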
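Brief Description of the Drawings
The drawings described here provide a further understanding of the embodiments of the present invention and form part of this application. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a voice output method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a voice output apparatus according to an embodiment of the present invention;
FIG. 3 is a structural block diagram of a conversion module 22 according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a processing module 24 according to an embodiment of the present invention;
FIG. 5 is a flowchart of MCU multi-terminal mixing processing according to an embodiment of the present invention;
FIG. 6 is a conversion diagram of terminal audio data formats according to an embodiment of the present invention;
FIG. 7 is a flowchart of terminal audio data format conversion according to an embodiment of the present invention;
FIG. 8 is a diagram of a video conference audio processing framework according to an embodiment of the present invention.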
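Detailed Description
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
This embodiment provides a voice output method. FIG. 1 is a flowchart of a voice output method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S102: convert the audio data input by the terminal into audio data of a predetermined format;
Step S104: perform mixing processing on the audio data of the predetermined format;
Step S106: output the mixed audio data to the terminal.
Through the above steps, the audio data input by the terminal is converted into audio data of the predetermined format before mixing, which avoids overflow and noise in the mix, solves the problem of the poor mixing effect in the related art, and thereby improves mixing quality.
When converting audio data, the audio data input by the terminal can be converted into audio data of a predetermined format. For example, in an optional embodiment, when the audio data input by the terminal is mono audio data or two-channel audio data, all of the mono or two-channel audio data input by the terminal can be converted into three-channel audio data. This achieves compatibility between telepresence terminal video conferences and ordinary video conferences.
Before the audio data input by the terminals is mixed, the audio data input by a predetermined number of terminals may be selected for mixing. The selection may be made before the audio data is converted, or all input audio data may be converted first and the converted audio data of the predetermined number of terminals then mixed. For example, in an optional embodiment, converting the audio data input by the terminal into audio data of the predetermined format may include: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting the audio data input by a predetermined number of terminals according to the sorting result; and converting the selected audio data into audio data of the predetermined format. This ensures both the quality of the mix and that the voices of speakers with good audio quality are received well.
In another optional embodiment, performing mixing processing on the audio data of the predetermined format may include: determining whether the terminal inputting the audio data is a speaking terminal; if so, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data; selecting, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and mixing the selected audio data of the predetermined format. This avoids mixing the audio of every speaker, which would give a poor mix in which nobody's speech is received well.
When mixing, the audio data input by a predetermined number of terminals may be selected. In a preferred embodiment, when there are multiple speaking terminals, the audio data input by the three terminals with the best audio quality may be selected and mixed. When the number of speaking terminals is less than or equal to three, the audio data input by all speaking terminals may be mixed.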
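The selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text says the terminals are sorted by "energy and/or signal-to-noise ratio" without fixing how the two are combined, so ranking by SNR first and energy second is an assumption, and `Measurement` and `select_for_mixing` are hypothetical names.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    terminal: str
    speaking: bool   # result of the speaking-terminal decision
    snr: float       # signal-to-noise ratio of the current frame
    energy: float    # energy of the current frame


def select_for_mixing(measurements, limit=3):
    """Return up to `limit` speaking terminals, best audio quality first."""
    speakers = [m for m in measurements if m.speaking]
    # Assumed ordering: higher SNR first, energy breaks ties.
    ranked = sorted(speakers, key=lambda m: (m.snr, m.energy), reverse=True)
    return [m.terminal for m in ranked[:limit]]
```

When three or fewer terminals are speaking, the slice returns all of them, which matches the second case in the paragraph above.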
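This embodiment also provides a voice output apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 2 is a structural block diagram of a voice output apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a conversion module 22, a processing module 24 and an output module 26. The apparatus is described below.
The conversion module 22 is configured to convert audio data input by the terminal into audio data of a predetermined format; the processing module 24 is connected to the conversion module 22 and configured to perform mixing processing on the audio data of the predetermined format; the output module 26 is connected to the processing module 24 and configured to output the mixed audio data to the terminal.
When converting audio data, the audio data input by the terminal can be converted into audio data of a predetermined format. For example, in an optional embodiment, the conversion module may be configured to: when the audio data input by the terminal is mono audio data or two-channel audio data, convert the mono audio data or the two-channel audio data into three-channel audio data.
FIG. 3 is a structural block diagram of the conversion module 22 according to an embodiment of the present invention. As shown in FIG. 3, the conversion module 22 includes a first determining unit 32, a first sorting unit 34, a first selecting unit 36 and a converting unit 38. The conversion module 22 is described below.
The first determining unit 32 is configured to determine whether the terminal inputting the audio data is a speaking terminal; the first sorting unit 34 is connected to the first determining unit 32 and configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the first determining unit 32 is yes; the first selecting unit 36 is connected to the first sorting unit 34 and configured to select the audio data input by a predetermined number of terminals according to the sorting result; the converting unit 38 is connected to the first selecting unit 36 and configured to convert the selected audio data input by the predetermined number of terminals into audio data of the predetermined format.
FIG. 4 is a structural block diagram of the processing module 24 according to an embodiment of the present invention. As shown in FIG. 4, the processing module 24 includes a second determining unit 42, a second sorting unit 44, a second selecting unit 46 and a processing unit 48. The processing module 24 is described below.
The second determining unit 42 is configured to determine whether the terminal inputting the audio data is a speaking terminal; the second sorting unit 44 is connected to the second determining unit 42 and configured to sort the terminals by the energy and/or signal-to-noise ratio of the audio data when the determination result of the second determining unit 42 is yes; the second selecting unit 46 is connected to the second sorting unit 44 and configured to select, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; the processing unit 48 is connected to the second selecting unit 46 and configured to mix the selected audio data of the predetermined format.
The predetermined number may be three; that is, when more than three speaking terminals input audio data, the audio data input by the three speaking terminals ranked highest in audio quality is selected for mixing.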
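To address the poor mixing quality in the related art, the embodiments of the present invention also provide a multi-channel audio mixing method and system, which reduce the consumption of MCU resources, improve the operating efficiency of the MCU, improve the ability of multiple terminals to participate in a video conference, and improve the voice quality of the mix.
According to one aspect of the embodiments of the present invention, a multi-channel audio mixing method is provided, including: sorting the voice quality input by the participating terminals in a video conference; selecting the terminals that are speaking in the conference for mixing (the mixing ends); upmixing so that the terminal audio data has a uniform format; and mixing the terminal audio data in the unified format. The mixed audio data is then output according to two strategies, one for terminals that participate in the mix and one for terminals that do not actually participate in the mix.
According to another aspect of the embodiments of the present invention, a mixing system is also provided, including: telepresence video conference terminals, ordinary video conference terminals, and an MCU. The telepresence video conference terminals and the ordinary video conference terminals may comprise multiple participating terminals. A video conference terminal is used to receive and output audio signals; the MCU is used to receive audio data from the audio transmission terminals in the video conference system, mix the audio data, and send it to each video conference terminal.
With the above multi-channel voice processing method and system, interworking between mono, two-channel and multi-channel video terminals can be achieved. Optimizing the mixing strategy in the MCU reduces resource consumption and ensures that a video conference can support multiple video terminals at the same time.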
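FIG. 5 is a flowchart of MCU multi-terminal mixing processing according to an embodiment of the present invention. As shown in FIG. 5, the flow includes the following steps:
Step S502: receive a frame signal of the audio data input by the terminal.
Step S504: based on the audio data collected by the video terminals, calculate the current energy and/or signal-to-noise ratio of each terminal's audio data using a voice activity detection (VAD) method.
Step S506: the speaking terminals in the current frame of the conference can be determined by a signal-to-noise ratio threshold. A high signal-to-noise ratio means that someone is currently speaking, and high energy indicates that the current speaker is relatively loud.
Step S508: sort the current terminals by audio quality from high to low, combining the energy and the signal-to-noise ratio of the audio data.
Step S510: if more than 3 terminals are currently speaking, the 3 highest-ranked terminals are selected for the actual audio mixing; if fewer than 3 terminals are speaking, the terminals that are actually speaking are selected to participate in the mix.
Step S512: for the selected mixing terminals, upmix the mono and two-channel stereo terminal audio data to three-channel data, so that the audio data of all mixing terminals is in the three-channel format.
Step S514: mix the upmixed terminal audio data in the three-channel audio format. The mixing strategy is as follows: the output for a terminal that participates in the mix is the mix of the voices of the other participating terminals, excluding its own (a terminal cannot hear its own voice); the output for a terminal that does not participate in the mix is the mix of all participating terminals. An example is given below:
Example 1: there are 5 terminals in the video conference: A, B, C, D and E. The speaking ends are A, C and D; A and B are mono terminals, C and E are two-channel terminals, and D is a three-channel terminal. According to the above steps, the terminals in the video conference are sorted as A, C, D, B, E, and the terminals participating in the mix are A, C and D; the audio of terminals A and C is then upmixed to three-channel audio data and mixed. The audio output of A is C+D, the audio output of B is A+C+D, the audio output of C is A+D, the audio output of D is A+C, and the audio output of E is A+C+D.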
步骤S516,由于混音完成后的数据格式都为三声道音频数据格式,可以把数据下混为单声道和双声道立体声音频数据格式。
步骤S518,各个终端的音频输出可根据终端音频格式选择合适的输出即完成混音。
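The output strategy of step S514 (each mixing participant hears everyone but itself, every other terminal hears the full mix) can be sketched as below, using Example 1 as input. The function names `sum_frames` and `mix_outputs`, the representation of a frame as a list of (L, C, R) tuples, and the one-sample frame values are assumptions made for illustration; the sketch also performs a plain sum with no overflow handling.

```python
def sum_frames(frames, frame_len):
    """Sample-wise sum of three-channel frames; each frame is a list of (L, C, R)."""
    mixed = [[0.0, 0.0, 0.0] for _ in range(frame_len)]
    for frame in frames:
        for i, (left, center, right) in enumerate(frame):
            mixed[i][0] += left
            mixed[i][1] += center
            mixed[i][2] += right
    return [tuple(sample) for sample in mixed]


def mix_outputs(participants, all_terminals, frame_len):
    """participants maps a name to the upmixed frame of a mixing participant.
    Returns a dict mapping every terminal name to its three-channel output."""
    full_mix = sum_frames(participants.values(), frame_len)
    outputs = {}
    for name in all_terminals:
        if name in participants:
            # A participant must not hear its own voice.
            others = [f for other, f in participants.items() if other != name]
            outputs[name] = sum_frames(others, frame_len)
        else:
            outputs[name] = full_mix
    return outputs


# Example 1 from the text, with made-up one-sample frames.
participants = {
    "A": [(0.0, 1.0, 0.0)],    # mono, already upmixed
    "C": [(2.0, 0.0, 4.0)],    # two-channel, already upmixed
    "D": [(8.0, 16.0, 32.0)],  # native three-channel
}
outputs = mix_outputs(participants, ["A", "B", "C", "D", "E"], frame_len=1)
```

With these inputs, `outputs["A"]` contains only C and D, while `outputs["B"]` and `outputs["E"]` both contain A, C, and D, matching the outputs listed in Example 1.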
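The method above provides a simple way to mix mono, two-channel, and three-channel terminals. Although audio data is upmixed to three channels before mixing, the present invention preferably selects suitable terminals to take part in the mix, so in practice audio data from no more than three terminals is upmixed. As a result, whether the system is an ordinary or a telepresence video conference, upmixing does not add significant resource overhead, and MCU resources are not excessively consumed even when many video terminals are in the conference. The preferred mixing method provided by the present invention can therefore address the resource overhead problem in various kinds of video conference.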
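FIG. 6 is a diagram of the conversion between terminal audio data formats according to an embodiment of the present invention. FIG. 7 is a flowchart of that conversion. As shown in FIGS. 6 and 7, the conversion includes the following steps:
Step S702: before mixing, mono and two-channel stereo audio data may be upmixed to three-channel audio data. The upmix is defined as follows,
where L denotes the left channel, C the center channel, R the right channel, and M the mono channel:
Mono upmixed to three channels: L=0, M-->C, R=0;
Two-channel upmixed to three channels: L=L, C=0, R=R.
The related art uses three upmix conversions: mono to two-channel, mono to three-channel, and two-channel to three-channel. This embodiment reduces these to two upmix conversions, which is enough to mix audio data of different formats.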
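The two upmix rules of step S702 translate directly into code. The sketch below assumes frames are plain Python lists of float samples and represents a three-channel frame as a list of (L, C, R) tuples; the function names `upmix_mono` and `upmix_stereo` are hypothetical.

```python
def upmix_mono(mono):
    """Mono to three channels: L=0, C=M, R=0."""
    return [(0.0, m, 0.0) for m in mono]


def upmix_stereo(left, right):
    """Two-channel to three channels: L=L, C=0, R=R."""
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    return [(l, 0.0, r) for l, r in zip(left, right)]
```

Three-channel input needs no conversion, so these two functions cover every format that the selected mixing participants can supply.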
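Step S704: because the mixed audio data is all in the three-channel format, the three-channel audio data is downmixed to the mono and two-channel stereo formats to meet the audio format requirements of the different terminals.
This embodiment likewise reduces the three downmix conversions of the related art to two, where L denotes the left channel, C the center channel, R the right channel, and M the mono channel:
Three channels downmixed to mono: M=L+C+R;
Three channels downmixed to two channels: L+0.7071*C-->L, R+0.7071*C-->R.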
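The two downmix rules of step S704 can be sketched in the same style. As an assumption for illustration, a three-channel frame is a list of (L, C, R) float tuples, and the names `CENTER_GAIN`, `downmix_to_mono`, and `downmix_to_stereo` are hypothetical. The coefficient 0.7071 is taken from the text; it is approximately 1/sqrt(2), which corresponds to about -3 dB.

```python
CENTER_GAIN = 0.7071  # coefficient given in the text, roughly 1/sqrt(2)


def downmix_to_mono(frame):
    """Three channels to mono: M = L + C + R."""
    return [l + c + r for l, c, r in frame]


def downmix_to_stereo(frame):
    """Three channels to two: L' = L + 0.7071*C, R' = R + 0.7071*C."""
    left = [l + CENTER_GAIN * c for l, c, _ in frame]
    right = [r + CENTER_GAIN * c for _, c, r in frame]
    return left, right
```

Note that neither rule scales or limits the result, so a real mixer working on fixed-point PCM would still need to guard against overflow; the text does not describe how this is done.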
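Upmixing and downmixing the audio formats in this way allows telepresence terminals in a video conference to interwork not only with one another but also with ordinary video conference terminals. The upmix and downmix do not corrupt the audio output of ordinary video terminals. A specific example follows:
Example 2: suppose the current video conference has four terminals, A, B, C, and D. A and B are speaking; A and C are mono, and B and D are two-channel. After upmixing to three channels, A becomes La=0, Ca=A, Ra=0, and B becomes Lb=Lb, Cb=0, Rb=Rb. After mixing, the output for terminal C is A+Lb+Rb, and the output for terminal D is Lb+0.7071*A on the left channel and Rb+0.7071*A on the right channel. This example shows that, for an ordinary video conference, the mix produced by the method above gives the same audio result as the mixing techniques of the related art, so the upmix and downmix used in the mixing process do not alter the mixed audio data exchanged between ordinary terminals.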
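The outputs in Example 2 follow term by term from the upmix and downmix rules. The derivation below is a worked restatement using the example's own symbols; the primed names L' and R' for the downmixed stereo channels and the two round-trip lines at the end are added here for illustration and are not part of the original text.

```latex
\begin{aligned}
\text{Upmix:}\quad & A \mapsto (L_a, C_a, R_a) = (0,\ A,\ 0), \qquad
                     B \mapsto (L_b, C_b, R_b) = (L_b,\ 0,\ R_b) \\
\text{Mix of A and B:}\quad & (L, C, R) = (L_b,\ A,\ R_b) \\
\text{Output for C (mono):}\quad & M = L + C + R = L_b + A + R_b \\
\text{Output for D (two-channel):}\quad & L' = L + 0.7071\,C = L_b + 0.7071\,A, \qquad
                                          R' = R + 0.7071\,C = R_b + 0.7071\,A \\
\text{Mono round trip:}\quad & M \mapsto (0,\ M,\ 0) \mapsto 0 + M + 0 = M \\
\text{Two-channel round trip:}\quad & (L, R) \mapsto (L,\ 0,\ R) \mapsto
      (L + 0.7071 \cdot 0,\ R + 0.7071 \cdot 0) = (L, R)
\end{aligned}
```

The two round-trip lines show why the conversion is transparent for ordinary terminals: a mono or two-channel signal that is upmixed and then downmixed to its own format comes back unchanged.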
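FIG. 8 is a framework diagram of a video conference audio processing system according to an embodiment of the present invention. The system includes a line interface module 82, a decoding module 84, a mixing module 86 (the same as the processing module 24 described above), and an encoding module 88. The framework supports the following operations:
The line interface module 82 receives audio data from a terminal in the telepresence video conference system, where the audio data is input to that terminal as mono, two-channel stereo, or three-channel audio. The decoding module 84 decodes the audio data. Codecs currently used in video conferencing include G.711, G.722, and AAC-LD. When more than two terminals take part in a video conference, the mixing module 86 must run to mix the conference audio for transmission. In general, during a video conference, each participating terminal needs to receive, at the same time, the sound of all participating telepresence video conference terminals other than itself (that is, its own video conference system). When people are speaking at two or more video conference systems simultaneously, the MCU therefore has to mix these sounds so that every telepresence video conference system can hear them at once. The mixed audio data is then encoded by the encoding module 88 and transmitted.
In summary, the embodiments provided by the present invention offer a simple way for mono, two-channel, and multi-channel audio terminals in a video conference system to interwork. They avoid the overflow and noise problems encountered in mixing and preserve the quality of the audio output, while the smaller MCU resource footprint allows more terminals to take part in a video conference at the same time.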
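Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from the one given here. Alternatively, they may each be fabricated as a separate integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Industrial Applicability
The above technical solution of the present invention can be applied in the field of communications. It solves the problem of poor mixing quality in the related art and thereby improves mixing quality.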

Claims (10)

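  1. A voice output method, comprising:
    converting audio data input by a terminal into audio data of a predetermined format;
    mixing the audio data of the predetermined format; and
    outputting the mixed audio data to the terminal.
  2. The method according to claim 1, wherein converting the audio data input by the terminal into audio data of a predetermined format comprises:
    when the audio data input by the terminal is mono audio data or two-channel audio data, converting the mono audio data or the two-channel audio data into three-channel audio data.
  3. The method according to claim 1, wherein converting the audio data input by the terminal into audio data of a predetermined format comprises:
    judging whether a terminal that inputs audio data is a speaking terminal;
    when the judgment result is yes, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data;
    selecting, according to the sorting result, the audio data input by a predetermined number of terminals; and
    converting the selected audio data input by the predetermined number of terminals into audio data of the predetermined format.
  4. The method according to claim 1, wherein mixing each piece of audio data of the predetermined format comprises:
    judging whether a terminal that inputs audio data is a speaking terminal;
    when the judgment result is yes, sorting the terminals by the energy and/or signal-to-noise ratio of the audio data;
    selecting, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and
    mixing the selected audio data of the predetermined format.
  5. The method according to claim 3 or 4, wherein the predetermined number is three.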
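  6. A voice output apparatus, comprising:
    a conversion module, configured to convert audio data input by a terminal into audio data of a predetermined format;
    a processing module, configured to mix the audio data of the predetermined format; and
    an output module, configured to output the mixed audio data to the terminal.
  7. The apparatus according to claim 6, wherein the conversion module is configured to:
    when the audio data input by the terminal is mono audio data or two-channel audio data, convert the mono audio data or the two-channel audio data into three-channel audio data.
  8. The apparatus according to claim 6, wherein the conversion module comprises:
    a first judging unit, configured to judge whether a terminal that inputs audio data is a speaking terminal;
    a first sorting unit, configured to, when the judgment result of the first judging unit is yes, sort the terminals by the energy and/or signal-to-noise ratio of the audio data;
    a first selecting unit, configured to select, according to the sorting result, the audio data input by a predetermined number of terminals; and
    a converting unit, configured to convert the selected audio data input by the predetermined number of terminals into audio data of the predetermined format.
  9. The apparatus according to claim 6, wherein the processing module comprises:
    a second judging unit, configured to judge whether a terminal that inputs audio data is a speaking terminal;
    a second sorting unit, configured to, when the judgment result of the second judging unit is yes, sort the terminals by the energy and/or signal-to-noise ratio of the audio data;
    a second selecting unit, configured to select, according to the sorting result, the audio data that was input by a predetermined number of terminals and converted into the predetermined format; and
    a processing unit, configured to mix the selected audio data of the predetermined format.
  10. The apparatus according to claim 8 or 9, wherein the predetermined number is three.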
PCT/CN2015/086171 2014-11-24 2015-08-05 Voice output method and apparatus WO2016082579A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410682654.7A CN105704423A (zh) 2014-11-24 2014-11-24 Voice output method and apparatus
CN201410682654.7 2014-11-24

Publications (1)

Publication Number Publication Date
WO2016082579A1 true WO2016082579A1 (zh) 2016-06-02

Family

ID=56073543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086171 WO2016082579A1 (zh) 2014-11-24 2015-08-05 Voice output method and apparatus

Country Status (2)

Country Link
CN (1) CN105704423A (zh)
WO (1) WO2016082579A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053399A (zh) * 2019-12-26 2021-06-29 浙江大搜车软件技术有限公司 一种多路音频混音方法及装置

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060696B (zh) * 2018-01-19 2021-06-15 腾讯科技(深圳)有限公司 混音方法及装置、终端及可读存储介质
CN109451194B (zh) * 2018-09-28 2020-11-24 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) 一种会议混音方法及装置
CN111354356B (zh) * 2018-12-24 2024-04-30 北京搜狗科技发展有限公司 一种语音数据处理方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102457700A (zh) * 2010-10-26 2012-05-16 中兴通讯股份有限公司 音频数据传输方法及系统
CN102655584A (zh) * 2011-03-04 2012-09-05 中兴通讯股份有限公司 一种远程呈现技术中媒体数据发送和播放的方法及系统
CN103379424A (zh) * 2012-04-24 2013-10-30 华为技术有限公司 一种混音方法及多点控制服务器

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7499075B2 (en) * 2004-09-28 2009-03-03 Seiko Epson Corporation Video conference choreographer
CN102065265B (zh) * 2009-11-13 2012-10-17 华为终端有限公司 实现混音的方法、装置和系统
CN102222503B (zh) * 2010-04-14 2013-08-28 华为终端有限公司 一种音频信号的混音处理方法、装置及系统

Also Published As

Publication number Publication date
CN105704423A (zh) 2016-06-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15862634

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15862634

Country of ref document: EP

Kind code of ref document: A1