WO2018094968A1 - Audio processing method, apparatus and media server - Google Patents

Audio processing method, apparatus and media server

Info

Publication number
WO2018094968A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
session
audio frequency
session member
channel
Prior art date
Application number
PCT/CN2017/082884
Other languages
English (en)
French (fr)
Inventor
牛超
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2018094968A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement by changing the amplitude
    • G10L21/0364 Speech enhancement by changing the amplitude for improving intelligibility
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic

Definitions

  • the present application relates to, but is not limited to, the field of communications, and more particularly to an audio processing method, apparatus, and media server.
  • Stereo multi-person session technology is already fairly mature, and the conference call is one of its more common application scenarios.
  • During a conference call in the CS (Circuit Switch) domain or the PS (Packet Switch) domain, a participant can hear every user talking, but because a conference call involves many users it is impossible to tell which user is speaking, so the user experience is poor.
  • Embodiments of the present invention provide an audio processing method, apparatus and media server, which are used to distinguish the speaking session member in a stereo multi-person session.
  • In one aspect, an embodiment of the present invention provides an audio processing method, including: sampling the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member; and setting, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
  • In an implementation, setting the audio feature used when each session member speaks according to the audio frequency values includes: determining whether there are audio frequency values within the same preset audio frequency range; if there are, setting a different audio feature for each session member within that range; and if there are not, setting the same or different audio features for the session members.
  • In an implementation, setting the audio feature used when each session member speaks according to the audio frequency values includes: sorting the audio frequency values of the session members; and setting different audio features for session members whose audio frequency values are adjacent in the sorted order.
  • In an implementation, the audio feature is set for a session member as follows: obtaining the volume information of each channel of the session member; obtaining one channel volume weight from a plurality of calculated channel volume weights; and adjusting the volume information of each channel of the session member according to that channel volume weight.
  • In an implementation, the method further includes: when any session member speaks, obtaining the audio feature of the speaking session member; and adjusting, according to that audio feature, the audio fed into the audio players of the session members other than the speaking session member.
  • In another aspect, an embodiment of the present invention further provides an audio processing apparatus, including: a sampling module configured to sample the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member; and a setting module configured to set, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
  • In an implementation, the setting module includes: a determining unit configured to determine whether there are audio frequency values within the same preset audio frequency range; and a first setting unit configured to set, when there are audio frequency values within the same preset audio frequency range, a different audio feature for each session member within that range, and to set the same or different audio features for the session members when there are not.
  • In an implementation, the setting module includes: a sorting unit configured to sort the audio frequency values of the session members; and a second setting unit configured to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
  • In an implementation, the setting module is configured to set the audio feature for a session member as follows: obtaining the volume information of each channel of the session member; obtaining one channel volume weight from a plurality of calculated channel volume weights; and adjusting the volume information of each channel of the session member according to that channel volume weight.
  • In an implementation, the apparatus further includes: an obtaining module configured to obtain, when any session member speaks, the audio feature of the speaking session member; and an input module configured to adjust, according to that audio feature, the audio fed into the players of the session members other than the speaking session member.
  • In another aspect, an embodiment of the present invention further provides a media server, including: a collector that samples the voice data of each session member of a stereo multi-person session; and a processor that determines an audio frequency value of each sampled session member and sets, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
  • In an implementation, the processor is configured to determine whether there are audio frequency values within the same preset audio frequency range; when there are, to set a different audio feature for each session member within that range; and when there are not, to set the same or different audio features for the session members.
  • In an implementation, the processor is configured to sort the audio frequency values of the session members and to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
  • In another aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the audio processing method described above.
  • In the embodiments of the present invention, the voice data of each session member is sampled, and the audio feature used when each session member speaks is then set according to that member's audio frequency value. Different audio features produce different stereo effects, giving session members a sense of a virtual seat when listening to other users speak, as if each session member sat in a different virtual position; it is therefore much easier to tell which member is speaking, and the user experience is better.
  • FIG. 1 is a flow chart showing a method of processing audio in a first embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of an audio processing apparatus in a second embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of an audio processing apparatus in a second embodiment of the present invention.
  • FIG. 4 is a topological diagram of a relationship between a media server and each conference member in a fourth embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing a virtual space location setting in a fourth embodiment of the present invention.
  • FIG. 6 is a schematic diagram showing another virtual space location setting in the fourth embodiment of the present invention.
  • the embodiments of the present invention provide a method, an apparatus, and a media server for processing audio.
  • the embodiments of the present invention are further described in detail below with reference to the accompanying drawings and embodiments. It is to be understood that the embodiments described herein are merely illustrative of the application and are not limiting.
  • a first embodiment of the present invention provides a method for processing audio.
  • the flow of the method is as shown in FIG. 1 and includes steps S102 to S104:
  • Stereo refers to a sound source that includes at least two channels, in contrast to mono. Mono has only one channel, so it merely lets the listener hear the sound; because stereo has multiple channels, it can create a sense of space for the listener, which is why a multi-channel sound source is called stereo. For example, when a film is shown in a cinema, its sound is played back in stereo. It is because the session audio is stereo that the subsequent audio adjustment is possible.
  • In the embodiments of the present invention, the voice data of each session member is sampled, and the audio feature used when each session member speaks is then set according to that member's audio frequency value. Different audio features produce different stereo effects, giving session members a sense of a virtual seat when listening to other users speak, as if each session member sat in a different virtual position; it is therefore much easier to tell which member is speaking, and the user experience is better.
  • In practice, setting the audio feature used when each session member speaks according to the audio frequency values can be done in several ways; two cases are described below.
  • First case: determine whether there are audio frequency values within the same preset audio frequency range. If there are, the corresponding users' voices are similar in character and are easily confused when heard, so a different audio feature is set for each session member within that same preset audio frequency range.
  • For example, when the audio frequency values of two session members fall within the same preset audio frequency range, their audio features must not be adjusted to be the same; if they were, the other session members could not tell which of the two users is speaking. Therefore, a different audio feature is set for each session member within the same preset audio frequency range, and the more clearly the audio features are distinguished the better; ideally it sounds as if one member speaks on the left and the other speaks on the right.
  • If there are no audio frequency values within the same preset audio frequency range, the session members are unlikely to be confused with one another, so the same audio feature can be set for every session member and the listening members can tell the users apart by audio frequency alone. Of course, to distinguish the speakers even more clearly, different audio features can still be set for different session members.
  • Second case: when setting the audio feature used when each session member speaks according to the audio frequency values, the audio frequency values of the session members can also be sorted; setting different audio features for session members whose audio frequency values are adjacent in the sorted order then separates the members whose audio is similar. In an actual deployment, three or four adjacent members can each be given different audio features to distinguish the speakers even better, so that users perceive different speakers in different directions and the user experience is improved.
  • After the audio features have been set, whenever any session member speaks the audio feature of the speaking member can be obtained, and the audio fed into the audio player of every other session member is adjusted according to that audio feature.
  • a second embodiment of the present invention provides an audio processing apparatus.
  • The structure of the apparatus is shown in FIG. 2 and includes: a sampling module 10 configured to sample the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member; and a setting module 20, coupled to the sampling module 10, configured to set, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
  • The setting module 20 may include: a determining unit configured to determine whether there are audio frequency values within the same preset audio frequency range; and a first setting unit configured to set, when there are audio frequency values within the same preset audio frequency range, a different audio feature for each session member within that range, and to set the same or different audio features for the session members when there are not.
  • The setting module 20 may further include: a sorting unit configured to sort the audio frequency values of the session members; and a second setting unit configured to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
  • The setting module is configured to set the audio feature for a session member as follows: obtain the volume information of each channel of the session member; obtain one channel volume weight from a plurality of calculated channel volume weights; and adjust the volume information of each channel of the session member according to that channel volume weight.
  • FIG. 3 shows the structure of the audio processing apparatus used in the conference call described above. It may further include an obtaining module 30, coupled to the setting module 20, configured to obtain the audio feature of the speaking session member whenever any session member speaks.
  • The input module 40, coupled to the obtaining module 30, is configured to adjust, according to that audio feature, the audio fed into the players of the session members other than the speaking session member.
  • A third embodiment of the present invention provides a media server, including: a collector that samples the voice data of each session member of a stereo multi-person session; and a processor that determines an audio frequency value of each sampled session member.
  • The processor sets, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
  • The processor may be configured to determine whether there are audio frequency values within the same preset audio frequency range; when there are, to set a different audio feature for each session member within that range; and when there are not, to set the same or different audio features for the session members.
  • The processor may also be configured to sort the audio frequency values of the session members and to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
  • A fourth embodiment of the present invention provides an audio processing method for a conference call, which mainly concerns an improvement to the mixing function of the media server responsible for mixing during a conference call in the CS domain and the PS domain.
  • In this embodiment, a model of virtual space positions for the conference members is built (each position is set on a virtual direction, and one virtual direction may hold one or more virtual space positions), giving the conference members the feeling that every participant has a virtual seat.
  • The characteristics of each conference member's voice are therefore combined with direction-aware mixing to distinguish the voice of every member of the conference call.
  • the method comprises steps (1) to (3).
  • the media server samples the sound data of each meeting member.
  • Each terminal sends its own uplink voice data to the media server over RTP (Real-time Transport Protocol), and the media server sends the downlink conference data to every terminal.
  • The media server can sample a terminal's uplink voice data during the period just after each member joins the conference (for example, the first 10 seconds after joining the conference call), because when a call is first connected the two parties normally exchange a brief greeting such as "Hello?" and "Hi!". Sampling is performed in turn, in the order in which members join the conference call.
  • After the media server obtains the sampled data, it invokes voiceprint recognition software/hardware to analyze each member's uplink sound data, computes a frequency value, and generates a result sorted from low to high.
  • The media server mixes the sound according to the sorted result using a direction-aware mixing function.
  • After the sorted result is obtained, the media server completes the mixing process according to preset mixing rules (the mixing process is the process of adjusting the audio features).
  • The mixing rules can take two dimensions into account: the member position (which member is expected to be at which virtual position) and the member voice characteristics. The positions can be divided into six: left-far, left-near, center-far, center-near, right-far and right-near; the member voice characteristics can be divided into a high-frequency group and a low-frequency group, or the grouping can simply follow gender (i.e. whether the voice frequency is high or low).
  • A VoLTE conference call in the PS domain supports at most 6 channels of members, that is, 6 terminals participate in addition to the local terminal.
  • By analyzing the voiceprints, 3 members are assigned to the low-frequency group and 3 members to the high-frequency group; the low-frequency group is assigned the virtual space positions on the side near the local terminal, and the high-frequency group is assigned the virtual space positions on the far side.
  • Within each frequency group, the members are assigned the left, center and right positions in order of frequency from low through middle to high, which gives exactly six virtual space positions, as shown in FIG. 5.
  • During mixing, the sound intensity is split one third each across the left, center and right positions, while the far side and the near side can be split 0.5 to 0.5, 0.45 to 0.55, 0.4 to 0.6, and so on. This maximally isolates the interference between the members' voices, increases the signal-to-noise ratio, and improves how easily each voice can be recognized.
  • If the conference call has only 3 channels of members, mixing only needs the one-third split across the left, center and right positions.
  • If there are more than 6 channels, the positions can be subdivided: the angle between the directions of two adjacent virtual space positions is 180/((N-1)/2-1) degrees, where N is the total number of participants. In this embodiment N is 9, so the angle is 60 degrees, as shown in FIG. 6.
  • The voice data of a conference-call member assigned to the virtual space position directly to the left is mixed only into the left channel; the voice data of a member assigned to the position directly to the right is mixed only into the right channel.
  • For the other members, whose virtual space positions are neither directly left nor directly right, the proportions in which their voice data is mixed into the left and right channels are computed from the angle value: for a position leaning left, the angle A from the left horizontal line (0 to 90 degrees) gives right-channel weight = tan(A/2)/2 and left-channel weight = 1 - right-channel weight; for a position leaning right, the angle B from the right horizontal line gives left-channel weight = tan(B/2)/2 and right-channel weight = 1 - left-channel weight.
  • Each conference member is assigned a virtual seat, that is, each conference member is given a corresponding virtual position; this is equivalent to wanting user A to sit on chair 1, user B on chair 2, user C on chair 3, and so on.
  • The purpose of this step is to place users with similar sound data in different directions.
  • The voice of each conference member is then tuned, that is, an audio feature is set for each conference member, so that each member is presented as sitting in the intended virtual seat.
  • The embodiment of the present invention as a whole achieves the effect of distinguishing each conference member in a conference call, and the user experience is good.
  • An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the audio processing method described above.
  • Each module/unit in the above embodiments may be implemented in hardware, for example by an integrated circuit that realizes its corresponding function, or in the form of a software functional module, for example by a processor executing a program/instructions stored in a memory to realize its corresponding function.
  • Embodiments of the invention are not limited to any specific form of combination of hardware and software.
  • In the embodiments of the present invention, the voice data of each session member is sampled, and the audio feature used when each session member speaks is then set according to that member's audio frequency value. Different audio features produce different stereo effects, giving session members a sense of a virtual seat when listening to other users speak, as if each session member sat in a different virtual position; it is therefore much easier to tell which member is speaking, and the user experience is better.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An audio processing method and apparatus, wherein the method includes: sampling the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member (S102); and setting, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks (S104).

Description

Audio processing method, apparatus and media server
Technical Field
The present application relates to, but is not limited to, the field of communications, and in particular to an audio processing method, apparatus and media server.
Background
Stereo multi-person session technology is already fairly mature, and the conference call is one of its more common application scenarios. During a conference call in the CS (Circuit Switch) domain or the PS (Packet Switch) domain, a participant can hear every user talking, but because a conference call involves many users it is impossible to tell which user is speaking, so the user experience is poor.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present invention provide an audio processing method, apparatus and media server, which are used to distinguish the speaking session member in a stereo multi-person session.
In one aspect, an embodiment of the present invention provides an audio processing method, including: sampling the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member; and setting, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
In an implementation, setting the audio feature used when each session member speaks according to the audio frequency values includes: determining whether there are audio frequency values within the same preset audio frequency range; if there are, setting a different audio feature for each session member within that range; and if there are not, setting the same or different audio features for the session members.
In an implementation, setting the audio feature used when each session member speaks according to the audio frequency values includes: sorting the audio frequency values of the session members; and setting different audio features for session members whose audio frequency values are adjacent in the sorted order.
In an implementation, the audio feature is set for a session member as follows: obtaining the volume information of each channel of the session member; obtaining one channel volume weight from a plurality of calculated channel volume weights; and adjusting the volume information of each channel of the session member according to that channel volume weight.
In an implementation, after the audio feature used when each session member speaks is set according to the audio frequency values, the method further includes: when any session member speaks, obtaining the audio feature of the speaking session member; and adjusting, according to that audio feature, the audio fed into the audio players of the session members other than the speaking session member.
In another aspect, an embodiment of the present invention further provides an audio processing apparatus, including: a sampling module configured to sample the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member; and a setting module configured to set, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
In an implementation, the setting module includes: a determining unit configured to determine whether there are audio frequency values within the same preset audio frequency range; and a first setting unit configured to set, when there are audio frequency values within the same preset audio frequency range, a different audio feature for each session member within that range, and to set the same or different audio features for the session members when there are not.
In an implementation, the setting module includes: a sorting unit configured to sort the audio frequency values of the session members; and a second setting unit configured to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
In an implementation, the setting module is configured to set the audio feature for a session member as follows: obtaining the volume information of each channel of the session member; obtaining one channel volume weight from a plurality of calculated channel volume weights; and adjusting the volume information of each channel of the session member according to that channel volume weight.
In an implementation, the apparatus further includes: an obtaining module configured to obtain, when any session member speaks, the audio feature of the speaking session member; and an input module configured to adjust, according to that audio feature, the audio fed into the players of the session members other than the speaking session member.
In another aspect, an embodiment of the present invention further provides a media server, including: a collector that samples the voice data of each session member of a stereo multi-person session; and a processor that determines an audio frequency value of each sampled session member and sets, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
In an implementation, the processor is configured to determine whether there are audio frequency values within the same preset audio frequency range; when there are, to set a different audio feature for each session member within that range; and when there are not, to set the same or different audio features for the session members.
In an implementation, the processor is configured to sort the audio frequency values of the session members and to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
In another aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the audio processing method described above.
In the embodiments of the present invention, during a stereo multi-person session the voice data of each session member is sampled, and the audio feature used when each session member speaks is then set according to that member's audio frequency value. Different audio features produce different stereo effects, giving session members a sense of a virtual seat when listening to other users speak, as if each session member sat in a different virtual position; it is therefore much easier to tell which member is speaking, and the user experience is better.
Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a flowchart of an audio processing method in a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an audio processing apparatus in a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an audio processing apparatus in the second embodiment of the present invention;
FIG. 4 is a topology diagram of the relationship between the media server and each conference member in a fourth embodiment of the present invention;
FIG. 5 is a schematic diagram of one virtual space position arrangement in the fourth embodiment of the present invention;
FIG. 6 is a schematic diagram of another virtual space position arrangement in the fourth embodiment of the present invention.
Detailed Description
Embodiments of the present invention provide an audio processing method, apparatus and media server. The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here merely serve to explain the present application and do not limit it.
A first embodiment of the present invention provides an audio processing method. The flow of the method is shown in FIG. 1 and includes steps S102 to S104.
S102: sample the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member. Here, stereo refers to a sound source that includes at least two channels, in contrast to mono. Mono has only one channel, so it merely lets the listener hear the sound; because stereo has multiple channels, it can create a sense of space for the listener, which is why a multi-channel sound source is called stereo. For example, when a film is shown in a cinema, its sound is played back in stereo. It is because the session audio is stereo that the subsequent audio adjustment is possible.
S104: set, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
In the embodiments of the present invention, during a stereo multi-person session the voice data of each session member is sampled, and the audio feature used when each session member speaks is then set according to that member's audio frequency value. Different audio features produce different stereo effects, giving session members a sense of a virtual seat when listening to other users speak, as if each session member sat in a different virtual position; it is therefore much easier to tell which member is speaking, and the user experience is better.
In practice, setting the audio feature used when each session member speaks according to the audio frequency values can be done in several ways; two cases are described below.
First case: determine whether there are audio frequency values within the same preset audio frequency range. If there are, the corresponding users' voices are similar in character and are easily confused when heard, so a different audio feature is set for each session member within that same preset audio frequency range.
For example, when the audio frequency values of two session members fall within the same preset audio frequency range, their audio features must not be adjusted to be the same; if they were, the other session members could not tell which of the two users is speaking. Therefore, a different audio feature is set for each session member within the same preset audio frequency range, and the more clearly the audio features are distinguished the better; ideally it sounds as if one member speaks on the left and the other speaks on the right.
If there are no audio frequency values within the same preset audio frequency range, the session members are unlikely to be confused with one another, so the same audio feature can be set for every session member and the listening members can tell the users apart by audio frequency alone. Of course, to distinguish the speakers even more clearly, different audio features can still be set for different session members.
Second case: when setting the audio feature used when each session member speaks according to the audio frequency values, the audio frequency values of the session members can also be sorted; setting different audio features for session members whose audio frequency values are adjacent in the sorted order then separates the members whose audio is similar. In an actual deployment, three or four adjacent members can each be given different audio features to distinguish the speakers even better, so that users perceive different speakers in different directions and the user experience is improved.
When setting the features, the audio feature can be set for a session member as follows: obtain the volume information of each channel of the session member; obtain one channel volume weight from a plurality of calculated channel volume weights; and adjust the volume information of each channel of the session member according to that channel volume weight.
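A minimal sketch of this per-channel adjustment, assuming interleaved 16-bit stereo PCM (an assumption, since the patent does not fix a sample format):

    import array

    def apply_channel_weights(stereo_pcm, left_weight, right_weight):
        """stereo_pcm: array('h') of interleaved L/R samples; weights in [0.0, 1.0]."""
        out = array.array('h', stereo_pcm)
        for i in range(0, len(out), 2):
            out[i] = int(out[i] * left_weight)           # left channel
            out[i + 1] = int(out[i + 1] * right_weight)  # right channel
        return out

    # Example: push a member fully to the right of the stereo image.
    frame = array.array('h', [1000, 1000, -800, -800])
    print(list(apply_channel_weights(frame, 0.0, 1.0)))  # [0, 1000, 0, -800]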
After the audio feature used when each session member speaks has been set according to the audio frequency values, whenever any session member speaks the audio feature of the speaking member can be obtained, and the audio fed into the audio player of every other session member is adjusted according to that audio feature.
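Continuing the sketches above, and again only as an assumed illustration of this step, the speaking member's adjusted audio would then be fed to every other member's player (the player interface here is hypothetical):

    def route_speech(speaker_id, stereo_frame, features, players):
        """features: member_id -> (left_weight, right_weight); players: member_id -> player."""
        left_w, right_w = features[speaker_id]
        # apply_channel_weights is the helper from the previous sketch.
        adjusted = apply_channel_weights(stereo_frame, left_w, right_w)
        for member_id, player in players.items():
            if member_id != speaker_id:       # the speaker does not hear themselves
                player.play(adjusted)         # hypothetical player interface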
A second embodiment of the present invention provides an audio processing apparatus whose structure is shown in FIG. 2. It includes: a sampling module 10 configured to sample the voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member; and a setting module 20, coupled to the sampling module 10, configured to set, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
The setting module 20 may include: a determining unit configured to determine whether there are audio frequency values within the same preset audio frequency range; and a first setting unit configured to set, when there are audio frequency values within the same preset audio frequency range, a different audio feature for each session member within that range, and to set the same or different audio features for the session members when there are not.
The setting module 20 may further include: a sorting unit configured to sort the audio frequency values of the session members; and a second setting unit configured to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
The setting module is configured to set the audio feature for a session member as follows: obtain the volume information of each channel of the session member; obtain one channel volume weight from a plurality of calculated channel volume weights; and adjust the volume information of each channel of the session member according to that channel volume weight.
FIG. 3 shows the structure of the audio processing apparatus used in the conference call described above. It may further include: an obtaining module 30, coupled to the setting module 20, configured to obtain, when any session member speaks, the audio feature of the speaking session member; and an input module 40, coupled to the obtaining module 30, configured to adjust, according to that audio feature, the audio fed into the players of the session members other than the speaking session member.
A third embodiment of the present invention provides a media server, including: a collector that samples the voice data of each session member of a stereo multi-person session; and a processor that determines an audio frequency value of each sampled session member and sets, according to the audio frequency values, an audio feature used when each session member speaks, where the audio feature is the volume weight of each channel when that session member speaks.
The processor may be configured to determine whether there are audio frequency values within the same preset audio frequency range; when there are, to set a different audio feature for each session member within that range; and when there are not, to set the same or different audio features for the session members.
The processor may also be configured to sort the audio frequency values of the session members and to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
A fourth embodiment of the present invention provides an audio processing method for a conference call, which mainly concerns an improvement to the mixing function of the media server responsible for mixing during a conference call in the CS domain and the PS domain. In this embodiment, a model of virtual space positions for the conference members is built (each position is set on a virtual direction, and one virtual direction may hold one or more virtual space positions), giving the conference members the feeling that every participant has a virtual seat. The characteristics of each conference member's voice are therefore combined with direction-aware mixing to distinguish the voice of every member of the conference call. The method includes steps (1) to (3).
(1) The media server samples the sound data of every conference member's stream.
As shown in FIG. 4, after the conference call is established, from the point of view of the voice data streams the seven participating terminals and the media server on the core network side form a star topology: each terminal sends its own uplink voice data to the media server over RTP (Real-time Transport Protocol), and the media server sends the downlink conference data to every terminal. The media server can sample a terminal's uplink voice data during the period just after each member joins the conference (for example, the first 10 seconds after joining the conference call), because when a call is first connected the two parties normally exchange a brief greeting such as "Hello?" and "Hi!". Sampling is performed in turn, in the order in which members join the conference call.
(2) The media server sorts the voices from low frequency to high frequency.
After the media server obtains the sampled data, it invokes voiceprint recognition software/hardware to analyze each member's uplink sound data, computes a frequency value, and generates a result sorted from low to high.
(3) The media server mixes the sound according to the sorted result using a direction-aware mixing function.
After the sorted result is obtained, the media server completes the mixing process according to preset mixing rules (the mixing process is the process of adjusting the audio features).
The mixing rules can take two dimensions into account: the member position (which member is expected to be at which virtual position) and the member voice characteristics. The positions can be divided into six: left-far, left-near, center-far, center-near, right-far and right-near; the member voice characteristics can be divided into a high-frequency group and a low-frequency group, or the grouping can simply follow gender (i.e. whether the voice frequency is high or low).
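As an illustrative sketch of step (2) and of the frequency grouping used below: the patent only states that voiceprint recognition software/hardware computes a frequency value, so a plain autocorrelation pitch estimate is assumed here as a stand-in, followed by the low-to-high sort and a split into low- and high-frequency groups:

    def estimate_frequency(samples, sample_rate=8000, fmin=60, fmax=400):
        """Rough fundamental frequency (Hz) of a short voiced PCM snippet."""
        best_lag, best_score = 0, 0.0
        for lag in range(sample_rate // fmax, sample_rate // fmin + 1):
            score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
            if score > best_score:
                best_lag, best_score = lag, score
        return sample_rate / best_lag if best_lag else 0.0

    def group_members(sample_buffers, sample_rate=8000):
        """sample_buffers: member_id -> list of PCM samples captured from the greeting."""
        freqs = {m: estimate_frequency(buf, sample_rate) for m, buf in sample_buffers.items()}
        ordered = sorted(freqs, key=freqs.get)     # low to high frequency
        half = len(ordered) // 2
        return ordered[:half], ordered[half:]      # (low-frequency group, high-frequency group)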
A VoLTE conference call in the PS domain supports at most 6 channels of members, that is, 6 terminals participate in addition to the local terminal. By analyzing the voiceprints, 3 members are assigned to the low-frequency group and 3 members to the high-frequency group; the low-frequency group is assigned the virtual space positions on the side near the local terminal, and the high-frequency group is assigned the virtual space positions on the far side.
Within each frequency group, the members are assigned the left, center and right positions in order of frequency from low through middle to high, which gives exactly six virtual space positions, as shown in FIG. 5. During mixing, the sound intensity is split one third each across the left, center and right positions, while the far side and the near side can be split 0.5 to 0.5, 0.45 to 0.55, 0.4 to 0.6, and so on. This maximally isolates the interference between the members' voices, increases the signal-to-noise ratio, and improves how easily each voice can be recognized.
In an implementation, if the conference call has only 3 channels of members, mixing only needs the one-third split across the left, center and right positions. If there are more than 6 channels, the positions can be subdivided: the angle between the directions of two adjacent virtual space positions is 180/((N-1)/2-1) degrees, where N is the total number of participants. In this embodiment N is 9, so the angle is 60 degrees; the arrangement is shown in FIG. 6.
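The angle rule can be sketched as follows (illustration only; which member ends up in which seat follows FIG. 5 and FIG. 6, and the near/far ordering below is an assumption):

    def virtual_positions(total_participants):
        """Seats for the more-than-six-channel case (N >= 7): (angle from left, depth)."""
        n_directions = (total_participants - 1) // 2       # directions per listener
        spacing = 180 / (n_directions - 1)                  # e.g. N = 9 -> 60 degrees
        seats = []
        for d in range(n_directions):
            angle = d * spacing                             # 0 = hard left, 180 = hard right
            seats.append((angle, "near"))
            seats.append((angle, "far"))
        return seats

    print(virtual_positions(9))   # eight seats spread over 0, 60, 120 and 180 degrees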
The voice data of a conference-call member assigned to the virtual space position directly to the left is mixed only into the left channel; the voice data of a member assigned to the virtual space position directly to the right is mixed only into the right channel; for the other members, whose virtual space positions are neither directly left nor directly right, the proportions in which their voice data is mixed into the left and right channels are computed from the angle value, as follows:
First determine whether the virtual position leans to the left or to the right. If it leans to the left, determine the angle A by which it deviates from the left horizontal line, where A ranges from 0 to 90 degrees; then the left-channel weight = 1 - the right-channel weight, and the right-channel weight = tan(A/2)/2. If it leans to the right, determine the angle B by which it deviates from the right horizontal line, where B ranges from 0 to 90 degrees; then the right-channel weight = 1 - the left-channel weight, and the left-channel weight = tan(B/2)/2.
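A minimal sketch of these two formulas (function and parameter names are assumptions made for the example):

    import math

    def channel_weights(angle_deg, leans_left=True):
        """Return (left_weight, right_weight) for one member's virtual direction."""
        minor = math.tan(math.radians(angle_deg) / 2) / 2   # share for the opposite-side channel
        if leans_left:
            return 1.0 - minor, minor
        return minor, 1.0 - minor

    print(channel_weights(0.0, leans_left=True))    # (1.0, 0.0)  hard left
    print(channel_weights(60.0, leans_left=True))   # about (0.71, 0.29)
    print(channel_weights(90.0, leans_left=True))   # (0.5, 0.5)  straight ahead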
In this embodiment, once the sound data of every conference member has been obtained, a virtual seat can be assigned to each conference member, that is, each conference member is given a corresponding virtual position; this is equivalent to wanting user A to sit on chair 1, user B on chair 2, user C on chair 3, and so on. The purpose of this step is to place users with similar sound data in different directions. After the virtual positions in which the conference members are meant to sit have been set, each conference member's voice is tuned, that is, an audio feature is set for each conference member, so that each member is presented as sitting in the intended virtual seat.
The embodiment of the present invention as a whole achieves the effect of distinguishing each conference member in a conference call, and the user experience is good.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the audio processing method described above.
A person of ordinary skill in the art will understand that all or some of the steps of the above method can be completed by a program instructing the relevant hardware (for example, a processor), and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware, for example by an integrated circuit that realizes its corresponding function, or in the form of a software functional module, for example by a processor executing a program/instructions stored in a memory to realize its corresponding function. The embodiments of the present invention are not limited to any particular combination of hardware and software.
Although the embodiments of the present invention have been disclosed for the purpose of illustration, those skilled in the art will recognize that various improvements, additions and substitutions are possible; the scope of the present application should therefore not be limited to the above embodiments.
Industrial Applicability
In the embodiments of the present invention, during a stereo multi-person session the voice data of each session member is sampled, and the audio feature used when each session member speaks is then set according to that member's audio frequency value. Different audio features produce different stereo effects, giving session members a sense of a virtual seat when listening to other users speak, as if each session member sat in a different virtual position; it is therefore much easier to tell which member is speaking, and the user experience is better.

Claims (14)

  1. An audio processing method, comprising:
    sampling voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member;
    setting, according to the audio frequency values, an audio feature used when each session member speaks, wherein the audio feature is the volume weight of each channel when that session member speaks.
  2. The processing method according to claim 1, wherein setting the audio feature used when each session member speaks according to the audio frequency values comprises:
    determining whether there are audio frequency values within a same preset audio frequency range;
    if there are, setting a different audio feature for each session member within the same preset audio frequency range;
    if there are not, setting same or different audio features for the session members.
  3. The processing method according to claim 1, wherein setting the audio feature used when each session member speaks according to the audio frequency values comprises:
    sorting the audio frequency values of the session members;
    setting different audio features for session members whose audio frequency values are adjacent in the sorted order.
  4. The processing method according to claim 1, wherein the audio feature is set for a session member as follows:
    obtaining volume information of each channel of the session member;
    obtaining one channel volume weight from a plurality of calculated channel volume weights;
    adjusting the volume information of each channel of the session member according to the channel volume weight.
  5. The processing method according to any one of claims 1 to 4, wherein after setting the audio feature used when each session member speaks according to the audio frequency values, the method further comprises:
    when any session member speaks, obtaining the audio feature of the speaking session member;
    adjusting, according to the audio feature, audio fed into audio players of session members other than the speaking session member.
  6. An audio processing apparatus, comprising:
    a sampling module configured to sample voice data of each session member of a stereo multi-person session to determine an audio frequency value of each session member;
    a setting module configured to set, according to the audio frequency values, an audio feature used when each session member speaks, wherein the audio feature is the volume weight of each channel when that session member speaks.
  7. The processing apparatus according to claim 6, wherein the setting module comprises:
    a determining unit configured to determine whether there are audio frequency values within a same preset audio frequency range;
    a first setting unit configured to set, when there are audio frequency values within the same preset audio frequency range, a different audio feature for each session member within that range, and to set same or different audio features for the session members when there are no audio frequency values within the same preset audio frequency range.
  8. The processing apparatus according to claim 6, wherein the setting module comprises:
    a sorting unit configured to sort the audio frequency values of the session members;
    a second setting unit configured to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
  9. The processing apparatus according to claim 6, wherein the setting module is configured to set the audio feature for a session member as follows:
    obtaining volume information of each channel of the session member; obtaining one channel volume weight from a plurality of calculated channel volume weights; adjusting the volume information of each channel of the session member according to the channel volume weight.
  10. The processing apparatus according to any one of claims 6 to 9, further comprising:
    an obtaining module configured to obtain, when any session member speaks, the audio feature of the speaking session member;
    an input module configured to adjust, according to the audio feature, audio fed into players of session members other than the speaking session member.
  11. A media server, comprising:
    a collector that samples voice data of each session member of a stereo multi-person session;
    a processor that determines an audio frequency value of each sampled session member and sets, according to the audio frequency values, an audio feature used when each session member speaks, wherein the audio feature is the volume weight of each channel when that session member speaks.
  12. The media server according to claim 11, wherein
    the processor is configured to determine whether there are audio frequency values within a same preset audio frequency range; to set, when there are audio frequency values within the same preset audio frequency range, a different audio feature for each session member within that range; and to set same or different audio features for the session members when there are no audio frequency values within the same preset audio frequency range.
  13. The media server according to claim 11, wherein
    the processor is configured to sort the audio frequency values of the session members and to set different audio features for session members whose audio frequency values are adjacent in the sorted order.
  14. A computer-readable storage medium storing computer-executable instructions for executing the audio processing method according to any one of claims 1 to 5.
PCT/CN2017/082884 2016-11-23 2017-05-03 Audio processing method, apparatus and media server WO2018094968A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611037628.4A CN108109630B (zh) 2016-11-23 2016-11-23 Audio processing method, apparatus and media server
CN201611037628.4 2016-11-23

Publications (1)

Publication Number Publication Date
WO2018094968A1 true WO2018094968A1 (zh) 2018-05-31

Family

ID=62194750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/082884 WO2018094968A1 (zh) 2016-11-23 2017-05-03 Audio processing method, apparatus and media server

Country Status (2)

Country Link
CN (1) CN108109630B (zh)
WO (1) WO2018094968A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633091B * 2020-12-09 2021-11-16 北京博瑞彤芸科技股份有限公司 Method and system for verifying a real meeting
CN115361474A * 2022-08-18 2022-11-18 上海复旦通讯股份有限公司 Method for assisting sound source identification in a conference call

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999041880A1 (en) * 1998-02-12 1999-08-19 Qsound Labs, Inc. Teleconferencing method and apparatus with three-dimensional sound positioning
CN101118746A * 2007-09-17 2008-02-06 腾讯科技(深圳)有限公司 Method based on two-channel audio data calibration, and multi-person voice call system
CN104811318A * 2015-04-15 2015-07-29 南京农业大学 Method for sound-controlled voice communication
CN104898091A * 2015-05-29 2015-09-09 复旦大学 Microphone-array self-calibrating sound source localization system based on an iterative optimization algorithm
CN105049993A * 2014-02-05 2015-11-11 森海塞尔通信公司 Loudspeaker system including volume-control-dependent equalization

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4712231A (en) * 1984-04-06 1987-12-08 Shure Brothers, Inc. Teleconference system
FI109158B (fi) * 2000-06-26 2002-05-31 Nokia Corp Portable device and method for providing the user with information about the operation of the portable device
CN102651178B * 2011-02-28 2015-08-12 中兴通讯股份有限公司 Prenatal education assistance device, application method thereof, and user terminal
JP2013097076A (ja) * 2011-10-29 2013-05-20 Shimon Gomi Foreign-language vocal range self-discovery system
CN102969003A * 2012-11-15 2013-03-13 东莞宇龙通信科技有限公司 Method and device for extracting sound during video capture
CN104363510B * 2014-10-29 2019-04-30 深圳市金立通信设备有限公司 Playback terminal
CN104410379B * 2014-10-29 2019-05-14 深圳市金立通信设备有限公司 Volume adjustment method
CN105141730B * 2015-08-27 2017-11-14 腾讯科技(深圳)有限公司 Volume control method and device
CN105741833B * 2016-03-14 2021-05-11 腾讯科技(深圳)有限公司 Voice communication data processing method and device
CN105791606A * 2016-04-15 2016-07-20 深圳美亚美科技有限公司 Multi-person session control system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999041880A1 (en) * 1998-02-12 1999-08-19 Qsound Labs, Inc. Teleconferencing method and apparatus with three-dimensional sound positioning
CN101118746A * 2007-09-17 2008-02-06 腾讯科技(深圳)有限公司 Method based on two-channel audio data calibration, and multi-person voice call system
CN105049993A * 2014-02-05 2015-11-11 森海塞尔通信公司 Loudspeaker system including volume-control-dependent equalization
CN104811318A * 2015-04-15 2015-07-29 南京农业大学 Method for sound-controlled voice communication
CN104898091A * 2015-05-29 2015-09-09 复旦大学 Microphone-array self-calibrating sound source localization system based on an iterative optimization algorithm

Also Published As

Publication number Publication date
CN108109630A (zh) 2018-06-01
CN108109630B (zh) 2022-01-25

Similar Documents

Publication Publication Date Title
US9084079B2 (en) Selectively formatting media during a group communication session
US8606249B1 (en) Methods and systems for enhancing audio quality during teleconferencing
EP2158752B1 (en) Methods and arrangements for group sound telecommunication
JP4255461B2 (ja) 電話会議用のステレオ・マイクロフォン処理
EP2420048B1 (en) Systems and methods for computer and voice conference audio transmission during conference call via voip device
US20100262419A1 (en) Method of controlling communications between at least two users of a communication system
US7433716B2 (en) Communication apparatus
JP2003500935A (ja) エッジポイントミキシングを有する遠隔会議ブリッジ
CN107333093B (zh) 一种声音处理方法、装置、终端及计算机可读存储介质
US9350869B1 (en) System and methods for selective audio control in conference calls
TW200828867A (en) Communication system
US20210250391A1 (en) System and method for multi-party communication of trending topics
US10362394B2 (en) Personalized audio experience management and architecture for use in group audio communication
WO2018094968A1 (zh) 一种音频的处理方法、装置和媒体服务器
US8412171B2 (en) Voice group sessions over telecommunication networks
US20220392478A1 (en) Speech enhancement techniques that maintain speech of near-field speakers
US20120150542A1 (en) Telephone or other device with speaker-based or location-based sound field processing
Hyrkas et al. Spatialized Audio and Hybrid Video Conferencing: Where Should Voices be Positioned for People in the Room and Remote Headset Users?
JP2017519379A (ja) オブジェクトベースの遠隔会議プロトコル
US20230421620A1 (en) Method and system for handling a teleconference
Aguilera et al. Spatial audio for audioconferencing in mobile devices: Investigating the importance of virtual mobility and private communication and optimizations
US11425258B2 (en) Audio conferencing in a room
US20220394070A1 (en) Multi-conversational social networking
CN116633908A (zh) 传输连接构建方法以及系统
WO2022008075A1 (en) Methods, system and communication device for handling digitally represented speech from users involved in a teleconference

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17874102

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17874102

Country of ref document: EP

Kind code of ref document: A1