WO2010094219A1 - Method and device for processing and reproducing speech signals - Google Patents

Info

Publication number
WO2010094219A1
WO2010094219A1 PCT/CN2010/070491 CN2010070491W WO2010094219A1 WO 2010094219 A1 WO2010094219 A1 WO 2010094219A1 CN 2010070491 W CN2010070491 W CN 2010070491W WO 2010094219 A1 WO2010094219 A1 WO 2010094219A1
Authority
WO
WIPO (PCT)
Prior art keywords
site
largest
frequency band
information
orientation
Prior art date
Application number
PCT/CN2010/070491
Other languages
French (fr)
Chinese (zh)
Inventor
梁丽燕
刘智辉
Original Assignee
华为终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司
Publication of WO2010094219A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/4223 Cameras
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788 Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Definitions

  • The present invention relates to the field of video communication technologies, and in particular to a method and apparatus for processing and playing a voice signal.
  • In a video communication system, each site participating in the conference encodes its local voice signal and image signal and sends them to the MCU (Multipoint Control Unit). The MCU processes the received voice and image signals and sends the processed signals to each site terminal; each site decodes and plays the voice and image signals, thereby realizing video communication.
  • When processing the voice signals, the MCU first calculates the envelope of the decoded voice signal of each site and, by comparing the envelopes, takes the N sites with the largest envelopes as the largest N-party sites. It then mixes the voice signals of the largest N-party sites and sends the mix to the sites in the conference outside the largest N-party sites.
  • The voice signal received by each of the largest N-party sites is the mix of the voice signals of the other N-1 largest sites, excluding the site itself. Therefore, after each site decodes the mix signal it receives, the sites outside the largest N-party sites hear the voices of all of the largest N-party sites, and each of the largest N-party sites hears the voices of the other N-1.
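  • The selection and mixing performed by the MCU can be sketched as follows. This is only an illustrative sketch, not the MCU's actual implementation: the function names are hypothetical, the envelope is assumed to be the peak absolute amplitude of one frame, and mixing is assumed to be a plain sample-wise sum.

```python
from typing import Dict, List

def envelope(samples: List[float]) -> float:
    # Assumed envelope measure: peak absolute amplitude of the frame.
    return max((abs(s) for s in samples), default=0.0)

def select_largest_n(frames: Dict[int, List[float]], n: int) -> List[int]:
    # Rank sites by envelope and keep the N largest.
    ranked = sorted(frames, key=lambda site: envelope(frames[site]), reverse=True)
    return sorted(ranked[:n])

def mix(frames: Dict[int, List[float]], sites: List[int]) -> List[float]:
    # Sample-wise linear sum of the selected sites' frames.
    if not sites:
        return []
    length = len(frames[sites[0]])
    return [sum(frames[s][i] for s in sites) for i in range(length)]

def mixes_for_conference(frames: Dict[int, List[float]], n: int) -> Dict[int, List[float]]:
    # A site inside the largest N receives the mix of the other N-1;
    # a site outside the largest N receives the mix of all N.
    largest = select_largest_n(frames, n)
    return {site: mix(frames, [s for s in largest if s != site]) for site in frames}
```

  • For instance, with three sites and N = 2, the quietest site receives the sum of the two loudest, while each of the two loudest receives only the other one.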
  • Embodiments of the present invention provide a method and apparatus for mixing and playing a voice signal to improve the spatial hearing effect of the video conference.
  • An embodiment of the invention discloses a method for processing a voice signal, comprising: determining, among the largest N-party sites and according to the orientation information set for the sites participating in the conference, the orientation information of the site with the largest energy in each frequency band of the mix signal at each time; and sending the mix signal of the largest N-party sites, together with the orientation information of the site with the largest energy in each frequency band at each time, to the site terminals participating in the conference.
  • An embodiment of the invention further discloses a method for playing a voice signal, comprising: acquiring the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time; obtaining, according to the correspondence between the auditory spatial parameters of the playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time; and adjusting the mix signal using those auditory spatial parameters and playing the adjusted mix signal.
  • An embodiment of the invention further discloses a processing device for a voice signal, comprising: an orientation determining unit, configured to determine, among the largest N-party sites and according to the orientation information set for the sites participating in the conference, the orientation information of the site with the largest energy in each frequency band of the mix signal at each time; and a sending unit, configured to send the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time to the site terminals participating in the conference.
  • An embodiment of the invention further discloses a playback device for a voice signal, comprising: an acquisition unit, configured to acquire the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time; a spatial parameter obtaining unit, configured to obtain, according to the correspondence between the auditory spatial parameters of the playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time; and an adjusting unit, configured to adjust the mix signal using the auditory spatial parameters of the playback device, so that the adjusted mix signal can be played.
  • When the voice signal is processed, orientation information is set in advance for all sites participating in the conference; among the largest N-party sites, the orientation information of the site with the largest energy in each frequency band at each time is determined and sent together with the mix signal of the largest N-party sites.
  • At the playback end, the auditory spatial parameters of each playback device are obtained from that orientation information and used to adjust the mix signal.
  • The auditory space of each mixed site can thus be reconstructed at the receiving site, so that the sound of the largest N-party sites has a spatial, stereoscopic quality during playback. Users can clearly distinguish the sound of each of the largest N-party sites, which strengthens their sense of presence.
  • FIG. 2-a is a schematic diagram of a position of 10 conference venues
  • Figure 2-b is a schematic diagram of the orientation of four sites in a multi-screen
  • Figure 3-a shows the orientation of the four largest 4-party venues
  • Figure 3-b is a schematic diagram of the orientation of four sites in a multi-screen
  • Figure 4 shows the setting method of the orientation when the number of multi-screens is 16, and the number of orientations is 4;
  • Figure 5 is a schematic diagram of processing of a voice signal in the present invention.
  • FIG. 6 is a structural diagram of an apparatus for processing a voice signal according to Embodiment 2 of the present invention
  • FIG. 7 is a flowchart of a method for playing a voice signal according to Embodiment 3 of the present invention
  • FIG. 9 is a structural diagram of a device for playing a voice signal according to Embodiment 4 of the present invention.
  • FIG. 1 is a flowchart of a method for processing a voice signal according to the present invention. The method includes the following steps:
  • Step 101: According to the orientation information set for the sites participating in the conference, determine, among the largest N-party sites, the orientation information of the site with the largest energy in each frequency band of the mix signal at each time.
  • The voice signals of the largest N-party sites are first time-frequency transformed from the time domain into the frequency domain. The energy value in each frequency band at each time is then calculated for each site, which gives the site with the largest energy in each frequency band at each time. Finally, the orientation information of that site is determined according to the orientation information set for the sites participating in the conference.
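  • The per-band comparison in step 101 can be sketched as follows for a single time frame. The source does not specify the transform or the band layout, so this sketch assumes a naive discrete Fourier transform and bands given as half-open ranges of bin indices; all function names are hypothetical.

```python
import cmath
from typing import Dict, List

def dft(frame: List[float]) -> List[complex]:
    # Naive DFT of a real frame, non-negative frequency bins only.
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def band_energies(frame: List[float], band_edges: List[int]) -> List[float]:
    # Energy per band; band b covers bins band_edges[b] .. band_edges[b + 1] - 1.
    spectrum = dft(frame)
    return [sum(abs(spectrum[k]) ** 2 for k in range(lo, hi))
            for lo, hi in zip(band_edges, band_edges[1:])]

def dominant_sites(frames: Dict[int, List[float]], band_edges: List[int]) -> List[int]:
    # For each band, the site (among the largest N) with the largest energy.
    # Ties go to the lowest site number, which is an arbitrary choice here.
    energies = {site: band_energies(f, band_edges) for site, f in frames.items()}
    return [max(sorted(energies), key=lambda s: energies[s][b])
            for b in range(len(band_edges) - 1)]
```

  • Repeating this for every frame gives the site with the largest energy in each frequency band at each time; its orientation information is then looked up as described in the text that follows.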
  • The orientation information of the site with the largest energy among the largest N-party sites, in each frequency band at each time, can be determined by either of two methods.
  • The first method is as follows. The orientations of the sites participating in the conference are set in advance according to the order in which they joined. After the energies of the voice signals of the largest N-party sites are compared in each frequency band to obtain the site with the largest energy in each band at each time, it is determined whether that site is in the multi-screen. If it is, its orientation information is set to its orientation in the multi-screen; if not, its orientation information is set to the preset orientation. For example, suppose ten sites participate in a conference in a videoconferencing system: the first site to join is numbered 1, the second is numbered 2, and so on up to the tenth, numbered 10.
  • The orientation of sites 1-3 is set to the upper left, that of sites 4-6 to the upper right, that of sites 7-8 to the lower left, and that of sites 9-10 to the lower right. Figure 2-a shows the orientations of the ten participating sites.
  • Suppose sites 1-4 are the largest 4-party sites and that, in a certain frequency band at a certain time, site 1 has the largest energy among them. It is then determined whether site 1 is in the multi-screen. If it is, the orientation of site 1 in the multi-screen is used as its orientation information. For example, if site 1 is at the lower right of the multi-screen (Figure 2-b shows the orientations of the four sites in the multi-screen), the orientation information of site 1 is the lower right.
  • If site 1 is not in the multi-screen, its orientation information is obtained from the preset orientations of the sites participating in the conference, so the orientation information of site 1 is the upper left.
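  • The first method, applied to the ten-site example, can be sketched as follows. The function names and the dictionary representation of the multi-screen are assumptions made for illustration only.

```python
from typing import Dict

def preset_orientation(site: int) -> str:
    # Preset orientations from the ten-site example (by joining order).
    if 1 <= site <= 3:
        return "upper left"
    if 4 <= site <= 6:
        return "upper right"
    if 7 <= site <= 8:
        return "lower left"
    if 9 <= site <= 10:
        return "lower right"
    raise ValueError(f"unknown site number: {site}")

def orientation_info(site: int, multi_screen: Dict[int, str]) -> str:
    # multi_screen maps each displayed site to its orientation on screen.
    # A site shown in the multi-screen takes its on-screen orientation;
    # any other site falls back to its preset orientation.
    if site in multi_screen:
        return multi_screen[site]
    return preset_orientation(site)
```

  • With site 1 displayed at the lower right, `orientation_info(1, {1: "lower right"})` gives the lower right; with site 1 not displayed, `orientation_info(1, {})` gives the upper left, matching the example.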
  • The second method is as follows. After the largest N-party sites are determined, their orientations are set in advance according to the order in which they joined, which gives the preset orientation information of the largest N-party sites.
  • After the site with the largest energy in each frequency band at each time is obtained, it is determined whether that site is in the multi-screen. If it is, its orientation information is set to its orientation in the multi-screen; if not, its orientation information is set to the preset orientation information of the largest N-party sites. Take the video communication among the ten sites above as an example.
  • Sites 1-4 are the largest 4-party sites. According to the order in which sites 1-4 joined, the orientation of site 1 is set to the upper left, that of site 2 to the upper right, that of site 3 to the lower left, and that of site 4 to the lower right. Figure 3-a shows the orientations of the four largest 4-party sites.
  • Suppose site 1 has the largest energy. If site 1 is in the multi-screen, its orientation in the multi-screen is used as its orientation information. For example, if site 1 is at the lower right of the multi-screen (Figure 3-b shows the orientations of the four sites in the multi-screen), the orientation information of site 1 is the lower right.
  • If site 1 is not in the multi-screen, its orientation information is obtained from the preset orientations of the largest 4-party sites, so the orientation information of site 1 is the upper left.
  • When the orientation of the site with the largest energy changes in the multi-screen, its orientation information changes correspondingly.
  • For example, sites 1-4 are the largest 4-party sites, with the orientation of site 1 set to the upper left, that of site 2 to the upper right, that of site 3 to the lower left, and that of site 4 to the lower right.
  • Suppose site 1 has the largest energy among the largest 4-party sites and is in the multi-screen, so its orientation information is its orientation in the multi-screen. If the orientation of site 1 in the multi-screen is the upper left, its orientation information is the upper left.
  • If the orientation of site 1 in the multi-screen then changes to the upper right, its orientation information changes accordingly to the upper right.
  • The method for setting the orientation information of the site with the largest energy among the largest N-party sites is not limited, and the orientation information is not limited to the four orientations upper left, upper right, lower left and lower right.
  • When a site in the multi-screen does not correspond exactly to any one of the orientations, the closest orientation is taken for that site.
  • Figure 4 shows a setting method in which the multi-screen has 16 pictures and there are 4 orientations; the orientation of site 7 in the figure is set to the upper right according to this approximation principle.
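  • One way to apply the approximation principle is sketched below. Figure 4 is not reproduced here, so the layout is an assumption: the 16 pictures are taken to form a 4 x 4 grid numbered row by row from the top left, and each picture is assigned the quadrant that contains its centre. The function name is hypothetical.

```python
def nearest_orientation(cell: int, rows: int = 4, cols: int = 4) -> str:
    # cell is the 1-based picture number, counted row by row from the top left.
    if not 1 <= cell <= rows * cols:
        raise ValueError(f"picture number out of range: {cell}")
    r, c = divmod(cell - 1, cols)
    # Compare the centre of the picture with the centre of the screen.
    vertical = "upper" if (r + 0.5) < rows / 2 else "lower"
    horizontal = "left" if (c + 0.5) < cols / 2 else "right"
    return f"{vertical} {horizontal}"
```

  • Under this assumed numbering, picture 7 lies in the second row and third column, so it maps to the upper right, which agrees with the orientation stated for site 7. For grids with an odd number of rows or columns, a picture centred exactly on the midline is assigned to the lower or right side here, which is an arbitrary choice.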
  • Step 102: Send the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time.
  • The mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time may both be encoded, giving a mix code stream and an orientation information code stream, which are then sent to the site terminals participating in the conference.
  • Alternatively, only the mix signal of the largest N-party sites may be encoded to obtain a mix code stream, which is then sent to the site terminals participating in the conference together with the orientation information of the site with the largest energy in each frequency band at each time.
  • If the destination site itself belongs to the largest N-party sites, the mix signal sent to it is the mix of the other N-1 largest sites, excluding that site.
  • FIG. 5 is a schematic diagram of processing of a voice signal according to the present invention.
  • FIG. 6 is a structural diagram of a processing apparatus for a voice signal according to the present invention.
  • the apparatus includes an orientation determining unit 601 and a transmitting unit 602. The internal structure and connection relationship will be further described below in conjunction with the working principle of the device.
  • The orientation determining unit 601 is configured to determine, according to the orientation information set for the sites participating in the conference, the orientation information of the site with the largest energy in each frequency band at each time among the largest N-party sites;
  • The sending unit 602 is configured to send the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time.
  • The orientation determining unit 601 may include: a first orientation presetting unit 603, configured to preset orientations for the sites participating in the conference according to the order in which they joined, to obtain preset orientation information; a comparing unit 604, configured to compare the energy values of the voice signals of the largest N-party sites in each frequency band at each time, to obtain the site with the largest energy in each frequency band at each time; a first setting unit 605, configured to set the orientation information of the site with the largest energy according to the preset orientation information when that site is not in the multi-screen; and a second setting unit 606, configured to set the orientation information of the site with the largest energy according to the multi-screen orientation information when that site is in the multi-screen.
  • Alternatively, the orientation determining unit 601 may include: a second orientation presetting unit, configured to preset orientations for the largest N-party sites according to the order in which they joined, to obtain preset orientation information of the largest N-party sites; a comparing unit, configured to compare the energy values of the voice signals of the largest N-party sites in each frequency band at each time, to obtain the site with the largest energy in each frequency band at each time; a third setting unit, configured to set the orientation information of the site with the largest energy according to the preset orientation information when that site is not in the multi-screen; and a fourth setting unit, configured to set the orientation information of the site with the largest energy according to the multi-screen orientation information when that site is in the multi-screen.
  • The sending unit 602 may include a first sending unit 607 and/or a second sending unit 608. The first sending unit 607 is configured to encode the mix signal and the orientation information of the site with the largest energy in each frequency band at each time, obtaining a mix code stream and an orientation information code stream, and to send both code streams to the site terminals participating in the conference;
  • The second sending unit 608 is configured to encode the mix signal to obtain a mix code stream, and to send the mix code stream together with the orientation information of the site with the largest energy in each frequency band at each time to the site terminals participating in the conference.
  • Embodiment 3
  • Referring to FIG. 7, FIG. 7 is a flowchart of a method for playing a voice signal according to the present invention. The method includes the following steps:
  • Step 701: Acquire the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time.
  • The orientation information of the site with the largest energy among the largest N-party sites is determined from the orientation information of the largest N-party sites according to the site number.
  • Step 702: According to the correspondence between the auditory spatial parameters of the playback device and orientation information, obtain the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time.
  • The auditory spatial parameters of the playback device include a level parameter and a delay parameter.
  • Step 702 may be implemented as follows. A level parameter and a delay parameter corresponding to each orientation are first set for the playback device. After the orientation information of the site with the largest energy in each frequency band at each time is acquired in step 701, the correspondence between the orientation information and the level and delay parameters set for the playback device is queried, which gives the level parameter and delay parameter of the playback device corresponding to the orientation information of the site with the largest energy in each frequency band at each time.
  • For example, if the acquired orientation information of the site with the largest energy in a certain frequency band is the upper left, the following level and delay parameters are obtained for two speakers: 1) the level parameter of speaker 1 for the upper left; 2) the level parameter of speaker 2 for the upper left; 3) the delay parameter of speaker 1 for the upper left; 4) the delay parameter of speaker 2 for the upper left.
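  • The query in step 702 amounts to a table lookup, as the sketch below illustrates. The numeric values are placeholders invented for illustration (the source gives no values), and the table layout, speaker names and function name are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical preset table: orientation -> speaker -> (level gain, delay in ms).
SPATIAL_TABLE: Dict[str, Dict[str, Tuple[float, float]]] = {
    "upper left": {"speaker 1": (1.0, 0.0), "speaker 2": (0.5, 0.6)},
    "upper right": {"speaker 1": (0.5, 0.6), "speaker 2": (1.0, 0.0)},
    "lower left": {"speaker 1": (0.8, 0.0), "speaker 2": (0.4, 0.6)},
    "lower right": {"speaker 1": (0.4, 0.6), "speaker 2": (0.8, 0.0)},
}

def query_parameters(band_orientations: List[str]) -> List[Dict[str, Tuple[float, float]]]:
    # One entry per frequency band: the level and delay parameters of every
    # speaker for the orientation of the site with the largest energy in that band.
    return [SPATIAL_TABLE[orientation] for orientation in band_orientations]
```

  • For a band whose loudest site is at the upper left, the lookup returns the four values listed above: the level and delay of speaker 1 and the level and delay of speaker 2.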
  • Step 703: Adjust the mix signal using the auditory spatial parameters of the playback device, and play the adjusted mix signal.
  • The mix signal is first time-frequency transformed from the time domain into the frequency domain. In each frequency band, the level and delay of the frequency-domain mix signal are then adjusted using the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in that band.
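  • The per-band adjustment in step 703 can be sketched for one speaker and one frame as follows. This is a sketch under stated assumptions rather than the method of the invention: it uses a naive DFT, treats the level parameter as a gain and the delay parameter as a whole number of samples applied as a phase shift (a circular delay within the frame), and the table values and all names are hypothetical.

```python
import cmath
from typing import List

# Hypothetical table: orientation -> speaker -> (level gain, delay in samples).
PARAMS = {
    "upper left": {"speaker 1": (1.0, 0), "speaker 2": (0.5, 2)},
    "upper right": {"speaker 1": (0.5, 2), "speaker 2": (1.0, 0)},
}

def dft(x: List[float]) -> List[complex]:
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec: List[complex]) -> List[float]:
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def band_of(bin_index: int, band_edges: List[int]) -> int:
    # Band b covers bins band_edges[b] .. band_edges[b + 1] - 1.
    for b, (lo, hi) in enumerate(zip(band_edges, band_edges[1:])):
        if lo <= bin_index < hi:
            return b
    raise ValueError(f"bin {bin_index} is outside all bands")

def adjust_mix(frame: List[float], band_edges: List[int],
               band_orientations: List[str], speaker: str) -> List[float]:
    n = len(frame)
    spectrum = dft(frame)
    for k in range(n):
        # Map bins above n/2 to negative frequencies so the output stays real.
        signed_k = k if k <= n // 2 else k - n
        orientation = band_orientations[band_of(abs(signed_k), band_edges)]
        gain, delay = PARAMS[orientation][speaker]
        spectrum[k] *= gain * cmath.exp(-2j * cmath.pi * signed_k * delay / n)
    return idft(spectrum)
```

  • Calling `adjust_mix` once per speaker yields one adjusted signal per speaker. A practical implementation would use an FFT with overlapping windows; the naive transform is used here only to keep the sketch self-contained.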
  • Refer to Figure 8 for the adjustment performed with the auditory spatial parameters of the playback device in each frequency band.
  • FIG. 9 is a structural diagram of a playback apparatus for a voice signal according to the present invention.
  • the apparatus includes an acquisition unit 901, a spatial parameter obtaining unit 902, and an adjustment unit 903.
  • the internal structure and connection relationship will be further described below in conjunction with the working principle of the device.
  • The obtaining unit 901 is configured to acquire the mix signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time;
  • The spatial parameter obtaining unit 902 is configured to obtain, according to the correspondence between the auditory spatial parameters of the playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time;
  • The adjusting unit 903 is configured to adjust the mix signal using the auditory spatial parameters of the playback device, so that the adjusted mix signal can be played.
  • the obtaining unit 901 may include:
  • A first receiving unit 904, configured to receive a mix code stream and an orientation information code stream;
  • A first decoding unit 905, configured to decode the mix code stream and the orientation information code stream to obtain the mix signal and the orientation information of the site with the largest energy in each frequency band at each time.
  • The first receiving unit 904 may be replaced by a second receiving unit, configured to receive the mix code stream and the orientation information of the site with the largest energy in each frequency band at each time; the first decoding unit 905 may be replaced by a second decoding unit, configured to decode the mix code stream to obtain the mix signal.
  • The obtaining unit 901 may also include the first receiving unit and first decoding unit together with the second receiving unit and second decoding unit.
  • the spatial parameter obtaining unit 902 may include:
  • the auditory spatial parameter preset unit 906 is configured to preset a level parameter and a delay parameter corresponding to the orientation information for the playback device;
  • The query unit 907 is configured to query the correspondence between the orientation information and the level and delay parameters, and to obtain the level parameter and delay parameter corresponding to the orientation information of the site with the largest energy in each frequency band at each time.
  • Fig. 9 does not define a complete structural diagram of a playback apparatus for a voice signal, but merely highlights the part involved in the inventive aspect of the present invention. It will be clear to those skilled in the art that the playback device of the voice signal should also include a player, and the adjusted mix signal output by the adjustment unit 903 is used as an input signal of the player. The player plays the adjusted mix signal.
  • Orientation information is set in advance for all sites participating in the conference; among the largest N-party sites, the orientation information of the site with the largest energy in each frequency band is determined and transmitted along with the mix signal.
  • At the playback end, the auditory spatial parameters of each playback device are obtained from the orientation information and used to adjust the mix signal, and the adjusted mix signal is played.
  • The auditory space of the sound sources can thus be reconstructed at the site, so that the sound of the largest N-party sites has a spatial, stereoscopic quality during playback. Users can clearly distinguish the sound of each of the largest N-party sites, which further strengthens their sense of presence.
  • In addition, the orientation information of the site with the largest energy changes with the orientation of that site in the multi-screen, so that when the voice signal is played the orientation of the sound source is consistent with the orientation of the image, further strengthening the user's sense of presence.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A method and device for processing and reproducing speech signals are provided, wherein said processing method includes: according to location information set for conference halls participating in the conference, determining the location information of the conference hall having the maximum energy at each time instant and in each frequency band among the maximum N-party conference halls; transmitting the mixed speech signals of the maximum N-party conference halls and the location information of the conference hall having the maximum energy at each time instant and in each frequency band to terminals in the conference halls participating in the conference. The reproducing method includes: obtaining the mixed speech signals and the location information of the conference hall having the maximum energy in each frequency band; according to the correspondences between auditory spatial parameters of a reproduction device and location information, obtaining the auditory spatial parameters of the reproduction device corresponding to the location information of the conference hall having the maximum energy in each frequency band; adjusting said mixed speech signals by using the auditory spatial parameters of the reproduction device; and reproducing the mixed speech signals after being adjusted. According to the embodiment of present invention, the spatially auditory effect of the video conference is improved.

Description

一种语音信号的处理、 播放方法和装置  Processing and playing method and device for voice signal
本申请要求于 2009 年 2 月 19 日提交中国专利局、 申请号为 200910005681.X, 发明名称为"一种语音信号的处理、 播放方法和装置"的中国 专利申请的优先权, 其全部内容通过引用结合在本申请中。  This application claims priority to Chinese Patent Application No. 200910005681.X, filed on February 19, 2009, entitled "Processing, Playback, and Device for Voice Signals", the entire contents of which are hereby incorporated by reference. The citations are incorporated herein by reference.
技术领域 Technical field
本发明涉及视频通信技术领域, 特别是涉及一种语音信号的处理、播放方 法和装置。  The present invention relates to the field of video communication technologies, and in particular, to a method and apparatus for processing and playing a voice signal.
背景技术 Background technique
In a video communication system, each site participating in a conference encodes its local voice and image signals and sends them to a multipoint control unit (MCU). The MCU processes the received voice and image signals and sends the processed signals to the terminal of each site, where they are decoded and played, thereby implementing video communication. When processing the voice signals, the MCU first calculates the envelope of the decoded voice signal of each site, compares the envelopes, and takes the N sites with the largest envelopes as the largest N-party sites. The MCU then mixes the voice signals of the largest N-party sites and sends the mix to every site in the conference other than the largest N-party sites, while each of the largest N-party sites receives a mix of the voice signals of the other N-1 largest sites, excluding its own. Consequently, after each site decodes the mixed signal it receives, the sites outside the largest N-party sites hear the voices of all the largest N-party sites, and each of the largest N-party sites hears the voices of the other N-1 largest sites.
However, the inventors found during their research that, in the prior art, when the MCU mixes the voice signals of the largest N-party sites, it merely superimposes them linearly. When several of the largest N-party sites speak at the same time, the output device at each participating site plays voices that are mixed together and overlap, so conference participants cannot clearly hear each of the largest N-party sites, which degrades the audiovisual quality of the video conference.
Summary of the Invention
Embodiments of the present invention provide a method and an apparatus for mixing and playing a voice signal, so as to improve the spatial auditory effect of a video conference.
An embodiment of the present invention discloses a method for processing a voice signal, including: determining, among the largest N-party sites and according to orientation information set for the sites participating in a conference, the orientation information of the site with the largest energy in each frequency band of the mixed signal at each time instant; and sending the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant to the terminals of the sites participating in the conference.
An embodiment of the present invention further discloses a method for playing a voice signal, including: acquiring the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant; obtaining, according to a correspondence between auditory spatial parameters of a playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time instant; and adjusting the mixed signal by using the auditory spatial parameters of the playback device, and playing the adjusted mixed signal.
An embodiment of the present invention further discloses an apparatus for processing a voice signal, including: an orientation determining unit, configured to determine, among the largest N-party sites and according to orientation information set for the sites participating in a conference, the orientation information of the site with the largest energy in each frequency band of the mixed signal at each time instant; and a sending unit, configured to send the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant to the terminals of the sites participating in the conference.
An embodiment of the present invention further discloses an apparatus for playing a voice signal, including: an acquiring unit, configured to acquire the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant; a spatial parameter obtaining unit, configured to obtain, according to a correspondence between auditory spatial parameters of a playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time instant; and an adjusting unit, configured to adjust the mixed signal by using the auditory spatial parameters of the playback device, so that the adjusted mixed signal can be played.
As can be seen from the foregoing embodiments, when a voice signal is processed, orientation information is set in advance for all sites participating in the conference, the orientation information of the site with the largest energy in each frequency band at each time instant is determined among the largest N-party sites, and that orientation information is sent together with the mixed signal of the largest N-party sites. When the voice signal is played, the spatial parameters of each playback device at the playing end are obtained from the received orientation information and the correspondence between orientation information and playback device spatial parameters, and the mixed signal is adjusted by using those parameters. Playing the adjusted mixed signal reconstructs the auditory space of the mixed sites in the local site, so that the voices of the largest N-party sites are perceived as spatially separated. Users can clearly hear each of the largest N-party sites, which also strengthens their sense of presence.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for processing a voice signal according to Embodiment 1 of the present invention;
FIG. 2-a is a schematic diagram of the orientations of ten participating sites;
FIG. 2-b is a schematic diagram of the orientations of four sites in a multi-picture;
FIG. 3-a is a schematic diagram of the orientations of the four largest 4-party sites;
FIG. 3-b is a schematic diagram of the orientations of four sites in a multi-picture;
FIG. 4 shows how orientations are set when the multi-picture contains 16 pictures and there are 4 orientations;
FIG. 5 is a schematic diagram of voice signal processing according to the present invention;
FIG. 6 is a structural diagram of an apparatus for processing a voice signal according to Embodiment 2 of the present invention;
FIG. 7 is a flowchart of a method for playing a voice signal according to Embodiment 3 of the present invention;
FIG. 8 is a schematic diagram of adjusting the auditory spatial parameters of a playback device in each frequency band according to the present invention;
FIG. 9 is a structural diagram of an apparatus for playing a voice signal according to Embodiment 4 of the present invention.
Detailed Description
To make the foregoing objectives, features, and advantages of the present invention clearer and easier to understand, the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
Referring to FIG. 1, which is a flowchart of a method for processing a voice signal according to the present invention, the method includes the following steps:
Step 101: Among the largest N-party sites, determine, according to the orientation information set for the sites participating in the conference, the orientation information of the site with the largest energy in each frequency band of the mixed signal at each time instant.
In this step, the voice signals of the largest N-party sites are first subjected to a time-frequency transform, which converts them from the time domain to the frequency domain. The energy value in each frequency band at each time instant is then calculated to find the site with the largest energy in each frequency band at each time instant. Finally, the orientation information of that site is determined according to the orientation information set for the sites participating in the conference. The orientation information of the site with the largest energy among the largest N-party sites, in each frequency band at each time instant, can be determined by either of two methods.
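Before turning to the two methods, the energy comparison described above can be sketched as follows. This is only an illustrative sketch: the source does not name a specific transform or band layout, so the naive DFT, the equal-width bands, and the names `band_energies` and `dominant_site_per_band` are assumptions made for the example.

```python
import cmath
import math

def band_energies(frame, num_bands):
    """Return the energy in each of num_bands equal-width bands of one frame."""
    n = len(frame)
    half = n // 2
    # Naive DFT over the non-negative frequency bins
    # (a stand-in for whatever time-frequency transform is actually used).
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(half)
    ]
    width = half // num_bands
    return [
        sum(abs(c) ** 2 for c in spectrum[b * width:(b + 1) * width])
        for b in range(num_bands)
    ]

def dominant_site_per_band(frames_by_site, num_bands):
    """For each band, return the site whose frame has the largest energy."""
    energies = {site: band_energies(f, num_bands)
                for site, f in frames_by_site.items()}
    return [
        max(energies, key=lambda site: energies[site][b])
        for b in range(num_bands)
    ]

# Two hypothetical sites, one 32-sample frame each:
# site 1 carries a low tone, site 2 a high tone.
n = 32
frames = {
    1: [math.sin(2 * math.pi * 2 * t / n) for t in range(n)],
    2: [math.sin(2 * math.pi * 12 * t / n) for t in range(n)],
}
dominant = dominant_site_per_band(frames, num_bands=4)
```

In this toy input, site 1 dominates the lowest band and site 2 the highest; the middle bands contain only numerical noise, so their winner is not meaningful.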
The first method is as follows. Orientations are set in advance for the participating sites according to the order in which they joined the conference, which yields the preset orientation information. After the site with the largest energy among the largest N-party sites has been found for each frequency band at each time instant by comparing the per-band energy values of their voice signals, it is determined whether that site is shown in the multi-picture. If it is, its orientation information is set to its orientation in the multi-picture; if not, its orientation information is set to the preset orientation information.
For example, suppose ten sites participate in a conference in a video system. The first site to join is numbered 1, the second is numbered 2, and so on up to the tenth, which is numbered 10. According to the joining order, sites 1-3 are assigned the upper-left orientation, sites 4-6 the upper-right, sites 7-8 the lower-left, and sites 9-10 the lower-right; see FIG. 2-a, a schematic diagram of the orientations of the ten participating sites. Sites 1-4 are the largest 4-party sites, and in a certain frequency band at a certain time instant site 1 has the largest energy among them. It is then determined whether site 1 is in the multi-picture. If it is, the orientation of site 1 in the multi-picture is used as its orientation information; for example, if site 1 is at the lower right of the multi-picture (see FIG. 2-b, a schematic diagram of the orientations of four sites in the multi-picture), the orientation information of site 1 is lower-right. If site 1 is not in the multi-picture, the orientations preset for the participating sites give upper-left as the orientation information of site 1.
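The lookup in this first method can be sketched as a small function. The preset table follows the ten-site example above; the function name `resolve_orientation`, the orientation strings, and the layout entries for sites 2-4 are hypothetical and chosen only for illustration.

```python
# Preset orientations by joining order, as in the ten-site example.
PRESET_ORIENTATION = {
    **{site: "upper-left" for site in (1, 2, 3)},
    **{site: "upper-right" for site in (4, 5, 6)},
    **{site: "lower-left" for site in (7, 8)},
    **{site: "lower-right" for site in (9, 10)},
}

def resolve_orientation(site, multi_picture_layout, preset=PRESET_ORIENTATION):
    """Prefer the site's position in the multi-picture; else use the preset."""
    if site in multi_picture_layout:
        return multi_picture_layout[site]
    return preset[site]

# Hypothetical multi-picture layout: site 1 is shown at the lower right.
layout = {1: "lower-right", 2: "upper-left", 3: "upper-right", 4: "lower-left"}

in_picture = resolve_orientation(1, layout)   # site 1 is in the multi-picture
not_in_picture = resolve_orientation(1, {})   # site 1 is not in the multi-picture
```

Because the layout is consulted on every call, a change in the multi-picture layout is reflected in the returned orientation the next time the site is looked up.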
The second method is as follows. After the largest N-party sites have been determined, orientations are set in advance for those sites according to the order in which they joined, which yields the preset orientation information of the largest N-party sites. After the site with the largest energy among the largest N-party sites has been found for each frequency band at each time instant by comparing the per-band energy values of their voice signals at each time instant, it is determined whether that site is shown in the multi-picture. If it is, its orientation information is set to its orientation in the multi-picture; if not, its orientation information is set to the preset orientation information of the largest N-party sites.
Take the video conference among the ten sites above as an example, where sites 1-4 are the largest 4-party sites. According to the joining order of sites 1-4, site 1 is assigned the upper-left orientation, site 2 the upper-right, site 3 the lower-left, and site 4 the lower-right; see FIG. 3-a, a schematic diagram of the orientations of the four largest 4-party sites. When the energy comparison shows that site 1 has the largest energy among the largest 4-party sites in a certain frequency band at a certain time instant, it is determined whether site 1 is in the multi-picture. If it is, the orientation of site 1 in the multi-picture is used as its orientation information; for example, if site 1 is at the lower right of the multi-picture (see FIG. 3-b, a schematic diagram of the orientations of four sites in the multi-picture), the orientation information of site 1 is lower-right. If site 1 is not in the multi-picture, the orientations preset for the largest 4-party sites give upper-left as the orientation information of site 1.
In both methods, when the orientation of the site with the largest energy changes in the multi-picture, the orientation information of that site changes accordingly.
Again take the video conference among the ten sites as an example, where sites 1-4 are the largest 4-party sites and, according to their joining order, site 1 is assigned the upper-left orientation, site 2 the upper-right, site 3 the lower-left, and site 4 the lower-right. Suppose that, in a certain frequency band at a certain time instant, site 1 both has the largest energy among the largest 4-party sites and is shown in the multi-picture. The orientation information of site 1 is then its orientation in the multi-picture. If site 1 is at the upper left of the multi-picture, its orientation information is upper-left. If site 1 is then switched to the upper right of the multi-picture while still having the largest energy among the largest 4-party sites, its orientation information changes to upper-right accordingly.
It should be noted that this embodiment does not limit how the orientation information of the site with the largest energy among the largest N-party sites is set, nor is the orientation information limited to the four orientations of upper-left, upper-right, lower-left, and lower-right.
When the number of pictures in the multi-picture is greater than the number of orientations, a site in the multi-picture may not correspond exactly to any single orientation. For example, if the multi-picture contains 16 pictures and there are 4 orientations, the closest orientation is chosen for each site in the multi-picture. See FIG. 4, which shows how orientations are set when the multi-picture contains 16 pictures and there are 4 orientations; by this approximation principle, site 7 in the figure is assigned the upper-right orientation.
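One possible reading of this approximation principle is a quadrant mapping, sketched below. It assumes the 16 pictures form a 4x4 grid numbered row by row from the top left; FIG. 4 is not reproduced here, so that numbering and the function name `nearest_orientation` are assumptions.

```python
def nearest_orientation(picture_index, rows=4, cols=4):
    """Map a 1-based, row-major picture index to the closest of four quadrants."""
    row, col = divmod(picture_index - 1, cols)
    vertical = "upper" if row < rows / 2 else "lower"
    horizontal = "left" if col < cols / 2 else "right"
    return f"{vertical}-{horizontal}"

site_7 = nearest_orientation(7)
```

Under this assumed numbering, picture 7 sits in the second row and third column, which falls in the upper-right quadrant, in line with the example given for site 7.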
Step 102: Send the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant.
In step 102, the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant may both be encoded first, yielding a mixed-audio bitstream and an orientation-information bitstream, which are then sent to the terminals of the sites participating in the conference. Alternatively, only the mixed signal of the largest N-party sites may be encoded to obtain a mixed-audio bitstream, which is then sent to the terminals together with the orientation information of the site with the largest energy in each frequency band at each time instant. For example, if the destination site is one of the largest N-party sites, the mixed signal sent to it is the mix of the other N-1 largest sites, excluding the destination site itself.
When the orientation information of the site with the largest energy is encoded, a coding scheme different from that of the mixed signal is used. For example, when the mixed signal is encoded according to the conventional G.722 coding protocol, the orientation information of the site with the largest energy among the largest N-party sites may be encoded using Huffman coding. See FIG. 5, a schematic diagram of voice signal processing according to the present invention.
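As a rough illustration of Huffman-coding the orientation information, the sketch below builds a prefix code from the per-band orientation symbols of one frame. The source only states that Huffman coding may be used; the symbol alphabet, the per-frame code construction, and the function names are assumptions for this example.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, code):
    """Concatenate the code words of all symbols."""
    return "".join(code[s] for s in symbols)

def decode(bits, code):
    """Recover the symbol sequence from a bit string of a prefix code."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out

# Orientation of the dominant site in each band of one frame (made-up values).
orientations = ["upper-left", "upper-left", "upper-right", "upper-left",
                "lower-left", "upper-left", "upper-right", "lower-right"]
code = huffman_code(orientations)
bits = encode(orientations, code)
```

With four possible orientations a fixed-length code would need 2 bits per band (16 bits here); because one orientation dominates in this made-up frame, the Huffman code needs fewer. A real system would also have to share the code table with the receiver, which this sketch leaves out.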
Another way to send the orientation information of the site with the largest energy in each frequency band at each time instant is as follows:
The site number of the site with the largest energy among the largest N-party sites is sent together with the orientation information of the largest N-party sites, so that the receiving end can use the site number to look up the orientation information of the site with the largest energy in the orientation information of the largest N-party sites.
Embodiment 2
Corresponding to the foregoing method for processing a voice signal, an embodiment of the present invention further provides an apparatus for processing a voice signal. Referring to FIG. 6, which is a structural diagram of an apparatus for processing a voice signal according to the present invention, the apparatus includes an orientation determining unit 601 and a sending unit 602. The internal structure and connections of the apparatus are described below with reference to its working principle.
The orientation determining unit 601 is configured to determine, among the largest N-party sites and according to the orientation information set for the sites participating in the conference, the orientation information of the site with the largest energy in each frequency band at each time instant.
The sending unit 602 is configured to send the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant.
The orientation determining unit 601 may include: a first orientation presetting unit 603, configured to preset orientations for the sites participating in the conference in turn according to their joining order, to obtain preset orientation information; a comparing unit 604, configured to compare the energy values of the voice signals of the largest N-party sites in each frequency band, to obtain the site with the largest energy in each frequency band at each time instant; a first setting unit 605, configured to set the orientation information of the site with the largest energy according to the preset orientation information when that site is not in the multi-picture; and a second setting unit 606, configured to set the orientation information of the site with the largest energy according to the multi-picture orientation information when that site is in the multi-picture.
Alternatively, the orientation determining unit 601 may include: a second orientation presetting unit, configured to preset orientations for the largest N-party sites in turn according to their joining order, to obtain preset orientation information of the largest N-party sites; a comparing unit, configured to compare the energy values of the voice signals of the largest N-party sites in each frequency band at each time instant, to obtain the site with the largest energy in each frequency band at each time instant; a third setting unit, configured to set the orientation information of the site with the largest energy according to the preset orientation information when that site is not in the multi-picture; and a fourth setting unit, configured to set the orientation information of the site with the largest energy according to the multi-picture orientation information when that site is in the multi-picture.
The sending unit 602 may include a first sending unit 607 and/or a second sending unit 608.
The first sending unit 607 is configured to encode the mixed signal and the orientation information of the site with the largest energy in each frequency band at each time instant, to obtain a mixed-audio bitstream and an orientation-information bitstream respectively, and to send both bitstreams to the terminals of the sites participating in the conference.
The second sending unit 608 is configured to encode the mixed signal to obtain a mixed-audio bitstream, and to send the mixed-audio bitstream and the orientation information of the site with the largest energy in each frequency band at each time instant to the terminals of the sites participating in the conference.
Embodiment 3
Referring to FIG. 7, which is a flowchart of a method for playing a voice signal according to the present invention, the method includes the following steps:
Step 701: Acquire the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant.
In this step, if what is received is the site number of the site with the largest energy among the largest N-party sites together with the orientation information of the largest N-party sites, the site number is first used to look up the orientation information of the site with the largest energy in the orientation information of the largest N-party sites.
In this step, when the received data is a mixed-audio bitstream and an orientation-information bitstream, both bitstreams are decoded to obtain the mixed signal and the orientation information of the site with the largest energy in each frequency band at each time instant. When the received data is a mixed-audio bitstream and the orientation information of the site with the largest energy in each frequency band at each time instant, only the mixed-audio bitstream is decoded to obtain the mixed signal. In either case, the result is the mixed signal and the orientation information of the site with the largest energy in each frequency band.
Step 702: Obtain, according to the correspondence between the auditory spatial parameters of the playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time instant.
In this step, the auditory spatial parameters of the playback device include a level parameter and a delay parameter. Step 702 may be implemented as follows: level and delay parameters corresponding to each item of orientation information are preset for the playback device. After the orientation information of the site with the largest energy in each frequency band at each time instant has been acquired in step 701, the preset correspondence between orientation information and the level and delay parameters is queried to obtain the level and delay parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time instant.
For example, suppose two loudspeakers serve as the playback devices in a site and the acquired orientation information of the site with the largest energy in a certain frequency band is upper-left. The level and delay parameters obtained for the two loudspeakers are then: 1) the upper-left level parameter of loudspeaker 1; 2) the upper-left level parameter of loudspeaker 2; 3) the upper-left delay parameter of loudspeaker 1; 4) the upper-left delay parameter of loudspeaker 2.
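The preset correspondence for the two-loudspeaker example can be pictured as a lookup table. The sketch below is illustrative only: the numeric level and delay values are made up, and the names `SpatialParams`, `SPATIAL_TABLE`, and `params_per_band` do not come from the source.

```python
from typing import NamedTuple

class SpatialParams(NamedTuple):
    level: float      # linear gain applied to the loudspeaker signal
    delay_ms: float   # delay applied to the loudspeaker signal, in milliseconds

# Hypothetical presets: orientation -> parameters for (loudspeaker 1, loudspeaker 2),
# assuming loudspeaker 1 is on the listener's left and loudspeaker 2 on the right.
SPATIAL_TABLE = {
    "upper-left":  (SpatialParams(1.0, 0.0), SpatialParams(0.5, 0.6)),
    "upper-right": (SpatialParams(0.5, 0.6), SpatialParams(1.0, 0.0)),
    "lower-left":  (SpatialParams(0.8, 0.0), SpatialParams(0.4, 0.6)),
    "lower-right": (SpatialParams(0.4, 0.6), SpatialParams(0.8, 0.0)),
}

def params_per_band(band_orientations, table=SPATIAL_TABLE):
    """Look up the per-loudspeaker parameters for each band's dominant orientation."""
    return [table[orientation] for orientation in band_orientations]

# Band 0 is dominated by an upper-left site, band 1 by a lower-right site.
per_band = params_per_band(["upper-left", "lower-right"])
```

Each entry of `per_band` holds the four values listed in the example: a level and a delay for loudspeaker 1 and a level and a delay for loudspeaker 2.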
Step 703: Adjust the mixed signal by using the auditory spatial parameters of the playback device, so that the adjusted mixed signal can be played.
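As a rough illustration of the adjustment in step 703, the sketch below applies a gain and a delay to each frequency band of one frame for a single loudspeaker. It assumes a naive DFT as the transform, equal-width bands, and a delay expressed in samples and applied as a phase shift (which is circular within the frame); none of these choices is specified by the source, and `adjust_frame` is a hypothetical name.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def adjust_frame(frame, band_params, num_bands):
    """Apply a per-band (gain, delay_in_samples) pair to one frame."""
    n = len(frame)
    spectrum = dft(frame)
    half = n // 2
    width = half // num_bands
    for k in range(n):
        # Signed bin index keeps positive and negative frequencies consistent.
        signed_k = k if k <= half else k - n
        band = min(abs(signed_k) // width, num_bands - 1)
        gain, delay = band_params[band]
        spectrum[k] *= gain * cmath.exp(-2j * math.pi * signed_k * delay / n)
    return idft(spectrum)

# A unit impulse with the same parameters in all four bands:
# gain 0.5 and a delay of 3 samples.
impulse = [1.0] + [0.0] * 15
shifted = adjust_frame(impulse, [(0.5, 3)] * 4, num_bands=4)
```

With identical parameters in every band the impulse is simply scaled and moved three samples later. In the scheme described here the parameters would differ per band, depending on which site dominates that band, and the function would be run once per loudspeaker.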
Specifically, the mixed signal is first subjected to a time-frequency transform, which converts it from the time domain to the frequency domain. Once the auditory spatial parameters of the playback device corresponding to the orientation information of the site with the largest energy in each frequency band have been obtained, the level and delay of the frequency-domain mixed signal are adjusted in each frequency band by using those parameters. See FIG. 8, a schematic diagram of adjusting the auditory spatial parameters of the playback device in each frequency band. After the mixed signal has been adjusted in every frequency band, an inverse time-frequency transform converts the adjusted mixed signal from the frequency domain back to the time domain, and the time-domain mixed signal is finally played through the playback device.
Embodiment 4
Corresponding to the foregoing method for playing a voice signal, an embodiment of the present invention further provides an apparatus for playing a voice signal. Referring to FIG. 9, which is a structural diagram of an apparatus for playing a voice signal according to the present invention, the apparatus includes an acquiring unit 901, a spatial parameter obtaining unit 902, and an adjusting unit 903. The internal structure and connections of the apparatus are described below with reference to its working principle.
The acquiring unit 901 is configured to acquire the mixed signal of the largest N-party sites and the orientation information of the site with the largest energy in each frequency band at each time instant.
The spatial parameter obtaining unit 902 is configured to obtain, according to the correspondence between the auditory spatial parameters of the playback device and orientation information, the auditory spatial parameters of the playback device that correspond to the orientation information of the site with the largest energy in each frequency band at each time instant.
The adjusting unit 903 is configured to adjust the mixed signal by using the auditory spatial parameters of the playback device, so that the adjusted mixed signal can be played.
The acquisition unit 901 may include:
a first receiving unit 904, configured to receive a mixed audio bitstream and an orientation information bitstream; and
a first decoding unit 905, configured to decode the mixed audio bitstream and the orientation information bitstream to obtain the mixed signal and the orientation information of the highest-energy site in each frequency band at each moment.
The first receiving unit 904 may be replaced by a second receiving unit, configured to receive the mixed audio bitstream and the orientation information of the highest-energy site in each frequency band at each moment; the first decoding unit 905 may be replaced by a second decoding unit, configured to decode the mixed audio bitstream to obtain the mixed signal.
In this embodiment, the acquisition unit 901 may also include the first receiving unit, the first decoding unit, the second receiving unit, and the second decoding unit at the same time.
The spatial parameter obtaining unit 902 may include:
an auditory spatial parameter presetting unit 906, configured to preset, for the playback devices, level parameters and delay parameters corresponding to orientation information; and
a query unit 907, configured to query the correspondence between orientation information and the level and delay parameters, to obtain the level parameters and delay parameters that correspond to the orientation information of the highest-energy site in each frequency band at each moment.
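The division of work between the presetting unit 906 and the query unit 907 amounts to a table lookup, which might look like the sketch below. Everything concrete in it is an assumption made for illustration: orientation is represented as an azimuth in degrees, there are two playback devices named `left` and `right`, the table values are invented, and an orientation missing from the table is resolved to the nearest preset entry.

```python
# Hypothetical preset table (role of unit 906):
# azimuth in degrees -> playback device -> (level gain, delay in milliseconds).
SPATIAL_TABLE = {
    -30: {"left": (1.0, 0.0), "right": (0.5, 0.6)},
    0: {"left": (0.8, 0.0), "right": (0.8, 0.0)},
    30: {"left": (0.5, 0.6), "right": (1.0, 0.0)},
}


def lookup_spatial_params(azimuth_deg, table=SPATIAL_TABLE):
    """Role of unit 907: return the per-device parameters for one orientation.

    Falling back to the nearest preset azimuth is an assumption of this sketch.
    """
    nearest = min(table, key=lambda preset: abs(preset - azimuth_deg))
    return table[nearest]


def params_for_frame(band_azimuths, table=SPATIAL_TABLE):
    """Look up parameters for every frequency band of one time frame."""
    return [lookup_spatial_params(azimuth, table) for azimuth in band_azimuths]


# Orientation of the highest-energy site in each of three bands of one frame.
frame_params = params_for_frame([-30, 0, 25])
```

Keeping the table on the playback side means the sender only has to transmit orientation information, and each terminal can hold a table suited to its own loudspeaker layout.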
It should be noted that FIG. 9 is not a complete structural diagram of a voice signal playback apparatus; it shows only the parts relevant to the inventive points of the present invention. It will be clear to those skilled in the art that the playback apparatus also includes a player, that the adjusted mixed signal output by the adjusting unit 903 serves as the input signal of the player, and that the player plays the adjusted mixed signal.
As the above embodiments of the present invention show, when the voice signal is processed, orientation information is set in advance for all sites participating in the conference, the orientation information of the highest-energy site in each frequency band is determined among the largest-N sites, and that orientation information is sent together with the mixed signal. When the voice signal is played, the spatial parameters of each playback device at the playback end are obtained from the received orientation information and the correspondence between orientation information and the playback devices' spatial parameters, and the mixed signal is adjusted using those spatial parameters. When the adjusted mixed signal is played, the auditory space of the sound sources is reconstructed at the site, so that the sound of the largest-N sites has a sense of spatial depth, users can clearly distinguish the sound of each of the largest-N sites, and their sense of presence is enhanced.
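The processing-side step summarized above, finding the highest-energy site in each frequency band and taking its orientation, can be sketched as follows. This is a sketch under assumptions: each site's signal for the current frame is already available as a list of frequency bins, band energy is the sum of squared bin magnitudes, and the names `band_energy`, `dominant_orientation_per_band`, `site_spectra`, and `site_orientation` are hypothetical.

```python
def band_energy(spectrum, lo, hi):
    """Energy of bins lo..hi-1, taken here as the sum of squared magnitudes."""
    return sum(abs(value) ** 2 for value in spectrum[lo:hi])


def dominant_orientation_per_band(site_spectra, site_orientation, band_edges):
    """For each band, return the orientation of the highest-energy site.

    site_spectra:     site id -> list of frequency bins for the current frame.
    site_orientation: site id -> orientation information set for that site.
    band_edges:       list of half-open bin ranges (lo, hi).
    """
    orientations = []
    for lo, hi in band_edges:
        loudest = max(site_spectra,
                      key=lambda site: band_energy(site_spectra[site], lo, hi))
        orientations.append(site_orientation[loudest])
    return orientations


# Two sites, four bins, two bands: site A dominates the low band, B the high band.
spectra = {"A": [1.0, 1.0, 0.0, 0.0], "B": [0.0, 0.0, 2.0, 2.0]}
orientation = {"A": -30, "B": 30}
per_band = dominant_orientation_per_band(spectra, orientation, [(0, 2), (2, 4)])
```

The resulting list, one orientation per band for each frame, is the side information that accompanies the mixed signal.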
In addition, when the highest-energy site is in the multi-picture, its orientation information changes along with its position in the multi-picture, so that when the voice signal is played the orientation of the sound source matches the orientation of the image, which further enhances the user's sense of presence.
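The choice between preset orientation and multi-picture orientation can be illustrated with a short sketch. It assumes, hypothetically, that preset orientations are handed out in join order from a fixed list of positions (reused cyclically if there are more sites than positions) and that the multi-picture layout is available as a mapping from site id to on-screen orientation; the function names are illustrative only.

```python
def assign_preset_orientations(join_order, positions):
    """Give each site a preset orientation according to the order it joined.

    Reusing positions cyclically when sites outnumber positions is an
    assumption of this sketch.
    """
    return {site: positions[index % len(positions)]
            for index, site in enumerate(join_order)}


def orientation_of(site, preset, multi_picture_layout):
    """Use the multi-picture orientation when the site is shown there,
    otherwise fall back to its preset orientation."""
    if site in multi_picture_layout:
        return multi_picture_layout[site]
    return preset[site]


preset = assign_preset_orientations(["A", "B", "C"], [-30, 0, 30])
layout = {"B": 30}  # site B is currently shown on the right of the multi-picture
shown = orientation_of("B", preset, layout)
hidden = orientation_of("A", preset, layout)
```

Because the layout mapping is consulted for every frame, a site that moves within the multi-picture is rendered from its new position as soon as the layout changes.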
The method and apparatus for processing and playing voice signals provided by the present invention have been described in detail above. The description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for processing a voice signal, characterized in that the method comprises:
determining, among the largest-N sites and according to the orientation information set for the sites participating in the conference, the orientation information of the highest-energy site in each frequency band at each moment of the mixed signal;
sending the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference.
2. The method according to claim 1, characterized in that determining, among the largest-N sites and according to the orientation information set for the sites participating in the conference, the orientation information of the highest-energy site in each frequency band at each moment comprises:
presetting an orientation for each site participating in the conference in turn, according to the order in which the sites joined, to obtain preset orientation information;
comparing the energy values of the voice signals of the largest-N sites in each frequency band at each moment, to obtain the highest-energy site in each frequency band at each moment;
when the highest-energy site is not in the multi-picture, setting the orientation information of the highest-energy site according to the preset orientation information; and when the highest-energy site is in the multi-picture, setting the orientation information of the highest-energy site according to the multi-picture orientation information.
3. The method according to claim 1, characterized in that determining, among the largest-N sites and according to the orientation information set for the sites participating in the conference, the orientation information of the highest-energy site in each frequency band at each moment comprises:
presetting an orientation for each of the largest-N sites in turn, according to the order in which the sites joined, to obtain preset orientation information of the largest-N sites;
comparing the energy values of the voice signals of the largest-N sites in each frequency band at each moment, to obtain the highest-energy site in each frequency band at each moment;
when the highest-energy site is not in the multi-picture, setting the orientation information of the highest-energy site according to the preset orientation information; and when the highest-energy site is in the multi-picture, setting the orientation information of the highest-energy site according to the multi-picture orientation information.
4. The method according to claim 1, characterized in that sending the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference comprises:
encoding the mixed signal and the orientation information of the highest-energy site in each frequency band at each moment to obtain a mixed audio bitstream and an orientation information bitstream respectively, and sending the mixed audio bitstream and the orientation information bitstream to the site terminals participating in the conference.
5. The method according to claim 4, characterized in that the orientation information of the highest-energy site in each frequency band at each moment is encoded in a manner different from the manner in which the mixed signal is encoded.
6. The method according to claim 1, characterized in that sending the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference comprises:
encoding the mixed signal to obtain a mixed audio bitstream, and sending the mixed audio bitstream and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference.
7. The method according to claim 1, characterized in that sending the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference comprises:
encoding the mixed signal to obtain a mixed audio bitstream, and sending the mixed audio bitstream, the number of the highest-energy site in each frequency band at each moment, and the orientation information of the largest-N sites to the site terminals participating in the conference.
8. A method for playing a voice signal, characterized in that the method comprises:
acquiring the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment;
obtaining, according to the correspondence between the auditory spatial parameters of the playback devices and orientation information, the auditory spatial parameters of the playback devices that correspond to the orientation information of the highest-energy site in each frequency band at each moment;
adjusting the mixed signal using the auditory spatial parameters of the playback devices, and playing the adjusted mixed signal.
9. The method according to claim 8, characterized in that acquiring the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment comprises:
receiving a mixed audio bitstream and an orientation information bitstream;
decoding the mixed audio bitstream and the orientation information bitstream to obtain the mixed signal and the orientation information of the highest-energy site in each frequency band at each moment.
10. The method according to claim 8, characterized in that acquiring the mixed signal and the orientation information of the highest-energy site in each frequency band at each moment comprises:
receiving a mixed audio bitstream and the orientation information of the highest-energy site in each frequency band;
decoding the mixed audio bitstream to obtain the mixed signal.
11. The method according to claim 8, characterized in that the auditory spatial parameters of the playback devices comprise a level parameter and a delay parameter.
12. The method according to claim 11, characterized in that obtaining, according to the correspondence between the auditory spatial parameters of the playback devices and orientation information, the auditory spatial parameters of the playback devices that correspond to the orientation information of the highest-energy site in each frequency band at each moment comprises:
querying the correspondence, preset for the playback devices, between orientation information and the level and delay parameters, to obtain the level parameters and delay parameters that correspond to the orientation information of the highest-energy site in each frequency band at each moment.
13. The method according to claim 8, characterized in that acquiring the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment comprises:
receiving a mixed audio bitstream, the number of the highest-energy site in each frequency band at each moment, and the orientation information of the largest-N sites;
decoding the mixed audio bitstream and the orientation information bitstream to obtain the mixed signal;
obtaining the orientation information of the highest-energy site in each frequency band at each moment according to the number of the highest-energy site in each frequency band at each moment and the orientation information of the largest-N sites.
14. An apparatus for processing a voice signal, characterized in that the apparatus comprises:
an orientation determining unit, configured to determine, among the largest-N sites and according to the orientation information set for the sites participating in the conference, the orientation information of the highest-energy site in each frequency band at each moment of the mixed signal;
a sending unit, configured to send the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference.
15. The apparatus according to claim 14, characterized in that the orientation determining unit comprises:
a first orientation presetting unit, configured to preset an orientation for each site participating in the conference in turn, according to the order in which the sites joined, to obtain preset orientation information;
a comparing unit, configured to compare the energy values of the voice signals of the largest-N sites in each frequency band at each moment, to obtain the highest-energy site in each frequency band at each moment;
a first setting unit, configured to set the orientation information of the highest-energy site according to the preset orientation information when the highest-energy site is not in the multi-picture;
a second setting unit, configured to set the orientation information of the highest-energy site according to the multi-picture orientation information when the highest-energy site is in the multi-picture.
16. The apparatus according to claim 14, characterized in that the orientation determining unit comprises:
a second orientation presetting unit, configured to preset an orientation for each of the largest-N sites in turn, according to the order in which the sites joined, to obtain preset orientation information of the largest-N sites;
a comparing unit, configured to compare the energy values of the voice signals of the largest-N sites in each frequency band at each moment, to obtain the highest-energy site in each frequency band at each moment;
a third setting unit, configured to set the orientation information of the highest-energy site according to the preset orientation information when the highest-energy site is not in the multi-picture;
a fourth setting unit, configured to set the orientation information of the highest-energy site according to the multi-picture orientation information when the highest-energy site is in the multi-picture.
17. The apparatus according to any one of claims 14 to 16, characterized in that the sending unit comprises:
a first sending unit, configured to encode the mixed signal and the orientation information of the highest-energy site in each frequency band at each moment to obtain a mixed audio bitstream and an orientation information bitstream respectively, and to send the mixed audio bitstream and the orientation information bitstream to the site terminals participating in the conference;
and/or
a second sending unit, configured to encode the mixed signal to obtain a mixed audio bitstream, and to send the mixed audio bitstream and the orientation information of the highest-energy site in each frequency band at each moment to the site terminals participating in the conference.
18. An apparatus for playing a voice signal, characterized in that the apparatus comprises:
an acquisition unit, configured to acquire the mixed signal of the largest-N sites and the orientation information of the highest-energy site in each frequency band at each moment;
a spatial parameter obtaining unit, configured to obtain, according to the correspondence between the auditory spatial parameters of the playback devices and orientation information, the auditory spatial parameters of the playback devices that correspond to the orientation information of the highest-energy site in each frequency band at each moment;
an adjusting unit, configured to adjust the mixed signal using the auditory spatial parameters of the playback devices, so that the adjusted mixed signal can be played.
19. The apparatus according to claim 18, characterized in that the acquisition unit comprises:
a first receiving unit, configured to receive a mixed audio bitstream and an orientation information bitstream;
a first decoding unit, configured to decode the mixed audio bitstream and the orientation information bitstream to obtain the mixed signal and the orientation information of the highest-energy site in each frequency band at each moment.
20. The apparatus according to claim 18, characterized in that the acquisition unit comprises:
a second receiving unit, configured to receive a mixed audio bitstream and the orientation information of the highest-energy site in each frequency band at each moment;
a second decoding unit, configured to decode the mixed audio bitstream to obtain the mixed signal.
PCT/CN2010/070491 2009-02-19 2010-02-03 Method and device for processing and reproducing speech signals WO2010094219A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200910005681.X 2009-02-19
CN200910005681XA CN101510988B (en) 2009-02-19 2009-02-19 Method and apparatus for processing and playing voice signal

Publications (1)

Publication Number Publication Date
WO2010094219A1 true WO2010094219A1 (en) 2010-08-26

Family

ID=41003219

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/070491 WO2010094219A1 (en) 2009-02-19 2010-02-03 Method and device for processing and reproducing speech signals

Country Status (2)

Country Link
CN (1) CN101510988B (en)
WO (1) WO2010094219A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510988B (en) * 2009-02-19 2012-03-21 华为终端有限公司 Method and apparatus for processing and playing voice signal
CN102222503B (en) * 2010-04-14 2013-08-28 华为终端有限公司 Mixed sound processing method, device and system of audio signal
CN102270456B (en) * 2010-06-07 2012-11-21 华为终端有限公司 Method and device for audio signal mixing processing
CN101877643B (en) * 2010-06-29 2014-12-10 中兴通讯股份有限公司 Multipoint sound-mixing distant view presenting method, device and system
CN102436818A (en) * 2011-10-25 2012-05-02 浙江万朋网络技术有限公司 Routing and overdubbing method for server end based on priority of energy
CN103794216B (en) * 2014-02-12 2016-08-24 能力天空科技(北京)有限公司 A kind of sound mixing processing method and processing device
CN103870234B (en) * 2014-02-27 2017-03-15 北京六间房科技有限公司 A kind of sound mixing method and its device
CN104167210A (en) * 2014-08-21 2014-11-26 华侨大学 Lightweight class multi-side conference sound mixing method and device
CN115065571B (en) * 2022-06-14 2023-10-27 南昌职业大学 Voice equipment for big conference place

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030026441A1 (en) * 2001-05-04 2003-02-06 Christof Faller Perceptual synthesis of auditory scenes
JP2005110103A (en) * 2003-10-01 2005-04-21 Kyushu Electronics Systems Inc Voice normalizing method in video conference
US20050135280A1 (en) * 2003-12-18 2005-06-23 Lam Siu H. Distributed processing in conference call systems
CN101510988A (en) * 2009-02-19 2009-08-19 深圳华为通信技术有限公司 Method and apparatus for processing and playing voice signal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1962547B1 (en) * 2005-11-02 2012-06-13 Yamaha Corporation Teleconference device
CN1937664B (en) * 2006-09-30 2010-11-10 华为技术有限公司 System and method for realizing multi-language conference
CN101179693B (en) * 2007-09-26 2011-02-02 深圳市迪威视讯股份有限公司 Mixed audio processing method of session television system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101951492A (en) * 2010-09-15 2011-01-19 中兴通讯股份有限公司 Method and device for recording videos in video call
CN116403589A (en) * 2023-03-01 2023-07-07 天地阳光通信科技(北京)有限公司 Audio processing method, unit and system
CN116403589B (en) * 2023-03-01 2024-06-11 天地阳光通信科技(北京)有限公司 Audio processing method, unit and system

Also Published As

Publication number Publication date
CN101510988A (en) 2009-08-19
CN101510988B (en) 2012-03-21

Similar Documents

Publication Publication Date Title
WO2010094219A1 (en) Method and device for processing and reproducing speech signals
US9843455B2 (en) Conferencing system with spatial rendering of audio data
US8477950B2 (en) Home theater component for a virtualized home theater system
US20190073993A1 (en) Artificially generated speech for a communication session
CN101132516B (en) Method, system for video communication and device used for the same
US8243120B2 (en) Method and device for realizing private session in multipoint conference
WO2011153905A1 (en) Method and device for audio signal mixing processing
US9113034B2 (en) Method and apparatus for processing audio in video communication
US9172912B2 (en) Telepresence method, terminal and system
US20110261151A1 (en) Video and audio processing method, multipoint control unit and videoconference system
US20130064387A1 (en) Audio processing method, system, and control server
US8749611B2 (en) Video conference system
US9088690B2 (en) Video conference system
WO2008141539A1 (en) A caption display method and a video communication system, apparatus
WO2013053336A1 (en) Sound mixing method, device and system
WO2011057511A1 (en) Method, apparatus and system for implementing audio mixing
WO2012142975A1 (en) Conference terminal audio signal processing method, and conference terminal and video conference system
WO2008014697A1 (en) A method and an apparatus for obtaining acoustic source location information and a multimedia communication system
CN112135285B (en) Real-time audio interaction method for multi-Bluetooth audio equipment
WO2011127816A1 (en) Mixing processing method, device and system of audio signals
WO2011015136A1 (en) Method, equipment and system for conference control
WO2014094461A1 (en) Method, device and system for processing video/audio information in video conference
WO2012055291A1 (en) Method and system for transmitting audio data
WO2014026478A1 (en) Video conference signal processing method, video conference server and video conference system
WO2011153926A1 (en) Method for broadcasting meeting place image and multipoint control unit

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10743400

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10743400

Country of ref document: EP

Kind code of ref document: A1