WO2014094461A1 - Method, device and system for processing video/audio information in video conference - Google Patents


Info

Publication number
WO2014094461A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
conference
sign language
processing
video
Prior art date
Application number
PCT/CN2013/083170
Other languages
French (fr)
Chinese (zh)
Inventor
倪为 (Ni Wei)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2014094461A1 publication Critical patent/WO2014094461A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04N 7/15 Conference systems

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a method, device, and system for processing video and audio information in a video conference.
  • When deaf-mute people communicate with each other or with normal people remotely, they usually use special equipment or systems, such as terminals that implement sign language, text, and voice conversion, or have a third party perform sign language/voice translation, to solve the problem of face-to-face communication between deaf-mute people, or between deaf-mute people and normal people.
  • Figure 1 is a schematic diagram of remote communication between deaf and dumb people.
  • Deaf-mute person A performs sign language expression, and video communication terminal A (such as a video phone, video conference terminal, or desktop soft terminal) collects the sign language image and transmits it to video communication terminal B via the communication network; deaf-mute person B sees the sign language image of deaf-mute person A presented on video communication terminal B and understands the meaning the other party expresses. The reverse direction works the same way, completing the entire communication process.
  • Fig. 2 is a schematic diagram of remote communication between a deaf person and a normal person.
  • The sign language image of the deaf-mute person is collected by video communication terminal A and distributed through the multipoint control unit to the video communication terminals of the normal person and the translator.
  • The translator's translated voice is picked up by the translator's video communication terminal and distributed through the multipoint control unit to video communication terminal B, where the normal person is located; the normal person understands the content expressed by the deaf-mute person through the translator's voice.
  • Conversely, the normal person's voice is picked up by video communication terminal B and distributed to the translator through the multipoint control unit; the translator translates it into sign language, the sign language image is collected by the translator's video communication terminal and distributed through the multipoint control unit to the deaf-mute person, and the deaf-mute person understands the content expressed by the normal person through the translator's sign language image.
  • The object of the present invention is to solve the problem in the prior art that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference, by providing a method, device, and system for processing video and audio information in a video conference.
  • an embodiment of the present invention provides a method for processing video and audio information in a video conference, where the video conference includes at least two conference sites, and each conference site includes at least one conference terminal, and the method includes:
  • determining, when sign language information exists in the at least two pieces of decoded information, converting the sign language information into voice information, and performing speech synthesis processing on the converted voice information to generate synthesized speech information; mixing the generated synthesized speech information with other decoded audio information; and sending the mixed audio information to the at least two conference sites.
  • the receiving the data code stream sent by the at least two conference terminals, and decoding the data code stream to obtain at least two pieces of decoding information further includes:
  • Performing voice synthesis processing on the converted voice information to generate synthesized voice information is specifically:
  • the performing speech synthesis processing on the converted voice information to generate synthesized voice information is specifically:
  • the converted speech information that does not exceed the first threshold is subjected to speech synthesis processing to generate synthesized speech information.
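Concretely, the claimed flow (decode, convert detected sign language into synthesized speech, then mix within a site limit) can be sketched in Python. The item format, the function name, and the relabelling used to stand in for real sign-to-speech conversion are all illustrative assumptions, not the patented implementation:

```python
def process_conference(decoded_items, first_threshold=4):
    """Sketch of the claimed flow. Each decoded item is a dict with a
    "kind" of "sign" (sign-language video) or "audio" (picked-up speech);
    this item format and all names are illustrative assumptions."""
    synthesized = []
    audio = []
    for item in decoded_items:
        if item["kind"] == "sign":
            # Sign language -> voice -> synthesized speech, modelled here
            # as a simple relabelling of the payload.
            synthesized.append("speech:" + item["data"])
        else:
            audio.append(item["data"])
    # Synthesized speech is admitted to the mix first; other audio fills
    # any remaining slots up to the maximum mix size (the first threshold).
    return (synthesized + audio)[:first_threshold]
```

For example, one signing site and one speaking site yield a two-way mix, while five signing sites are capped at the threshold of four.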
  • the method before the converting the sign language information into the voice information, the method further includes: recording a first time value of each of the sign language information obtained after decoding;
  • the mixing and processing the generated synthesized speech information with other decoded audio information is:
  • the method further includes:
  • the method further includes: converting the sign language information into text information, and superimposing the text information with a current image in the conference terminal, for display in at least two conference sites.
  • an embodiment of the present invention provides a processing device for video and audio information in a video conference, where the device includes:
  • a decoding unit configured to receive a data code stream sent by at least two conference terminals, and decode the data code stream to obtain at least two pieces of decoded information;
  • a conversion synthesizing unit configured to: when sign language information exists in the at least two pieces of decoded information in the decoding unit, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information;
  • a mixing processing unit configured to perform mixing processing on the synthesized voice information generated by the conversion combining unit and other decoded audio information
  • a sending unit configured to send the audio information after the mixing in the mixing processing unit to the at least two conference sites.
  • the decoding unit is further configured to: receive user attribute information that is input by a user or sent by the conference terminal;
  • the conversion synthesizing unit is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, and generate synthesized speech information that matches the user attribute information.
  • the device further includes:
  • a determining unit configured to determine whether the number of sites where the conference terminal that sends the sign language information is located exceeds a first threshold, and sends the determination result to the conversion synthesizing unit;
  • the conversion synthesizing unit is configured to: when receiving a determination result indicating that the determining unit determines that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted voice information to generate synthesized voice information;
  • the conversion synthesizing unit is configured to: when receiving a determination result indicating that the determining unit determines that the number of conference sites exceeds the first threshold, perform speech synthesis processing on the converted voice information not exceeding the first threshold to generate synthesized voice information.
  • the conversion synthesizing unit is further configured to: record a first time value of each of the sign language information obtained after decoding;
  • the mixing processing unit is specifically configured to: sort the first time values from large to small; according to the ordering of the first time values, perform gain amplification on the synthesized voice information according to a preset gain coefficient;
  • the device further includes:
  • a sign language conversion unit configured to convert the other decoded audio information participating in the mixing process into sign language information
  • a scaling processing unit configured to perform scaling processing on the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;
  • a superimposing unit configured to superimpose the scaled processed sign language information and the current image in the conference terminal to be displayed in at least two conference sites.
  • the device further includes: a text conversion unit, configured to convert the sign language information into text information, and superimpose the text information with a current image in the conference terminal, for display in at least two conference sites.
  • an embodiment of the present invention provides a system for processing video and audio information in a video conference, where the system includes: at least two conference sites, each of the conference sites includes at least one conference terminal, and the processing device for video and audio information in a video conference according to any one of claims 8 to 14.
  • In the embodiments of the present invention, the multipoint control unit decodes the data stream sent by the conference terminal; when the decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information, mixes the generated synthesized voice information with other decoded audio information, and sends the mixed audio information to the at least two conference sites, thereby solving the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in multi-party remote video conferences.
  • FIG. 1 is a schematic diagram of remote communication between deaf-mute people in the prior art;
  • FIG. 2 is a schematic diagram of remote communication between a deaf person and a normal person in the prior art
  • FIG. 3 is a flow chart of a method for processing video and audio information in a video conference according to Embodiment 1 of the present invention
  • FIG. 4 is a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of image overlay according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of image overlay according to an embodiment of the present invention.
  • FIG. 7 is a structural diagram of an apparatus for processing video and audio information in a video conference according to Embodiment 2 of the present invention
  • FIG. 8 is a structural diagram of an apparatus for processing video and audio information in a video conference according to Embodiment 3 of the present invention.
  • FIG. 3 is a flowchart of a method for processing video and audio information in a video conference according to the embodiment of the present invention.
  • The following takes a multipoint control server, namely a multipoint control unit (Multipoint Control Unit, MCU for short), as an example. As shown in FIG. 3, this embodiment includes the following steps:
  • Step 310 Receive a data code stream sent by at least two conference terminals, and decode the data code stream to obtain at least two pieces of decoding information.
  • FIG. 4 is a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention, where a conference terminal exists in each conference site of the multi-party conference, and the video conference includes at least two conference sites (as shown in Figure 4, which includes four conference sites, each containing one conference terminal).
  • each conference site includes at least one conference terminal
  • The conference terminal is configured to collect and output audio and video information in the conference site and to pick up original user information, where the original user information is specifically the user's sign language information, voice information, etc. The conference terminal encodes the original user information to form a data stream and sends the data stream to the MCU, and the MCU receives the data stream sent by the conference terminal.
  • The conference terminal refers to a device that has the functions of acquiring images, picking up sound, and accepting external input, and is responsible for transmitting the acquired video image to a display for presentation and the received audio information to a speaker for playback; for example, a video terminal.
  • the MCU After receiving the data code stream, the MCU decodes the data code stream to obtain at least two pieces of decoding information, where the decoding information includes sign language information collected by the conference terminal, the picked up audio information, and the like.
  • Step 320: When it is determined that sign language information exists in the at least two pieces of decoded information, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information.
  • the MCU converts the sign language information into voice information.
  • The MCU determines that sign language information exists in the at least two pieces of decoded information. Specifically, after decoding, when the MCU can restore the decoded data to sign language information, it converts the sign language information into voice information; when the MCU can restore the decoded data to audio information, it mixes the audio information, or converts the audio information into sign language information.
  • the sign language information is a gesture action performed by any deaf person collected at the conference terminal.
  • When a deaf-mute person needs to express an opinion in the conference, he or she faces the conference terminal and performs sign language expression; the conference terminal collects the deaf-mute person's sign language information and sends it to the MCU. After obtaining the sign language information, the MCU converts the sign language information into voice information.
  • After converting the sign language information into voice information, the MCU performs speech synthesis on the converted voice information to generate synthesized voice information.
  • Specifically, the MCU determines whether the number of conference sites where the terminals sending sign language information are located exceeds a first threshold. If the first threshold is not exceeded, the converted voice information is subjected to speech synthesis processing to generate synthesized voice information; if the first threshold is exceeded, speech synthesis processing is performed only on converted voice information from at most the first threshold number of sites to generate synthesized voice information.
  • The first threshold is the maximum number of mixing sites that the MCU can bear; in general, the maximum mix is a four-party mix.
  • Before converting the sign language information into voice information, the MCU also records a first time value for each piece of sign language information obtained after decoding. The MCU records the first time values so that, during the subsequent mixing process, it can select the voice information to participate in the mixing according to the recorded first time value of each piece of sign language information.
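The recorded first time values can drive this later selection. A minimal sketch follows, assuming (as the worked example later in the text suggests) that earlier-sent sign language is preferred when more sites sign than the threshold allows; that tie-breaking rule is an interpretation, not an explicit statement of the patent:

```python
def select_sign_inputs(sign_records, first_threshold=4):
    """sign_records: (first_time_value, site_id) pairs recorded at decode
    time. At most `first_threshold` pieces of sign language information are
    chosen for synthesis; preferring the earliest-sent ones is an assumption
    drawn from the worked example, not an explicit rule of the patent."""
    ordered = sorted(sign_records)          # smallest (earliest) time first
    return [site for _, site in ordered[:first_threshold]]
```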
  • Step 330: Mix the generated synthesized voice information with other decoded audio information.
  • the MCU mixes the generated synthesized speech information with other decoded audio information, such that the users in the multi-party conference receive signals with satisfactory speech quality.
  • Specifically, the MCU mixes the generated synthesized voice information with other decoded audio information as follows: the MCU sorts the first time values recorded in step 320 from large to small and, according to this ordering, performs gain amplification on the synthesized voice information according to preset gain coefficients; the MCU also calculates the energy values of the other decoded audio information, sorts them from large to small, and performs gain amplification on the other decoded audio information according to gain coefficients assigned by this ordering; finally, the MCU mixes the gain-processed synthesized voice information with the gain-processed other decoded audio information.
  • the number of pieces of audio information participating in the mixing process does not exceed the first threshold. Further, when the decoded data information in the decoded information does not include the audio information, the MCU performs a mixing process on the plurality of synthesized voice information generated, and the plurality of synthesized voice information participating in the mixing process does not exceed the first threshold.
  • When the decoded data information in the decoded information includes both sign language information and audio information, the MCU preferentially converts the sign language information into voice information and synthesizes it, mixing the synthesized voice information with other decoded audio information without exceeding the first threshold. When the synthesized voice information alone reaches the first threshold, only the synthesized voice information is mixed and the other decoded audio information is discarded, guaranteeing priority to the expression of deaf-mute people and solving the communication problem between deaf-mute people and normal people.
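The mixing rule above, gain-amplified synthesized speech first and other audio only in the remaining slots, can be sketched as follows; the concrete gain sequences mirror the worked example later in the text and are otherwise assumptions:

```python
def mix_streams(synth_streams, audio_streams, first_threshold=4):
    """Each stream is a list of samples. Synthesized (sign-derived) speech
    keeps priority; plain audio is admitted only while the total number of
    mixed streams stays within the first threshold, and is discarded
    otherwise. Gain sequences are illustrative assumptions."""
    synth_gains = [1.0, 0.8, 0.6, 0.4]      # by send-time order
    audio_gains = [0.8, 0.6, 0.4, 0.2]      # by energy order
    chosen_synth = synth_streams[:first_threshold]
    chosen_audio = audio_streams[:first_threshold - len(chosen_synth)]
    length = max((len(s) for s in chosen_synth + chosen_audio), default=0)
    out = [0.0] * length
    for gains, streams in ((synth_gains, chosen_synth),
                           (audio_gains, chosen_audio)):
        for gain, stream in zip(gains, streams):
            for i, sample in enumerate(stream):
                out[i] += gain * sample
    return out
```

When four sign-language sites are active, the audio slice is empty and normal speech is dropped from the mix, matching the stated priority for deaf-mute expression.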
  • Step 340 Send the audio information after the mixing to the at least two conference sites.
  • The MCU sends the mixed audio information to the at least two conference sites, where the at least two conference sites include the conference sites that sent the data code streams.
  • In this embodiment, the multipoint control unit decodes the data stream sent by the conference terminal; when the decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information, mixes the generated synthesized voice information with other decoded audio information, and sends the mixed audio information to the at least two conference sites, thereby solving the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in multi-party remote video conferences.
  • Further, the embodiment of the present invention includes a step in which the MCU receives user attribute information input by a user or sent by the conference terminal. By receiving the user attribute information, the MCU can, when performing speech synthesis processing, generate synthesized voice information that matches the user attribute information, so that listeners in the conference sites feel the voice is realistic and the sense of communication presence is enhanced.
  • the MCU receives user attribute information input by the user or sent by the conference terminal.
  • Specifically, the deaf-mute person may input his or her own attribute information into the MCU, where the user attribute information includes gender, age, nationality, etc.; alternatively, the deaf-mute person may input the attribute information into the conference terminal in the conference site, and the conference terminal sends it to the MCU.
  • The MCU performs speech synthesis processing on the converted voice information according to the user attribute information, generating synthesized voice information that matches the user attribute information.
  • Specifically, after converting the sign language information into voice information, the MCU obtains the user attribute information corresponding to the sign language information and, according to that information, performs speech synthesis processing on the converted voice information to generate synthesized voice information matching the user attribute information.
  • For example, if the sign language information collected by the conference terminal is made by a Chinese middle-aged male, then after converting the sign language information into voice information, the MCU obtains the user attribute information corresponding to the sign language information and generates, according to it, synthesized voice information with the voice characteristics of a Chinese middle-aged male.
  • The MCU also adjusts the speech rate, tone, etc. of the synthesized voice information according to the speed of the sign language, so that the normal people in the other conference sites feel the voice is more realistic while listening, enhancing the sense of communication presence.
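The attribute-matched synthesis described above can be sketched as a mapping from user attributes to synthesis parameters. Every attribute key, parameter value, and the `sign_speed` adjustment below are illustrative assumptions, not values specified by the patent:

```python
def voice_profile(attrs):
    """Hypothetical mapping from the signer's user attributes (gender, age,
    nationality) and signing speed to speech-synthesis parameters, so the
    generated voice matches the signer. All values are illustrative."""
    profile = {"pitch": 1.0, "rate": 1.0, "language": "zh-CN"}
    if attrs.get("gender") == "male":
        profile["pitch"] = 0.8          # lower pitch for a male voice
    if attrs.get("age", 0) >= 40:
        profile["rate"] = 0.95          # slightly slower, middle-aged voice
    if attrs.get("nationality") == "CN":
        profile["language"] = "zh-CN"
    # Follow the signer's speed: faster signing -> faster speech rate.
    profile["rate"] *= attrs.get("sign_speed", 1.0)
    return profile
```

A real system would feed such a profile into a text-to-speech engine; here the profile itself stands in for that step.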
  • Further, the embodiment of the present invention includes steps of converting the other decoded audio information participating in the mixing process into sign language information and processing the converted sign language information. By performing these steps, the MCU enables the deaf-mute people in the conference sites both to express their own wishes and to understand what the normal people express, thereby better realizing free and effective communication between deaf-mute people and normal people.
  • the MCU converts the other decoded audio information participating in the mixing process into sign language information.
  • the MCU converts other decoded audio information participating in the mixing process into sign language information, and does not convert other decoded audio information that does not participate in the mixing process.
  • The MCU performs scaling processing on the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information.
  • Specifically, the MCU scales the converted sign language information according to a preset ratio, following the ordering of the energy values of the other decoded audio information in step 330: the image corresponding to converted sign language information with a large energy value is enlarged, and the image corresponding to converted sign language information with a small energy value is reduced.
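This energy-ranked scaling can be sketched as follows; the enlarge/shrink ratios are illustrative assumptions, since the patent only specifies a "preset ratio":

```python
def scale_sign_images(energies, enlarge=1.5, shrink=0.75):
    """Return one scale factor per converted sign-language image, ranked by
    the energy of its source audio: the highest-energy image is enlarged,
    all others are reduced. The 1.5/0.75 ratios are assumptions."""
    if not energies:
        return []
    top = max(range(len(energies)), key=lambda i: energies[i])
    return [enlarge if i == top else shrink for i in range(len(energies))]
```

This matches the FIG. 6 description: the loudest site's converted image is enlarged, the quieter sites' images are reduced.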
  • the MCU superimposes the scaled processed sign language information with the current image in the conference terminal for display in at least two conference sites.
  • The MCU uses the superimposition mode or the multi-picture mode to superimpose the scaled sign language information with the current image in the conference terminal for display in at least two conference sites.
  • the superimposition mode specifically refers to superimposing a plurality of sign language information on the current image in the conference terminal and presenting it in the display screen of the conference terminal, and the superposition mode will block the current image in the conference terminal, as shown in FIG. 5 .
  • The multi-picture mode specifically means that the plurality of sign language information and the current image in the conference terminal are presented together on the display screen of the conference terminal; the multi-picture mode does not block the current image in the conference terminal, as shown in FIG. 6.
  • In FIG. 6, the image of each conference site is obtained by converting audio information: the image of site 2 is enlarged after being converted into sign language information because its corresponding audio information has the largest energy value, while the images of site 3 and site 4 are reduced after conversion because their corresponding audio information has the smallest energy values.
  • Similar cases are handled in the same way and are not repeated here.
  • In this way, a deaf-mute person in a conference site can selectively view what the normal people in multiple conference sites wish to express and, according to the energy of the normal people's audio information, focus on the most prominent expression, so that deaf-mute people and normal people can communicate freely and effectively.
  • Further, the embodiment of the present invention includes a step of converting the sign language information into text information. Performing this step facilitates the communication of the multi-party conference; in particular, when the sign language information is not played on the main screen of the conference site, the text information can be used to assist the communication of the conference.
  • the MCU converts the sign language information into text information, and superimposes the text information with the current image in the conference terminal for display in at least two venues.
  • the MCU converts the sign language information into text information, and after the conversion, performs image superimposition processing on the text information and the current image in the conference terminal.
  • Specifically, the sign language information is converted into text information in the form of subtitles; when the sign language information is not played on the main screen, the text information can be used to assist the communication of the conference.
  • For example, suppose the deaf-mute people at conference site 1 and conference site 2 perform sign language expression, and the normal people at conference site 3 and conference site 4 express themselves by voice. The conference terminals at conference site 1 and conference site 2 collect the sign language information of the deaf-mute people, encode it, and send it to the MCU; the conference terminals at conference site 3 and conference site 4 pick up the voices of the normal people, encode them, and send them to the MCU.
  • After decoding, the MCU obtains the first sign language information of conference site 1 and the second sign language information of conference site 2. The MCU records the first time value of each piece of sign language information and converts the sign language information into voice information: the first sign language information is converted into first voice information, and the second sign language information is converted into second voice information.
  • The MCU determines that the number of sites sending sign language information is 2 and compares it with the preset first threshold of 4, which indicates that the maximum number of mixing sites the MCU can bear is four.
  • Because the number of sites sending sign language information does not exceed the first threshold, the MCU performs speech synthesis processing on the first voice information and the second voice information according to the previously received user attribute information, generating first synthesized voice information and second synthesized voice information that match the user attribute information.
  • The MCU sorts by the first time values: the sign language information of conference site 1 was sent before that of conference site 2, so the MCU performs gain amplification on the first synthesized voice information and the second synthesized voice information according to preset gain coefficients, for example, a gain coefficient of 1 for the first synthesized voice information and 0.8 for the second synthesized voice information.
  • The MCU also calculates the energy values of the first audio information and the second audio information and sorts them by energy value; since the energy value of the first audio information is greater than that of the second audio information, the MCU performs gain amplification on them accordingly, for example, with a gain coefficient of 0.8 for the first audio information and 0.6 for the second audio information.
  • the MCU uses the mixing mode to perform mixing processing on the synthesized speech information and audio information generated after the gain processing.
  • the audio information after the mixing is sent to at least two sites.
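The gain coefficients of this worked example can be checked with a short calculation; the unit-amplitude sample used below is an illustrative assumption:

```python
# Worked check of the example's gain coefficients. Synthesized speech is
# ordered by send time (site 1 before site 2); plain audio by energy
# (the first audio information is louder than the second).
gains = {
    "first_synth": 1.0,    # first synthesized voice information
    "second_synth": 0.8,   # second synthesized voice information
    "first_audio": 0.8,    # first audio information (larger energy)
    "second_audio": 0.6,   # second audio information (smaller energy)
}
# Mixing a unit-amplitude sample from every source simply sums the gains:
mixed_sample = sum(gains.values())
```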
  • The MCU converts the first audio information and the second audio information participating in the mixing process into converted first sign language information and converted second sign language information. According to the ordering of the calculated energy values of the first and second audio information, the MCU scales the converted sign language information by a preset ratio; in this embodiment, since the energy value of the first audio information is larger, the converted first sign language information is enlarged and the converted second sign language information is reduced.
  • the MCU superimposes the scaled first sign language information and the second sign language information with the current image in the conference terminal for display in at least two conference sites.
  • The MCU may further convert the first sign language information and the second sign language information into first text information and second text information and, after the conversion, superimpose the first and second text information onto the current image in the conference terminal.
  • Specifically, the first sign language information and the second sign language information are converted into first text information and second text information in the form of subtitles; when the deaf-mute people's first and second sign language information is not played on the main screen of a conference site, the first text information and the second text information can be used to assist the communication of the meeting.
  • The second embodiment of the present invention further provides a device for processing video and audio information in a video conference, which is used to implement the method for processing video and audio information in the video conference of the first embodiment. As shown in FIG. 7, the processing device includes: a decoding unit 710, a conversion synthesizing unit 720, a mixing processing unit 730, and a sending unit 740.
  • the decoding unit 710 is configured to receive a data code stream sent by at least two conference terminals, and decode the data code stream to obtain at least two pieces of decoding information.
• the conversion synthesizing unit 720 is configured to: when sign language information is included in the at least two channels of decoded information in the decoding unit, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information;
  • the mixing processing unit 730 is configured to perform mixing processing on the synthesized voice information generated by the conversion combining unit and other decoded audio information.
  • the sending unit 740 is configured to send the audio information after the mixing in the mixing processing unit to the at least two conference sites.
  • the decoding unit 710 is further configured to: receive user attribute information that is input by the user or sent by the conference terminal;
  • the conversion synthesizing unit 720 is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, and generate synthesized speech information that matches the user attribute information.
• the device further includes: a determining unit 750, configured to determine whether the number of conference sites whose conference terminals send the sign language information exceeds a first threshold, and send the determination result to the conversion synthesizing unit;
• the conversion synthesizing unit 720 is specifically configured to: when receiving a determination result that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted voice information to generate synthesized voice information;
• the conversion synthesizing unit 720 is specifically configured to: when the determining unit determines that the number of conference sites exceeds the first threshold, perform speech synthesis processing only on the converted voice information within the first threshold, to generate synthesized voice information.
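The determining unit's threshold check can be sketched as below. Which excess streams are dropped is not specified by the source, so taking the first `first_threshold` sites is an assumption for illustration.

```python
# Hedged sketch of gating speech synthesis by the number of conference
# sites that sent sign language information (the "first threshold").

def select_streams_for_synthesis(sign_language_sites, first_threshold):
    """Return the site IDs whose converted voice information will undergo
    speech synthesis; sites beyond the threshold are excluded."""
    sites = list(sign_language_sites)
    if len(sites) <= first_threshold:
        return sites                    # not exceeded: synthesize all
    return sites[:first_threshold]      # exceeded: only up to threshold

print(select_streams_for_synthesis(["A", "B"], 3))
print(select_streams_for_synthesis(["A", "B", "C", "D"], 3))
```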
  • the conversion synthesizing unit 720 is further configured to record a first time value of each of the sign language information obtained after decoding;
• the mixing processing unit 730 is specifically configured to: sort the first time values from large to small, and, according to the ordering of the first time values, apply gain amplification to the generated synthesized voice information according to preset gain coefficients.
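The time-ordered gain step can be sketched as follows; the concrete gain coefficients are assumed values, since the source only says they are preset.

```python
# Illustrative sketch: sort the recorded first time values from large to
# small, then amplify each synthesized voice stream by the preset gain
# coefficient for its rank.

def apply_time_ordered_gains(first_time_values, samples,
                             gains=(2.0, 1.5, 1.0)):
    order = sorted(range(len(first_time_values)),
                   key=lambda i: first_time_values[i], reverse=True)
    out = [None] * len(samples)
    for rank, idx in enumerate(order):
        g = gains[min(rank, len(gains) - 1)]
        out[idx] = [s * g for s in samples[idx]]
    return out

# The stream whose first time value is largest receives the largest gain.
print(apply_time_ordered_gains([5.0, 9.0], [[1.0], [1.0]]))
```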
  • the device further includes: a sign language conversion unit 760, configured to convert the other decoded audio information participating in the mixing process into sign language information;
• the scaling processing unit 770 is configured to scale the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;
  • the superimposing unit 780 is configured to superimpose the scaled processed sign language information with the current image in the conference terminal for display in at least two conference sites.
• the device further includes: a text conversion unit 790, configured to convert the sign language information into text information, and superimpose the text information with the current image in the conference terminal for display in at least two conference sites.
• With the device for processing video and audio information in a video conference provided by this embodiment of the present invention, the processing device decodes the data code streams sent by the conference terminals; when decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information; it mixes the generated synthesized voice information with other decoded audio information; and it sends the mixed audio information to at least two conference sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
• the apparatus for processing video and audio information in the video conference may also be implemented as follows, to implement the method for processing video and audio information in the video conference in the first embodiment of the present invention. As shown in FIG. 8, the processing device includes: a network interface 810, a processor 820, and a memory 830.
  • System bus 840 is used to connect network interface 810, processor 820, and memory 830.
  • Network interface 810 is used to communicate with the conference terminal.
• Memory 830 may be persistent storage, such as a hard drive or flash memory, and stores software modules and device drivers.
• The software modules can execute the functions of the above-described method of the present invention; the device drivers may be network and interface drivers.
• When it is determined that sign language information exists in the at least two channels of decoded information, the sign language information is converted into voice information, and speech synthesis processing is performed on the converted voice information to generate synthesized voice information; the generated synthesized voice information is mixed with other decoded audio information; and the mixed audio information is sent to the at least two conference sites.
• By accessing the software components in the memory 830, the processor executes, before the instruction for receiving the data code streams sent by the at least two conference terminals and decoding them to obtain at least two channels of decoded information, instructions for the following procedure: receiving user attribute information input by a user or sent by the conference terminal; the performing speech synthesis processing on the converted voice information to generate synthesized voice information is then specifically: performing speech synthesis processing according to the user attribute information, to generate synthesized voice information that matches the user attribute information.
• By accessing the software components in the memory 830, the processor also executes specific instructions for the process of performing speech synthesis processing on the converted voice information to generate synthesized voice information: when the number of conference sites sending sign language information exceeds a first threshold, the converted voice information within the first threshold is subjected to speech synthesis processing to generate synthesized voice information.
• The processor likewise executes specific instructions for the process of mixing the generated synthesized voice information with other decoded audio information.
• With the processing device for video and audio information in a video conference provided by this embodiment of the present invention, the processing device decodes the data code streams sent by the conference terminals; when decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information; it mixes the generated synthesized voice information with other decoded audio information; and it sends the mixed audio information to at least two conference sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, a software module executed by a processor, or a combination of both.
• the software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Abstract

The embodiments of the present invention relate to a method, device, and system for processing video/audio information in a video conference. The method includes: receiving data code streams sent by at least two conference terminals, and decoding the data code streams to obtain at least two channels of decoded information; when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis on the converted voice information to generate synthesized voice information; mixing the generated synthesized voice information with other decoded audio information; and sending the mixed audio information to at least two conference sites.

Description

Method, Device and System for Processing Video and Audio Information in a Video Conference

This application claims priority to Chinese Patent Application No. 201210560387.7, entitled "Method, Device and System for Processing Video and Audio Information in a Video Conference", filed with the Chinese Patent Office on December 21, 2012, the entire contents of which are incorporated herein by reference.

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a method, device, and system for processing video and audio information in a video conference.

Background

With the development and progress of society, deaf-mute people, as a vulnerable group, are receiving increasing attention from society. In life and work, exchanges between deaf-mute people and hearing people have become more and more frequent. With the spread of modern education, sign language has become a common way for deaf-mute people to communicate with each other and with hearing people. However, sign language requires special training and study; apart from those with a special need, relatively few hearing people master sign language, which creates a communication barrier between hearing people and deaf-mute people.
At present, when deaf-mute people communicate with each other or with hearing people remotely, they generally rely on dedicated equipment or systems, such as terminals that convert between sign language, text, and speech, or a third party who interprets between sign language and speech. These solutions address face-to-face communication between deaf-mute people, or between deaf-mute people and hearing people.
As shown in FIG. 1, FIG. 1 is a schematic diagram of remote communication between deaf-mute people. Deaf-mute person A expresses in sign language; video communication terminal A (such as a video phone, a video conference terminal, or a desktop soft terminal) captures the sign language image and transmits it over the communication network to video communication terminal B. Deaf-mute person B sees the sign language image of deaf-mute person A presented on terminal B and understands what the other party expressed. The reverse holds as well, completing the communication process.
As shown in FIG. 2, FIG. 2 is a schematic diagram of remote communication between a deaf-mute person and a hearing person. The deaf-mute person's sign language image is captured by video communication terminal A and distributed by the multipoint control unit to video communication terminals B and C, where the hearing person and the interpreter are located. After the interpreter understands the image, the interpreter's spoken translation is picked up by the video communication terminal and distributed through the multipoint control unit to terminal B, so the hearing person understands the deaf-mute person's content through the interpreter's voice.
The hearing person's voice is picked up by video communication terminal B and distributed through the multipoint control unit to video communication terminal C, where the interpreter translates it into sign language. The sign language image is captured by the video communication terminal and distributed to the deaf-mute person through the multipoint control unit, so the deaf-mute person understands the hearing person's content through the interpreter's sign language image.
As communication between deaf-mute people and hearing people has gradually grown, the prior art has exposed the following drawbacks: 1) when deaf-mute people communicate with hearing people, every conference requires a third-party interpreter, which increases the labor cost of communication; 2) in a multi-party conference, if several deaf-mute people sign or several hearing people speak at the same time, the interpreter cannot handle the situation well enough to clearly convey the content of each speaker. The prior art therefore has limitations and does not solve the problems faced when deaf-mute people participate in multi-party conference exchanges.

Summary of the Invention

The object of the present invention is to solve the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference, by providing a method, device, and system for processing video and audio information in a video conference.
In a first aspect, an embodiment of the present invention provides a method for processing video and audio information in a video conference, where the video conference includes at least two conference sites and each conference site includes at least one conference terminal. The method includes:

receiving data code streams sent by at least two conference terminals, and decoding the data code streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis processing on the converted voice information to generate synthesized voice information; mixing the generated synthesized voice information with other decoded audio information; and sending the mixed audio information to the at least two conference sites.
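The claimed steps can be sketched end to end. The recognizer, synthesizer, and mixer below are trivial string stand-ins (real sign-language recognition and text-to-speech are outside this sketch), so every name and tag here is an assumption.

```python
# Minimal pipeline sketch of the claimed method: decode -> detect sign
# language -> convert to voice -> synthesize -> mix with other audio.

def process_streams(decoded):
    """decoded: list of {"kind": "sign_language" | "audio", "payload": ...}
    dicts representing the at least two channels of decoded information."""
    converted = [f"voice({d['payload']})" for d in decoded
                 if d["kind"] == "sign_language"]   # sign language -> voice
    synthesized = [f"tts({v})" for v in converted]  # speech synthesis step
    other_audio = [d["payload"] for d in decoded
                   if d["kind"] == "audio"]         # other decoded audio
    return synthesized + other_audio                # mixed output to send

print(process_streams([{"kind": "sign_language", "payload": "g1"},
                       {"kind": "audio", "payload": "a1"}]))
```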
In a first possible implementation, before the receiving data code streams sent by at least two conference terminals and decoding the data code streams to obtain at least two channels of decoded information, the method further includes:

receiving user attribute information input by a user or sent by the conference terminal;

where the performing speech synthesis processing on the converted voice information to generate synthesized voice information is specifically:

performing speech synthesis processing on the converted voice information according to the user attribute information, to generate synthesized voice information that matches the user attribute information.
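Matching the synthesized voice to the user attribute information might look like the sketch below; the attribute keys and voice profile values are invented for illustration, and a real MCU would feed these parameters into a speech synthesis engine.

```python
# Hypothetical mapping from user attribute information (e.g. gender and
# age group reported by the conference terminal) to synthesis parameters.

VOICE_PROFILES = {
    ("female", "adult"): {"pitch_hz": 210, "rate_wpm": 150},
    ("male", "adult"):   {"pitch_hz": 120, "rate_wpm": 150},
    ("female", "child"): {"pitch_hz": 300, "rate_wpm": 140},
}

def synthesis_params(user_attributes):
    """Pick voice parameters that match the user attribute information,
    falling back to a default profile when attributes are missing."""
    key = (user_attributes.get("gender", "female"),
           user_attributes.get("age_group", "adult"))
    return VOICE_PROFILES.get(key, VOICE_PROFILES[("female", "adult")])

print(synthesis_params({"gender": "male", "age_group": "adult"}))
```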
In a second possible implementation, the performing speech synthesis processing on the converted voice information to generate synthesized voice information is specifically:

determining whether the number of conference sites whose conference terminals send the sign language information exceeds a first threshold;

if the number does not exceed the first threshold, performing speech synthesis processing on the converted voice information to generate synthesized voice information;

if the number exceeds the first threshold, performing speech synthesis processing only on the converted voice information within the first threshold, to generate synthesized voice information.
In a third possible implementation, before the converting the sign language information into voice information, the method further includes: recording a first time value of each piece of sign language information obtained after decoding;

where the mixing the generated synthesized voice information with other decoded audio information is specifically:

sorting the first time values from large to small;

according to the ordering of the first time values, applying gain amplification to the generated synthesized voice information according to preset gain coefficients;

calculating energy values of the other decoded audio information, sorting them from large to small, and applying gain amplification to the gain coefficients of the other decoded audio information;

mixing the gain-processed synthesized voice information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
With reference to the first aspect or the third possible implementation of the first aspect, in a fourth possible implementation, the method further includes:

converting the other decoded audio information participating in the mixing into sign language information;

scaling the converted sign language information according to a preset ratio, based on the ordering of the energy values of the other decoded audio information;

superimposing the scaled sign language information onto the current image in the conference terminal for display in at least two conference sites.

In a fifth possible implementation, the method further includes: converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal for display in at least two conference sites.
In a second aspect, an embodiment of the present invention provides a device for processing video and audio information in a video conference, where the device includes:

a decoding unit, configured to receive data code streams sent by at least two conference terminals and decode the data code streams to obtain at least two channels of decoded information;

a conversion synthesizing unit, configured to: when it is determined that sign language information exists in the at least two channels of decoded information in the decoding unit, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information;

a mixing processing unit, configured to mix the synthesized voice information generated by the conversion synthesizing unit with other decoded audio information;

a sending unit, configured to send the audio information mixed by the mixing processing unit to the at least two conference sites.
In a first possible implementation, the decoding unit is further configured to receive user attribute information input by a user or sent by the conference terminal;

the conversion synthesizing unit is specifically configured to perform speech synthesis processing on the converted voice information according to the user attribute information, to generate synthesized voice information that matches the user attribute information.
In a second possible implementation, the device further includes:

a determining unit, configured to determine whether the number of conference sites whose conference terminals send the sign language information exceeds a first threshold, and send the determination result to the conversion synthesizing unit;

the conversion synthesizing unit is specifically configured to: when receiving a determination result that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted voice information to generate synthesized voice information;

the conversion synthesizing unit is specifically configured to: when receiving a determination result that the number of conference sites exceeds the first threshold, perform speech synthesis processing only on the converted voice information within the first threshold, to generate synthesized voice information.
In a third possible implementation, the conversion synthesizing unit is further configured to record a first time value of each piece of sign language information obtained after decoding;

the mixing processing unit is specifically configured to: sort the first time values from large to small; according to the ordering of the first time values, apply gain amplification to the generated synthesized voice information according to preset gain coefficients;

calculate energy values of the other decoded audio information, sort them from large to small, and apply gain amplification to the gain coefficients of the other decoded audio information;

mix the gain-processed synthesized voice information with the other decoded audio information, and send the mixed audio information to the at least two conference sites.
With reference to the second aspect or the third possible implementation of the second aspect, in a fourth possible implementation, the device further includes:

a sign language conversion unit, configured to convert the other decoded audio information participating in the mixing into sign language information;

a scaling processing unit, configured to scale the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;

a superimposing unit, configured to superimpose the scaled sign language information onto the current image in the conference terminal for display in at least two conference sites.

In a fifth possible implementation, the device further includes: a text conversion unit, configured to convert the sign language information into text information, and superimpose the text information onto the current image in the conference terminal for display in at least two conference sites.
In a third aspect, an embodiment of the present invention provides a system for processing video and audio information in a video conference, where the system includes: at least two conference sites, each conference site including at least one conference terminal, and the device for processing video and audio information in a video conference according to any one of claims 8 to 14.
Therefore, by applying the method, device, and system for processing video and audio information in a video conference provided by the embodiments of the present invention, the multipoint control unit decodes the data code streams sent by the conference terminals; when decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information; it mixes the generated synthesized voice information with other decoded audio information; and it sends the mixed audio information to at least two conference sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of remote communication between deaf-mute people in the prior art;

FIG. 2 is a schematic diagram of remote communication between a deaf-mute person and a hearing person in the prior art;

FIG. 3 is a flowchart of a method for processing video and audio information in a video conference according to Embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of image superimposition according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of image superimposition according to an embodiment of the present invention;

FIG. 7 is a structural diagram of a device for processing video and audio information in a video conference according to Embodiment 2 of the present invention;

FIG. 8 is a structural diagram of a device for processing video and audio information in a video conference according to Embodiment 3 of the present invention.

Detailed Description of the Embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The following describes the information processing method provided by the embodiments of the present invention with reference to FIG. 3. FIG. 3 is a flowchart of a method for processing video and audio information in a video conference according to Embodiment 1 of the present invention. In this embodiment of the present invention, the executing entity is a multipoint control server; the following description uses a multipoint control unit (MCU) as an example. As shown in FIG. 3, this embodiment includes the following steps:

Step 310: Receive data code streams sent by at least two conference terminals, and decode the data code streams to obtain at least two channels of decoded information.

Specifically, in a multi-party conference, as shown in FIG. 4 (a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention), each conference terminal resides in a conference site of the multi-party conference, and the video conference includes at least two conference sites. (As illustrated, FIG. 4 includes four conference sites, each with one conference terminal; it can be understood that practical applications are not limited to four sites.) Each conference site includes at least one conference terminal. The conference terminal collects and outputs audio and video information in the conference site and picks up original user information, which is specifically the user's sign language information, voice information, and so on. The conference terminal encodes the original user information to form a data code stream and sends the data code stream to the MCU; the MCU receives the data code streams sent by the conference terminals.
在本发明实施例中, 所述会议终端是指具有采集图像、 拾取声音、 接受外 部输入功能的设备, 并负责将获取的视频图像发送给显示器进行显示, 以及将 收到的音频信息发送给扬声器进行播放, 例如, 视讯终端。  In the embodiment of the present invention, the conference terminal refers to a device that has the functions of acquiring an image, picking up a sound, and accepting an external input function, and is responsible for transmitting the acquired video image to a display for display, and transmitting the received audio information to the speaker. Play, for example, a video terminal.
MCU接收到数据码流后, 对数据码流进行解码, 得到至少两路解码信息, 所述解码信息包括会议终端采集的手语信息, 拾取的音频信息等。  After receiving the data code stream, the MCU decodes the data code stream to obtain at least two pieces of decoding information, where the decoding information includes sign language information collected by the conference terminal, the picked up audio information, and the like.
Step 320: When it is determined that sign language information exists in the at least two channels of decoded information, convert the sign language information into speech information, perform speech synthesis processing on the converted speech information, and generate synthesized speech information.
Specifically, after decoding, when the MCU determines that sign language information exists in the at least two channels of decoded information, the MCU converts the sign language information into speech information.
The MCU determines whether sign language information exists in the at least two channels of decoded information as follows: after decoding, the MCU restores the decoded data. When the decoded data can be restored to sign language information, the MCU converts the sign language information into speech information; when the decoded data can be restored to audio information, the MCU performs mixing processing on the audio information, or converts the audio information into sign language information.
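For illustration only (this sketch is not part of the original disclosure), the dispatch rule above can be expressed as follows; the dictionary layout and the returned tags are assumptions:

```python
def route_decoded(info):
    """Decide how one channel of decoded information is handled."""
    if info["kind"] == "sign_language":
        # restored to sign language: convert to speech, then synthesize
        return ("convert_to_speech", info["payload"])
    if info["kind"] == "audio":
        # restored to audio: mix it, and/or convert it to sign language
        return ("mix_or_convert_to_sign", info["payload"])
    raise ValueError("decoded data could not be restored to a known type")
```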
Further, the sign language information is the gestures made by any deaf-mute user and captured by a conference terminal. When a deaf-mute user in a site needs to express an opinion, the user signs facing the conference terminal; the terminal collects the sign language information, encodes it, and sends it to the MCU. After decoding and obtaining the sign language information, the MCU converts it into speech information.
After converting the sign language information into speech information, the MCU performs speech synthesis on the converted speech information to generate synthesized speech information.
The MCU performs speech synthesis processing on the converted speech information to generate the synthesized speech information as follows:
The MCU determines whether the number of sites whose conference terminals sent sign language information exceeds a first threshold. If the number does not exceed the first threshold, the MCU performs speech synthesis processing on the converted speech information to generate synthesized speech information; if it does exceed the first threshold, the MCU performs speech synthesis processing only on converted speech information within the first threshold to generate synthesized speech information.
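For illustration only (not part of the original disclosure), the first-threshold check can be sketched as follows. The value 4 follows the four-party mixing described as the typical maximum; the stream representation and the use of the recorded first moment value for ordering are assumptions:

```python
FIRST_THRESHOLD = 4  # maximum number of sites whose audio the MCU can mix

def select_for_synthesis(sign_streams):
    """Return the converted speech streams to synthesize, capped at the
    first threshold, kept in order of their recorded first moment values
    (earliest first)."""
    ordered = sorted(sign_streams, key=lambda s: s["first_moment"])
    if len(ordered) <= FIRST_THRESHOLD:
        return ordered
    return ordered[:FIRST_THRESHOLD]
```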
In the embodiments of the present invention, the first threshold is the maximum number of sites whose audio the MCU can mix; typically the maximum is four-party mixing.
Before converting the sign language information into speech information, the MCU also records a first moment value for each piece of sign language information obtained by decoding. The recorded first moment values are used in the subsequent mixing processing to select the speech information that participates in the mixing.
Step 330: Perform mixing processing on the generated synthesized speech information together with the other decoded audio information.
Specifically, the MCU mixes the generated synthesized speech information with the other decoded audio information, so that every user in the multi-party conference receives a signal of satisfactory speech quality.
The MCU mixes the generated synthesized speech information with the other decoded audio information as follows:
The MCU sorts the first moment values recorded in step 320 in descending order and, according to that order, applies gain amplification to the generated synthesized speech information using preset gain coefficients. The MCU also computes the energy values of the other decoded audio information, sorts them in descending order, and applies gain amplification to the other decoded audio information accordingly. The MCU then performs mixing processing on the gain-processed synthesized speech information and the other decoded audio information.
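A minimal sketch of the gain-and-mix step, for illustration only: the gain tables, the stream layout, and the energy definition are assumptions; only the rank-based gain idea and the participation cap come from the text.

```python
def energy(samples):
    """Energy of a stream: sum of squared sample values."""
    return sum(x * x for x in samples)

def mix(synth_streams, audio_streams,
        synth_gains=(1.0, 0.8, 0.6, 0.4),
        audio_gains=(0.8, 0.6, 0.4, 0.2),
        first_threshold=4):
    """synth_streams: (first_moment_value, samples) tuples;
    audio_streams: plain sample lists."""
    # synthesized speech: earlier first moment value => larger gain
    ranked_synth = sorted(synth_streams, key=lambda s: s[0])
    scaled = [[x * g for x in s] for (_, s), g in zip(ranked_synth, synth_gains)]
    # other decoded audio: larger energy value => larger gain
    ranked_audio = sorted(audio_streams, key=energy, reverse=True)
    scaled += [[x * g for x in s] for s, g in zip(ranked_audio, audio_gains)]
    # no more than the first threshold of streams participate in the mix;
    # synthesized speech is listed first, so it is kept preferentially
    scaled = scaled[:first_threshold]
    # sample-wise sum of the gain-scaled streams
    return [sum(col) for col in zip(*scaled)]
```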
In the embodiments of the present invention, the number of pieces of audio information participating in the mixing does not exceed the first threshold. Further, when the decoded information includes no audio information, the MCU performs mixing processing on the multiple pieces of generated synthesized speech information, and the number of pieces participating in the mixing does not exceed the first threshold.
Still further, in the embodiments of the present invention, when the decoded information includes both sign language information and audio information, both deaf-mute users and hearing users are expressing themselves. The MCU then preferentially converts the sign language information into speech information and generates synthesized speech information; while the first threshold is not exceeded, the synthesized speech information is mixed with the other decoded audio information. When the synthesized speech information alone reaches the first threshold, only the synthesized speech information is mixed and the other decoded audio information is discarded, to ensure that the expression of deaf-mute users is processed with priority and to solve the communication problem between deaf-mute users and hearing users.
Step 340: Send the mixed audio information to the at least two sites.
Specifically, after mixing the generated synthesized speech information with the other decoded audio information, the MCU sends the mixed audio information to at least two sites, including sites that sent data code streams and sites that did not.
Therefore, by applying the method, apparatus, and system for processing video and audio information in a video conference provided by the embodiments of the present invention, the multipoint control unit decodes the data code streams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into speech information and processes the converted speech information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to at least two sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
Optionally, before step 310, the embodiments of the present invention further include a step in which the MCU receives user attribute information input by a user or sent by a conference terminal. By receiving the user attribute information, the MCU can generate, during speech synthesis processing, synthesized speech information that matches the user attribute information, so that listeners in the sites perceive the speech as natural, enhancing the sense of presence in the conversation.
The MCU receives user attribute information input by a user or sent by the conference terminal.
Specifically, before the multi-party conference starts, a deaf-mute user may input his or her attribute information into the MCU; the user attribute information includes gender, age, nationality, and the like. Alternatively, the user may input the attribute information into the conference terminal of his or her site, and the terminal forwards it to the MCU.
The MCU performs speech synthesis processing on the converted speech information according to the user attribute information, and generates synthesized speech information that matches the user attribute information.
Specifically, after converting the sign language information into speech information, the MCU obtains the user attribute information corresponding to the sign language information and, according to that information, performs speech synthesis processing on the converted speech information to generate matching synthesized speech information. For example, if the sign language information collected by a conference terminal comes from a middle-aged Chinese man, the MCU, after converting it into speech information, obtains the corresponding user attribute information and synthesizes speech that matches it. The MCU also adjusts the speaking rate and pitch of the synthesized speech according to the pace of the signing, so that hearing users in the other sites perceive the speech as natural, enhancing the sense of presence in the conversation.
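For illustration only (not part of the original disclosure), choosing synthesis parameters from the user attributes and the signing pace might look as follows; every field name, voice label, and the pace-to-rate mapping are assumptions:

```python
def synthesis_params(attrs, signs_per_second):
    """Pick an illustrative voice and speaking rate for the synthesizer."""
    voice = (attrs.get("nationality", "CN"),
             attrs.get("gender", "male"),
             "senior" if attrs.get("age", 40) >= 60 else "adult")
    # faster signing => faster synthesized speech, clamped to a sane range
    rate = max(0.5, min(2.0, signs_per_second / 2.0))
    return {"voice": voice, "rate": rate}
```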
Optionally, after step 340, the embodiments of the present invention further include steps of converting the other decoded audio information participating in the mixing into sign language information and processing the converted sign language information. By performing these steps, the MCU allows deaf-mute users in the sites to understand what hearing users are expressing while expressing their own opinions, so that deaf-mute users and hearing users can better communicate freely and effectively.
The MCU converts the other decoded audio information participating in the mixing processing into sign language information.
Specifically, the MCU converts only the other decoded audio information that participates in the mixing into sign language information; other decoded audio information that does not participate in the mixing is not converted.
The MCU scales the converted sign language information by preset ratios according to the computed energy values of the other decoded audio information.
Specifically, the MCU scales the converted sign language information by preset ratios according to the energy-value ordering obtained in step 330. Further, the image corresponding to converted sign language information with a large energy value is enlarged, and the image corresponding to converted sign language information with a small energy value is reduced.
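The energy-driven scaling rule can be sketched as follows, for illustration only: the highest-energy source is enlarged and the lowest-energy source is shrunk, with the concrete ratios (1.5x and 0.5x) being assumptions not taken from the text.

```python
def display_scales(energies, enlarge=1.5, shrink=0.5):
    """Assign a display scale to each converted sign-language image."""
    scales = [1.0] * len(energies)
    if len(energies) >= 2:
        order = sorted(range(len(energies)), key=energies.__getitem__)
        scales[order[-1]] = enlarge  # largest energy value: enlarge
        scales[order[0]] = shrink    # smallest energy value: shrink
    return scales
```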
Using an overlay mode or a multi-picture mode, the MCU superimposes the scaled sign language information on the current image in the conference terminal for display in at least two sites.
Specifically, using the overlay mode or the multi-picture mode, the MCU superimposes the scaled sign language information on the current image in the conference terminal for display in at least two sites.
The overlay mode superimposes multiple pieces of sign language information directly on the current image in the conference terminal and presents them on the terminal's display; this mode occludes part of the current image, as shown in FIG. 5. The multi-picture mode presents multiple pieces of sign language information together with the current image on the terminal's display; this mode does not occlude the current image, as shown in FIG. 6.
For example, in the overlay mode shown in FIG. 5, the image of each site is obtained by converting audio information. Because the audio information corresponding to site 2 has the largest energy value, its image is enlarged after conversion into sign language information; because the audio information corresponding to sites 3 and 4 has the smallest energy values, their images are reduced after conversion. The multi-picture mode of FIG. 6 works in the same way and is not described again here.
By enlarging or reducing the converted sign language information as described above, a deaf-mute user in a site can selectively watch what hearing users in multiple sites are expressing at the same time and, guided by the audio energy, focus on the most active speaker, so that deaf-mute users and hearing users can better communicate freely and effectively.
Preferably, after step 340, the embodiments of the present invention further include a step of converting the sign language information into text information. This step assists communication in the multi-party conference; in particular, when the sign language information is not played on the main screen of a site, the text information can support the communication.
The MCU converts the sign language information into text information and superimposes the text information on the current image in the conference terminal for display in at least two sites.
Specifically, the MCU converts the sign language information into text information and then superimposes the text information, in the form of subtitles, on the current image in the conference terminal. When a deaf-mute user's sign language information is not played on the main screen of a site, the text information can assist the communication of the conference.
To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The following describes the information processing method provided by the embodiments of the present invention using a specific example. In a multi-party conference, deaf-mute users at sites 1 and 2 express themselves in sign language, and hearing users at sites 3 and 4 express themselves by voice. The conference terminals of sites 1 and 2 collect the deaf-mute users' sign language information, encode it, and send it to the MCU; the terminals of sites 3 and 4 pick up the hearing users' voices, encode them, and send them to the MCU. After decoding, the MCU obtains first sign language information from site 1, second sign language information from site 2, first audio information from site 3, and second audio information from site 4. The MCU records the first moment value of each piece of sign language information and converts the sign language information into speech information: the first sign language information into first speech information and the second sign language information into second speech information.
The MCU determines that the number of sites sending sign language information is 2 and compares it with the first threshold, preset to 4, meaning the MCU can mix at most four sites. Because the number of sites sending sign language information does not exceed the first threshold, the MCU performs speech synthesis processing on the first and second speech information according to the previously received user attribute information, generating first and second synthesized speech information that match the user attribute information.
The MCU sorts the first moment values in descending order. In this embodiment the sign language information of site 1 was sent before that of site 2, so the MCU applies the preset gain coefficients to the first and second synthesized speech information accordingly, for example a gain coefficient of 1 for the first synthesized speech information and 0.8 for the second. The MCU also computes the energy values of the first and second audio information and sorts them in descending order; in this embodiment the energy value of the first audio information is greater than that of the second, so the MCU amplifies the first and second audio information accordingly, for example with a gain of 0.8 for the first audio information and 0.6 for the second. The MCU then uses the mixing mode to mix the gain-processed synthesized speech information and audio information.
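Applying the gain coefficients of this example numerically, for illustration only: the coefficients 1, 0.8, 0.8, and 0.6 come from the text, while the sample value 0.5 is invented.

```python
# gain coefficient per stream, as in the worked example
gains = {
    "site1_synthesized": 1.0,  # earliest first moment value
    "site2_synthesized": 0.8,
    "site3_audio": 0.8,        # larger energy value
    "site4_audio": 0.6,
}
sample = 0.5  # one invented sample value, identical across streams
mixed = sum(g * sample for g in gains.values())
print(round(mixed, 6))  # 1.6
```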
After completing the mixing processing, the MCU sends the mixed audio information to at least two sites. Meanwhile, the MCU converts the first and second audio information participating in the mixing into converted first sign language information and converted second sign language information, and scales the converted sign language information by preset ratios according to the energy-value ordering of the first and second audio information: in this embodiment the energy value of the first audio information is larger, so the converted first sign language information is enlarged and the converted second sign language information is reduced. Using the overlay mode or the multi-picture mode, the MCU superimposes the scaled first and second sign language information on the current image in the conference terminal for display in at least two sites.
Further, the MCU may also convert the first and second sign language information into first and second text information and superimpose them, in the form of subtitles, on the current image in the conference terminal. When the deaf-mute users' first and second sign language information is not played on the main screen of a site, the first and second text information can assist the communication of the conference.
Correspondingly, Embodiment 2 of the present invention further provides an apparatus for processing video and audio information in a video conference, configured to implement the processing method of Embodiment 1. As shown in FIG. 7, the information processing apparatus includes a decoding unit 710, a conversion and synthesis unit 720, a mixing processing unit 730, and a sending unit 740.
The decoding unit 710 is configured to receive data code streams sent by at least two conference terminals and decode the data code streams to obtain at least two channels of decoded information. The conversion and synthesis unit 720 is configured to, when sign language information exists in the at least two channels of decoded information in the decoding unit, convert the sign language information into speech information, perform speech synthesis processing on the converted speech information, and generate synthesized speech information.
The mixing processing unit 730 is configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information.
The sending unit 740 is configured to send the audio information mixed by the mixing processing unit to the at least two sites.
The decoding unit 710 is further configured to receive user attribute information input by a user or sent by the conference terminal.
The conversion and synthesis unit 720 is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, and generate synthesized speech information that matches the user attribute information.
The apparatus further includes a judging unit 750, configured to judge whether the number of sites whose conference terminals sent the sign language information exceeds the first threshold, and to send the judgment result to the conversion and synthesis unit.
The conversion and synthesis unit 720 is specifically configured to, upon receiving from the judging unit a judgment result that the number of sites does not exceed the first threshold, perform speech synthesis processing on the converted speech information to generate synthesized speech information.
The conversion and synthesis unit 720 is specifically configured to, upon receiving from the judging unit a judgment result that the number of sites exceeds the first threshold, perform speech synthesis processing only on converted speech information within the first threshold to generate synthesized speech information.
The conversion and synthesis unit 720 is further configured to record the first moment value of each piece of sign language information obtained by decoding.
The mixing processing unit 730 is specifically configured to: sort the first moment values in descending order; according to that order, apply gain amplification to the generated synthesized speech information using preset gain coefficients;
compute the energy values of the other decoded audio information, sort them in descending order, and amplify the other decoded audio information by the corresponding gain coefficients; and
mix the gain-processed generated synthesized speech information with the other decoded audio information, and send the mixed audio information to the at least two sites.
The apparatus further includes a sign language conversion unit 760, configured to convert the other decoded audio information participating in the mixing processing into sign language information, and a scaling processing unit 770, configured to scale the converted sign language information by preset ratios according to the computed energy values of the other decoded audio information.
The apparatus further includes a superimposing unit 780, configured to superimpose the scaled sign language information on the current image in the conference terminal for display in at least two sites.
The apparatus further includes a text conversion unit 790, configured to convert the sign language information into text information and superimpose the text information on the current image in the conference terminal for display in at least two sites.
By applying the apparatus for processing video and audio information in a video conference provided by the embodiments of the present invention, the processing apparatus decodes the data code streams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into speech information, processes the converted speech information to generate synthesized speech information, mixes the generated synthesized speech information with the other decoded audio information, and sends the mixed audio information to at least two sites, solving the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
另外, 本发明实施例二提供的视频会议中的视音频信息的处理装置还可以 采用实现方式如下, 用以实现本发明实施例一中的视频会议中的视音频信息的 处理方法, 如图 8所示, 所述信息的处理装置包括: 网络接口 810、 处理器 820 和存储器 830。系统总线 840用于连接网络接口 810、处理器 820和存储器 830。  In addition, the apparatus for processing video and audio information in the video conference provided by the second embodiment of the present invention may be implemented as follows to implement the method for processing video and audio information in the video conference in the first embodiment of the present invention, as shown in FIG. 8. As shown, the processing device of the information includes: a network interface 810, a processor 820, and a memory 830. System bus 840 is used to connect network interface 810, processor 820, and memory 830.
The network interface 810 is configured to communicate with the conference terminals.
The memory 830 may be persistent storage, such as a hard disk drive or flash memory, and holds software modules and device drivers. The software modules can execute the functional modules of the method of the present invention described above; the device drivers may be network and interface drivers.
At startup, these software components are loaded into the memory 830 and then accessed by the processor 820, which executes the following instructions:
receiving data streams sent by at least two conference terminals and decoding the data streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into speech information and performing speech synthesis processing on the converted speech information to generate synthesized speech information; mixing the generated synthesized speech information with the other decoded audio information; and sending the mixed audio information to the at least two conference sites.
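The embodiment does not prescribe concrete recognition or synthesis algorithms. As a rough illustration only, the overall flow could look like the following Python sketch, where `recognize_sign_language` and `text_to_speech` are hypothetical placeholders standing in for real recognizer and TTS components:

```python
# Illustrative sketch of the decode -> convert -> synthesize -> mix flow.
# The stream format, recognizer, and TTS engine are placeholders, not
# components specified by the patent.
def process_streams(decoded_streams):
    """decoded_streams: list of dicts like {"type": "sign"|"audio", "data": ...}."""
    synthesized = []
    audio = []
    for stream in decoded_streams:
        if stream["type"] == "sign":
            text = recognize_sign_language(stream["data"])  # placeholder recognizer
            synthesized.append(text_to_speech(text))        # placeholder TTS
        else:
            audio.append(stream["data"])
    return mix(synthesized + audio)  # mixed audio, sent to every conference site

def recognize_sign_language(frames):
    return "hello"  # stand-in for a real sign-language recognizer

def text_to_speech(text):
    return [0.5] * len(text)  # stand-in waveform, one sample per character

def mix(tracks):
    # sample-wise average over zero-padded tracks (simplified mixing)
    length = max(len(t) for t in tracks)
    padded = [t + [0.0] * (length - len(t)) for t in tracks]
    return [sum(samples) / len(tracks) for samples in zip(*padded)]
```

The per-rank gain weighting described later in this embodiment is omitted here for brevity; the sketch only shows the routing of sign-language streams through conversion before mixing.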
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process before executing the instruction of receiving the data streams sent by the at least two conference terminals and decoding the data streams to obtain at least two channels of decoded information:

receiving user attribute information input by a user or sent by the conference terminal;

the performing of speech synthesis processing on the converted speech information to generate synthesized speech information is specifically:

performing speech synthesis processing on the converted speech information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.
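The text does not define which attributes are used or how they select a voice. A minimal sketch of such attribute-to-voice matching, with illustrative attribute fields and voice names that are not from the patent, might be:

```python
# Hypothetical mapping from user attribute information to a synthesis voice.
# The attribute keys ("gender", "age") and voice identifiers are assumptions
# for illustration only.
def pick_voice(attrs):
    """attrs: e.g. {"gender": "female", "age": 30}; returns a voice identifier."""
    gender = attrs.get("gender", "neutral")
    age = attrs.get("age", 35)
    if gender == "female":
        return "female_young" if age < 40 else "female_mature"
    if gender == "male":
        return "male_young" if age < 40 else "male_mature"
    return "neutral"
```

The selected identifier would then be passed to whatever TTS engine the implementation uses, so the synthesized speech sounds like the signing participant rather than a generic voice.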
Further, after accessing the software components of the memory 830, the processor executes the following specific instructions for the process of performing speech synthesis processing on the converted speech information to generate synthesized speech information:

determining whether the number of conference sites at which the conference terminals sending the sign language information are located exceeds a first threshold;

if the first threshold is not exceeded, performing speech synthesis processing on the converted speech information to generate synthesized speech information;

if the first threshold is exceeded, performing speech synthesis processing on the converted speech information not exceeding the first threshold to generate synthesized speech information.
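On one reading of this threshold rule, when more signing sites are present than the first threshold allows, only the segments from the first sites up to the threshold are synthesized. A sketch under that assumption:

```python
# Assumed interpretation: cap the number of sign-language-derived speech
# segments that enter synthesis at the first threshold.
def select_for_synthesis(converted_segments, first_threshold):
    """converted_segments: speech segments converted from sign language,
    one per signing site, in order. Returns the segments to synthesize."""
    if len(converted_segments) <= first_threshold:
        return converted_segments
    return converted_segments[:first_threshold]
```

This keeps the mixed output intelligible by bounding how many synthesized voices can be active at once.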
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process before executing the instruction of converting the sign language information into speech information:

recording a first time value of each piece of sign language information obtained after decoding;

the processor executes the following specific instructions for the process of mixing the generated synthesized speech information with the other decoded audio information:

sorting the first time values from largest to smallest;

amplifying the generated synthesized speech information by preset gain coefficients according to the order of the first time values;

calculating energy values of the other decoded audio information, sorting the energy values from largest to smallest, and amplifying the gain coefficients of the other decoded audio information;

mixing the gain-processed generated synthesized speech information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
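The exact gain coefficients are left as presets by the text. A sketch of the rank-based gain scheme, assuming a hypothetical gain table indexed by rank (largest time value or largest energy gets the first, largest gain):

```python
# Rank-based gain mixing, assuming a preset gain table `gains` indexed by
# rank; the coefficients themselves are illustrative, not from the patent.
def gain_by_rank(tracks, keys, gains):
    """Amplify each track by the preset gain for its rank when `keys`
    are sorted from largest to smallest."""
    order = sorted(range(len(tracks)), key=lambda i: keys[i], reverse=True)
    out = [None] * len(tracks)
    for rank, i in enumerate(order):
        g = gains[min(rank, len(gains) - 1)]  # reuse last gain for low ranks
        out[i] = [s * g for s in tracks[i]]
    return out

def energy(track):
    return sum(s * s for s in track)

def mix_with_gain(synth, first_times, audio, gains):
    """synth ranked by recorded first time value, other audio ranked by
    energy; both gain-amplified, then summed sample-wise."""
    ranked = (gain_by_rank(synth, first_times, gains)
              + gain_by_rank(audio, [energy(t) for t in audio], gains))
    length = max(len(t) for t in ranked)
    padded = [t + [0.0] * (length - len(t)) for t in ranked]
    return [sum(col) for col in zip(*padded)]
```

Ranking synthesized speech by arrival time and natural audio by energy gives earlier signers and louder speakers more prominence in the mix.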
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process:

converting the other decoded audio information participating in the mixing into sign language information;

scaling the converted sign language information by a preset ratio according to the order of the energy values of the other decoded audio information; and superimposing the scaled sign language information onto the current image in the conference terminal for display at the at least two conference sites.
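The preset scaling ratios are not specified. Assuming a hypothetical ratio table indexed by energy rank, the mapping from audio-stream energy to overlay size could be sketched as:

```python
# Assumed scheme: the sign-language rendering converted from each audio
# stream is displayed at a preset scale chosen by that stream's energy
# rank, so louder speakers get larger overlays. Ratios are illustrative.
def overlay_scales(energies, scales):
    """energies: one energy value per audio stream.
    scales: preset display ratios, largest first.
    Returns the scale assigned to each stream, in input order."""
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    out = [0.0] * len(energies)
    for rank, i in enumerate(order):
        out[i] = scales[min(rank, len(scales) - 1)]
    return out
```

Each scaled rendering would then be composited onto the current conference image before it is sent to the sites.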
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process:

converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal for display at the at least two conference sites.
By applying the apparatus for processing video and audio information in a video conference provided by this embodiment of the present invention, the processing apparatus decodes the data streams sent by the conference terminals; when the decoded data information is sign language information, it converts the sign language information into speech information and processes the converted speech information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to the at least two conference sites. This solves the prior-art problem that deaf-mute participants in a multi-party remote video conference cannot communicate freely and effectively.
A person skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further illustrate the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for processing video and audio information in a video conference, wherein the video conference includes at least two conference sites and each conference site includes at least one conference terminal, the method comprising: receiving data streams sent by at least two conference terminals and decoding the data streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into speech information and performing speech synthesis processing on the converted speech information to generate synthesized speech information; mixing the generated synthesized speech information with the other decoded audio information; and sending the mixed audio information to the at least two conference sites.
2. The method for processing video and audio information in a video conference according to claim 1, wherein before the receiving of the data streams sent by the at least two conference terminals and the decoding of the data streams to obtain at least two channels of decoded information, the method further comprises:

receiving user attribute information input by a user or sent by the conference terminal; and

the performing of speech synthesis processing on the converted speech information to generate synthesized speech information is specifically:

performing speech synthesis processing on the converted speech information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.
3. The method for processing video and audio information in a video conference according to claim 1, wherein the performing of speech synthesis processing on the converted speech information to generate synthesized speech information is specifically:

determining whether the number of conference sites at which the conference terminals sending the sign language information are located exceeds a first threshold;

if the first threshold is not exceeded, performing speech synthesis processing on the converted speech information to generate synthesized speech information;

if the first threshold is exceeded, performing speech synthesis processing on the converted speech information not exceeding the first threshold to generate synthesized speech information.
4. The method for processing video and audio information in a video conference according to claim 1, wherein before the converting of the sign language information into speech information, the method further comprises: recording a first time value of each piece of sign language information obtained after decoding; and

the mixing of the generated synthesized speech information with the other decoded audio information is specifically:

sorting the first time values from largest to smallest; amplifying the generated synthesized speech information by preset gain coefficients according to the order of the first time values;

calculating energy values of the other decoded audio information, sorting the energy values from largest to smallest, and amplifying the gain coefficients of the other decoded audio information;

mixing the gain-processed generated synthesized speech information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
5. The method for processing video and audio information in a video conference according to claim 4, wherein the method further comprises:

converting the other decoded audio information participating in the mixing into sign language information;

scaling the converted sign language information by a preset ratio according to the order of the energy values of the other decoded audio information; and

superimposing the scaled sign language information onto the current image in the conference terminal for display at at least two conference sites.
6. The method for processing video and audio information in a video conference according to claim 1, wherein the method further comprises:

converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal for display at at least two conference sites.
7. An apparatus for processing video and audio information in a video conference, wherein the apparatus comprises: a decoding unit, configured to receive data streams sent by at least two conference terminals and decode the data streams to obtain at least two channels of decoded information;

a conversion and synthesis unit, configured to, when it is determined that sign language information exists in the at least two channels of decoded information in the decoding unit, convert the sign language information into speech information and perform speech synthesis processing on the converted speech information to generate synthesized speech information;

a mixing processing unit, configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information; and

a sending unit, configured to send the audio information mixed by the mixing processing unit to the at least two conference sites.
8. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the decoding unit is further configured to receive user attribute information input by a user or sent by the conference terminal; and

the conversion and synthesis unit is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.
9. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the apparatus further comprises:

a determining unit, configured to determine whether the number of conference sites at which the conference terminals sending the sign language information are located exceeds a first threshold, and send the determination result to the conversion and synthesis unit;

the conversion and synthesis unit is specifically configured to, upon receiving from the determining unit a determination result that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted speech information to generate synthesized speech information; and

the conversion and synthesis unit is specifically configured to, upon receiving from the determining unit a determination result that the number of conference sites exceeds the first threshold, perform speech synthesis processing on the converted speech information not exceeding the first threshold to generate synthesized speech information.
10. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the conversion and synthesis unit is further configured to record a first time value of each piece of sign language information obtained after decoding; and

the mixing processing unit is specifically configured to:

sort the first time values from largest to smallest;

amplify the generated synthesized speech information by preset gain coefficients according to the order of the first time values;

calculate energy values of the other decoded audio information, sort the energy values from largest to smallest, and amplify the gain coefficients of the other decoded audio information; and

mix the gain-processed generated synthesized speech information with the other decoded audio information, and send the mixed audio information to the at least two conference sites.
11. The apparatus for processing video and audio information in a video conference according to claim 10, wherein the apparatus further comprises:

a sign language conversion unit, configured to convert the other decoded audio information participating in the mixing into sign language information;

a scaling processing unit, configured to scale the converted sign language information by a preset ratio according to the calculated energy values of the other decoded audio information; and

a superimposing unit, configured to superimpose the scaled sign language information onto the current image in the conference terminal for display at at least two conference sites.
12. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the apparatus further comprises:

a text conversion unit, configured to convert the sign language information into text information and superimpose the text information onto the current image in the conference terminal for display at at least two conference sites.
13. A video conference system, wherein the system comprises: at least two conference sites, each conference site including at least one conference terminal, and the apparatus for processing video and audio information in a video conference according to any one of claims 8 to 14.
PCT/CN2013/083170 2012-12-21 2013-09-10 Method, device and system for processing video/audio information in video conference WO2014094461A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210560387.7 2012-12-21
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, device and system for processing video and audio information in a video conference

Publications (1)

Publication Number Publication Date
WO2014094461A1 (en)

Family

ID=47858189

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/083170 WO2014094461A1 (en) 2012-12-21 2013-09-10 Method, device and system for processing video/audio information in video conference

Country Status (2)

Country Link
CN (1) CN102984496B (en)
WO (1) WO2014094461A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984496B (en) * 2012-12-21 2015-08-19 华为技术有限公司 Method, device and system for processing video and audio information in a video conference
CN103686059B (en) * 2013-09-23 2017-04-05 广东威创视讯科技股份有限公司 Distributed mixed audio processing method and system
CN107566863A (en) * 2016-06-30 2018-01-09 中兴通讯股份有限公司 A kind of exchange of information methods of exhibiting, device and equipment, set top box
CN108881783B (en) * 2017-05-09 2020-09-08 腾讯科技(深圳)有限公司 Method and device for realizing multi-person conversation, computer equipment and storage medium
US11064000B2 (en) * 2017-11-29 2021-07-13 Adobe Inc. Accessible audio switching for client devices in an online conference
CN110401623A (en) * 2018-04-25 2019-11-01 中国移动通信有限公司研究院 A kind of multiside calling method, platform, terminal, medium, equipment and system
CN110555329A (en) * 2018-05-31 2019-12-10 苏州欧力机器人有限公司 Sign language translation method, terminal and storage medium
CN110083250A (en) * 2019-05-14 2019-08-02 长沙手之声信息科技有限公司 A kind of accessible conference system for supporting sign language translation on line
KR102178175B1 (en) * 2019-12-09 2020-11-12 김경철 User device and method of controlling thereof
KR102178174B1 (en) * 2019-12-09 2020-11-12 김경철 User device, broadcasting device, broadcasting system and method of controlling thereof
CN113660449B (en) * 2021-10-20 2022-03-01 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device

Citations (6)

Publication number Priority date Publication date Assignee Title
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN101115088A (en) * 2007-08-07 2008-01-30 周运南 Mobile phone dedicated for deaf-mutes
CN101309390A (en) * 2007-05-17 2008-11-19 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
CN101594434A (en) * 2009-06-16 2009-12-02 中兴通讯股份有限公司 The sign language processing method and the sign language processing mobile terminal of portable terminal
CN102387338A (en) * 2010-09-03 2012-03-21 中兴通讯股份有限公司 Distributed type video processing method and video session system
CN102984496A (en) * 2012-12-21 2013-03-20 华为技术有限公司 Processing method, device and system of video and audio information in video conference

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN101179693B (en) * 2007-09-26 2011-02-02 深圳市迪威视讯股份有限公司 Mixed audio processing method of session television system
JP5284360B2 (en) * 2007-09-26 2013-09-11 フラウンホッファー−ゲゼルシャフト ツァ フェルダールング デァ アンゲヴァンテン フォアシュンク エー.ファオ Apparatus and method for extracting ambient signal in apparatus and method for obtaining weighting coefficient for extracting ambient signal, and computer program
CN101453611A (en) * 2007-12-07 2009-06-10 希姆通信息技术(上海)有限公司 Method for video communication between the deaf and the normal
KR20110100649A (en) * 2008-12-15 2011-09-14 코닌클리케 필립스 일렉트로닉스 엔.브이. Method and apparatus for synthesizing speech
US8447023B2 (en) * 2010-02-01 2013-05-21 Polycom, Inc. Automatic audio priority designation during conference

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN101309390A (en) * 2007-05-17 2008-11-19 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN101115088A (en) * 2007-08-07 2008-01-30 周运南 Mobile phone dedicated for deaf-mutes
CN101594434A (en) * 2009-06-16 2009-12-02 中兴通讯股份有限公司 The sign language processing method and the sign language processing mobile terminal of portable terminal
CN102387338A (en) * 2010-09-03 2012-03-21 中兴通讯股份有限公司 Distributed type video processing method and video session system
CN102984496A (en) * 2012-12-21 2013-03-20 华为技术有限公司 Processing method, device and system of video and audio information in video conference

Also Published As

Publication number Publication date
CN102984496B (en) 2015-08-19
CN102984496A (en) 2013-03-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 13865919; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 13865919; Country of ref document: EP; Kind code of ref document: A1)