WO2014094461A1 - Method, device and system for processing video/audio information in video conference - Google Patents


Info

Publication number
WO2014094461A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
conference
sign language
processing
video
Prior art date
Application number
PCT/CN2013/083170
Other languages
French (fr)
Chinese (zh)
Inventor
倪为 (Ni Wei)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2014094461A1 publication Critical patent/WO2014094461A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04N 7/15 Conference systems

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a method, device, and system for processing video and audio information in a video conference.
  • When deaf-mute people communicate with each other or with normal people remotely, they usually use special equipment or systems, such as terminals that implement sign language, text, and voice conversion, or have a third party perform sign language/voice translation, to solve the problem of face-to-face communication between deaf-mute people, or between deaf-mute people and normal people.
  • Figure 1 is a schematic diagram of remote communication between deaf and dumb people.
  • Deaf-mute person A performs sign language expression, and video communication terminal A (such as a video phone, video conference terminal, or desktop soft terminal) collects the sign language image and transmits it to video communication terminal B via the communication network; deaf-mute person B sees the sign language image of deaf-mute person A presented on video communication terminal B and understands the meaning the other party expresses. The reverse direction works the same way, completing the entire communication process.
  • Fig. 2 is a schematic diagram of remote communication between a deaf person and a normal person.
  • The sign language image of the deaf-mute person is collected by video communication terminal A and distributed through the multipoint control unit to the video communication terminals of the normal person and the translator.
  • The translator's translated voice is picked up by the translator's video communication terminal and distributed through the multipoint control unit to video communication terminal B, where the normal person is located; the normal person understands the content expressed by the deaf-mute person through the translator's voice.
  • Conversely, the normal person's voice is picked up by video communication terminal B and distributed to the translator through the multipoint control unit; the translator translates it into sign language, the sign language image is collected by the translator's video communication terminal and distributed through the multipoint control unit to the deaf-mute person, and the deaf-mute person understands the content expressed by the normal person through the translator's sign language image.
  • The object of the present invention is to solve the problem in the prior art that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference, by providing a method, device, and system for processing video and audio information in a video conference.
  • an embodiment of the present invention provides a method for processing video and audio information in a video conference, where the video conference includes at least two conference sites, and each conference site includes at least one conference terminal, and the method includes:
  • determining, when sign language information exists in the at least two pieces of decoded information, converting the sign language information into voice information, and performing speech synthesis processing on the converted voice information to generate synthesized speech information; mixing the generated synthesized speech information with other decoded audio information; and sending the mixed audio information to the at least two conference sites.
  • the receiving the data code stream sent by the at least two conference terminals, and decoding the data code stream to obtain at least two pieces of decoding information further includes:
  • Performing voice synthesis processing on the converted voice information to generate synthesized voice information is specifically:
  • the performing speech synthesis processing on the converted voice information to generate synthesized voice information is specifically:
  • the converted speech information that does not exceed the first threshold is subjected to speech synthesis processing to generate synthesized speech information.
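Concretely, the claimed flow (decode, convert detected sign language into synthesized speech, then mix within a site limit) can be sketched in Python. The item format, the function name, and the relabelling used to stand in for real sign-to-speech conversion are all illustrative assumptions, not the patented implementation:

```python
def process_conference(decoded_items, first_threshold=4):
    """Sketch of the claimed flow. Each decoded item is a dict with a
    "kind" of "sign" (sign-language video) or "audio" (picked-up speech);
    this item format and all names are illustrative assumptions."""
    synthesized = []
    audio = []
    for item in decoded_items:
        if item["kind"] == "sign":
            # Sign language -> voice -> synthesized speech, modelled here
            # as a simple relabelling of the payload.
            synthesized.append("speech:" + item["data"])
        else:
            audio.append(item["data"])
    # Synthesized speech is admitted to the mix first; other audio fills
    # any remaining slots up to the maximum mix size (the first threshold).
    return (synthesized + audio)[:first_threshold]
```

For example, one signing site and one speaking site yield a two-way mix, while five signing sites are capped at the threshold of four.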
  • the method before the converting the sign language information into the voice information, the method further includes: recording a first time value of each of the sign language information obtained after decoding;
  • the mixing and processing the generated synthesized speech information with other decoded audio information is:
  • the method further includes:
  • the method further includes: converting the sign language information into text information, and superimposing the text information with a current image in the conference terminal, for display in at least two conference sites.
  • an embodiment of the present invention provides a processing device for video and audio information in a video conference, where the device includes:
  • a decoding unit configured to receive a data code stream sent by at least two conference terminals, and decode the data code stream to obtain at least two pieces of decoded information;
  • a conversion synthesizing unit configured to: when sign language information exists in the at least two pieces of decoded information in the decoding unit, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information;
  • a mixing processing unit configured to perform mixing processing on the synthesized voice information generated by the conversion combining unit and other decoded audio information
  • a sending unit configured to send the audio information after the mixing in the mixing processing unit to the at least two conference sites.
  • the decoding unit is further configured to: receive user attribute information that is input by a user or sent by the conference terminal;
  • the conversion synthesizing unit is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, and generate synthesized speech information that matches the user attribute information.
  • the device further includes:
  • a determining unit configured to determine whether the number of sites where the conference terminal that sends the sign language information is located exceeds a first threshold, and sends the determination result to the conversion synthesizing unit;
  • the conversion synthesizing unit is configured to: when receiving a determination result indicating that the determining unit determines that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted voice information to generate synthesized voice information;
  • the conversion synthesizing unit is configured to: when receiving a determination result indicating that the determining unit determines that the number of conference sites exceeds the first threshold, perform speech synthesis processing on the converted voice information not exceeding the first threshold to generate synthesized voice information.
  • the conversion synthesizing unit is further configured to: record a first time value of each of the sign language information obtained after decoding;
  • the mixing processing unit is specifically configured to: sort the first time values from large to small; according to the ordering of the first time values, perform gain amplification on the synthesized voice information according to a preset gain coefficient;
  • the device further includes:
  • a sign language conversion unit configured to convert the other decoded audio information participating in the mixing process into sign language information
  • a scaling processing unit configured to perform scaling processing on the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;
  • a superimposing unit configured to superimpose the scaled processed sign language information and the current image in the conference terminal to be displayed in at least two conference sites.
  • the device further includes: a text conversion unit, configured to convert the sign language information into text information, and superimpose the text information with a current image in the conference terminal, for display in at least two conference sites.
  • an embodiment of the present invention provides a system for processing video and audio information in a video conference, where the system includes: at least two conference sites, each of the conference sites includes at least one conference terminal, and the processing device for video and audio information in a video conference according to any one of claims 8 to 14.
  • In the embodiments of the present invention, the multipoint control unit decodes the data stream sent by the conference terminal; when the decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information, mixes the generated synthesized voice information with other decoded audio information, and sends the mixed audio information to the at least two conference sites, thereby solving the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in multi-party remote video conferences.
  • FIG. 1 is a schematic diagram of remote communication between deaf-mute people in the prior art;
  • FIG. 2 is a schematic diagram of remote communication between a deaf person and a normal person in the prior art
  • FIG. 3 is a flow chart of a method for processing video and audio information in a video conference according to Embodiment 1 of the present invention
  • FIG. 4 is a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of image overlay according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of image overlay according to an embodiment of the present invention.
  • FIG. 7 is a structural diagram of an apparatus for processing video and audio information in a video conference according to Embodiment 2 of the present invention
  • FIG. 8 is a structural diagram of an apparatus for processing video and audio information in a video conference according to Embodiment 3 of the present invention.
  • FIG. 3 is a flowchart of a method for processing video and audio information in a video conference according to the embodiment of the present invention.
  • The following takes a multipoint control server, namely a multipoint control unit (Multipoint Control Unit, MCU for short), as an example. As shown in FIG. 3, this embodiment includes the following steps:
  • Step 310 Receive a data code stream sent by at least two conference terminals, and decode the data code stream to obtain at least two pieces of decoding information.
  • FIG. 4 is a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention, where a conference terminal exists in each conference site of the multi-party conference, and the video conference includes at least two conference sites (as shown in Figure 4, which includes four conference sites, each containing one conference terminal).
  • each conference site includes at least one conference terminal
  • The conference terminal is configured to collect and output audio and video information in the conference site and to pick up original user information, where the original user information is specifically the user's sign language information, voice information, etc. The conference terminal encodes the original user information to form a data stream and sends the data stream to the MCU, and the MCU receives the data stream sent by the conference terminal.
  • The conference terminal refers to a device that has the functions of acquiring images, picking up sound, and accepting external input, and is responsible for transmitting the acquired video image to a display for presentation and the received audio information to a speaker for playback; for example, a video terminal.
  • the MCU After receiving the data code stream, the MCU decodes the data code stream to obtain at least two pieces of decoding information, where the decoding information includes sign language information collected by the conference terminal, the picked up audio information, and the like.
  • Step 320: When it is determined that sign language information exists in the at least two pieces of decoded information, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information.
  • the MCU converts the sign language information into voice information.
  • The MCU determines that sign language information exists in the at least two pieces of decoded information. Specifically, after decoding, when the MCU can restore the decoded data to sign language information, it converts the sign language information into voice information; when the MCU can restore the decoded data to audio information, it mixes the audio information, or converts the audio information into sign language information.
  • the sign language information is a gesture action performed by any deaf person collected at the conference terminal.
  • When a deaf-mute person needs to express an opinion in the conference, he or she faces the conference terminal and performs sign language expression; the conference terminal collects the deaf-mute person's sign language information and sends it to the MCU. After obtaining the sign language information, the MCU converts the sign language information into voice information.
  • After converting the sign language information into voice information, the MCU performs speech synthesis on the converted voice information to generate synthesized voice information.
  • Specifically, the MCU determines whether the number of conference sites where the terminals sending sign language information are located exceeds a first threshold. If the first threshold is not exceeded, the converted voice information is subjected to speech synthesis processing to generate synthesized voice information; if the first threshold is exceeded, speech synthesis processing is performed only on converted voice information from at most the first threshold number of sites to generate synthesized voice information.
  • The first threshold is the maximum number of mixing sites that the MCU can bear; in general, the maximum mix is a four-party mix.
  • Before converting the sign language information into voice information, the MCU also records a first time value for each piece of sign language information obtained after decoding. The MCU records the first time values so that, during the subsequent mixing process, it can select the voice information to participate in the mixing according to the recorded first time value of each piece of sign language information.
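The recorded first time values can drive this later selection. A minimal sketch follows, assuming (as the worked example later in the text suggests) that earlier-sent sign language is preferred when more sites sign than the threshold allows; that tie-breaking rule is an interpretation, not an explicit statement of the patent:

```python
def select_sign_inputs(sign_records, first_threshold=4):
    """sign_records: (first_time_value, site_id) pairs recorded at decode
    time. At most `first_threshold` pieces of sign language information are
    chosen for synthesis; preferring the earliest-sent ones is an assumption
    drawn from the worked example, not an explicit rule of the patent."""
    ordered = sorted(sign_records)          # smallest (earliest) time first
    return [site for _, site in ordered[:first_threshold]]
```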
  • Step 330: Mix the generated synthesized voice information with other decoded audio information.
  • the MCU mixes the generated synthesized speech information with other decoded audio information, such that the users in the multi-party conference receive signals with satisfactory speech quality.
  • Specifically, the MCU mixes the generated synthesized voice information with other decoded audio information as follows: the MCU sorts the first time values recorded in step 320 from large to small and, according to this ordering, performs gain amplification on the synthesized voice information according to preset gain coefficients; the MCU also calculates the energy values of the other decoded audio information, sorts them from large to small, and performs gain amplification on the other decoded audio information according to gain coefficients assigned by this ordering; finally, the MCU mixes the gain-processed synthesized voice information with the gain-processed other decoded audio information.
  • the number of pieces of audio information participating in the mixing process does not exceed the first threshold. Further, when the decoded data information in the decoded information does not include the audio information, the MCU performs a mixing process on the plurality of synthesized voice information generated, and the plurality of synthesized voice information participating in the mixing process does not exceed the first threshold.
  • When the decoded data information in the decoded information includes both sign language information and audio information, the MCU preferentially converts the sign language information into voice information and synthesizes it, mixing the synthesized voice information with other decoded audio information without exceeding the first threshold. When the synthesized voice information alone reaches the first threshold, only the synthesized voice information is mixed and the other decoded audio information is discarded, guaranteeing priority to the expression of deaf-mute people and solving the communication problem between deaf-mute people and normal people.
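The mixing rule above, gain-amplified synthesized speech first and other audio only in the remaining slots, can be sketched as follows; the concrete gain sequences mirror the worked example later in the text and are otherwise assumptions:

```python
def mix_streams(synth_streams, audio_streams, first_threshold=4):
    """Each stream is a list of samples. Synthesized (sign-derived) speech
    keeps priority; plain audio is admitted only while the total number of
    mixed streams stays within the first threshold, and is discarded
    otherwise. Gain sequences are illustrative assumptions."""
    synth_gains = [1.0, 0.8, 0.6, 0.4]      # by send-time order
    audio_gains = [0.8, 0.6, 0.4, 0.2]      # by energy order
    chosen_synth = synth_streams[:first_threshold]
    chosen_audio = audio_streams[:first_threshold - len(chosen_synth)]
    length = max((len(s) for s in chosen_synth + chosen_audio), default=0)
    out = [0.0] * length
    for gains, streams in ((synth_gains, chosen_synth),
                           (audio_gains, chosen_audio)):
        for gain, stream in zip(gains, streams):
            for i, sample in enumerate(stream):
                out[i] += gain * sample
    return out
```

When four sign-language sites are active, the audio slice is empty and normal speech is dropped from the mix, matching the stated priority for deaf-mute expression.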
  • Step 340 Send the audio information after the mixing to the at least two conference sites.
  • The MCU sends the mixed audio information to the at least two conference sites, where the at least two conference sites include the conference sites that sent the data code streams.
  • In this embodiment, the multipoint control unit decodes the data stream sent by the conference terminal; when the decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information, mixes the generated synthesized voice information with other decoded audio information, and sends the mixed audio information to the at least two conference sites, thereby solving the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in multi-party remote video conferences.
  • Further, the embodiment of the present invention includes a step in which the MCU receives user attribute information input by a user or sent by the conference terminal. By receiving the user attribute information, the MCU can, when performing speech synthesis processing, generate synthesized voice information that matches the user attribute information, so that listeners in the conference sites feel the voice is realistic and the sense of communication presence is enhanced.
  • the MCU receives user attribute information input by the user or sent by the conference terminal.
  • Specifically, the deaf-mute person may input his or her own attribute information into the MCU, where the user attribute information includes gender, age, nationality, etc.; alternatively, the deaf-mute person may input the attribute information into the conference terminal in the conference site, and the conference terminal sends it to the MCU.
  • The MCU performs speech synthesis processing on the converted voice information according to the user attribute information, generating synthesized voice information that matches the user attribute information.
  • Specifically, after converting the sign language information into voice information, the MCU obtains the user attribute information corresponding to the sign language information and, according to that information, performs speech synthesis processing on the converted voice information to generate synthesized voice information matching the user attribute information.
  • For example, if the sign language information collected by the conference terminal is made by a Chinese middle-aged male, then after converting the sign language information into voice information, the MCU obtains the user attribute information corresponding to the sign language information and generates, according to it, synthesized voice information with the voice characteristics of a Chinese middle-aged male.
  • The MCU also adjusts the speech rate, tone, etc. of the synthesized voice information according to the speed of the sign language, so that the normal people in the other conference sites feel the voice is more realistic while listening, enhancing the sense of communication presence.
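The attribute-matched synthesis described above can be sketched as a mapping from user attributes to synthesis parameters. Every attribute key, parameter value, and the `sign_speed` adjustment below are illustrative assumptions, not values specified by the patent:

```python
def voice_profile(attrs):
    """Hypothetical mapping from the signer's user attributes (gender, age,
    nationality) and signing speed to speech-synthesis parameters, so the
    generated voice matches the signer. All values are illustrative."""
    profile = {"pitch": 1.0, "rate": 1.0, "language": "zh-CN"}
    if attrs.get("gender") == "male":
        profile["pitch"] = 0.8          # lower pitch for a male voice
    if attrs.get("age", 0) >= 40:
        profile["rate"] = 0.95          # slightly slower, middle-aged voice
    if attrs.get("nationality") == "CN":
        profile["language"] = "zh-CN"
    # Follow the signer's speed: faster signing -> faster speech rate.
    profile["rate"] *= attrs.get("sign_speed", 1.0)
    return profile
```

A real system would feed such a profile into a text-to-speech engine; here the profile itself stands in for that step.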
  • Further, the embodiment of the present invention includes steps of converting the other decoded audio information participating in the mixing process into sign language information and processing the converted sign language information. By performing these steps, the MCU enables the deaf-mute people in the conference sites both to express their own wishes and to understand what the normal people express, thereby better realizing free and effective communication between deaf-mute people and normal people.
  • the MCU converts the other decoded audio information participating in the mixing process into sign language information.
  • the MCU converts other decoded audio information participating in the mixing process into sign language information, and does not convert other decoded audio information that does not participate in the mixing process.
  • The MCU performs scaling processing on the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information.
  • Specifically, the MCU scales the converted sign language information according to a preset ratio, following the ordering of the energy values of the other decoded audio information in step 330: the image corresponding to converted sign language information with a large energy value is enlarged, and the image corresponding to converted sign language information with a small energy value is reduced.
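This energy-ranked scaling can be sketched as follows; the enlarge/shrink ratios are illustrative assumptions, since the patent only specifies a "preset ratio":

```python
def scale_sign_images(energies, enlarge=1.5, shrink=0.75):
    """Return one scale factor per converted sign-language image, ranked by
    the energy of its source audio: the highest-energy image is enlarged,
    all others are reduced. The 1.5/0.75 ratios are assumptions."""
    if not energies:
        return []
    top = max(range(len(energies)), key=lambda i: energies[i])
    return [enlarge if i == top else shrink for i in range(len(energies))]
```

This matches the FIG. 6 description: the loudest site's converted image is enlarged, the quieter sites' images are reduced.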
  • the MCU superimposes the scaled processed sign language information with the current image in the conference terminal for display in at least two conference sites.
  • The MCU uses the superimposition mode or the multi-picture mode to superimpose the scaled sign language information with the current image in the conference terminal for display in at least two conference sites.
  • the superimposition mode specifically refers to superimposing a plurality of sign language information on the current image in the conference terminal and presenting it in the display screen of the conference terminal, and the superposition mode will block the current image in the conference terminal, as shown in FIG. 5 .
  • The multi-picture mode specifically means that the plurality of sign language information and the current image in the conference terminal are presented together on the display screen of the conference terminal; the multi-picture mode does not block the current image in the conference terminal, as shown in FIG. 6.
  • In FIG. 6, the image of each conference site is obtained by converting audio information: the image of site 2 is enlarged after being converted into sign language information because its corresponding audio information has the largest energy value, while the images of site 3 and site 4 are reduced after conversion because their corresponding audio information has the smallest energy values.
  • Similar cases are handled in the same way and are not repeated here.
  • In this way, a deaf-mute person in a conference site can selectively view what the normal people in multiple conference sites wish to express and, according to the energy of the normal people's audio information, focus on the most prominent expression, so that deaf-mute people and normal people can communicate freely and effectively.
  • Further, the embodiment of the present invention includes a step of converting the sign language information into text information. Performing this step facilitates the communication of the multi-party conference; in particular, when the sign language information is not played on the main screen of the conference site, the text information can be used to assist the communication of the conference.
  • the MCU converts the sign language information into text information, and superimposes the text information with the current image in the conference terminal for display in at least two venues.
  • the MCU converts the sign language information into text information, and after the conversion, performs image superimposition processing on the text information and the current image in the conference terminal.
  • Specifically, the sign language information is converted into text information in the form of subtitles; when the sign language information is not played on the main screen, the text information can be used to assist the communication of the conference.
  • For example, suppose the deaf-mute people at conference site 1 and conference site 2 perform sign language expression, and the normal people at conference site 3 and conference site 4 express themselves by voice. The conference terminals at conference site 1 and conference site 2 collect the sign language information of the deaf-mute people, encode it, and send it to the MCU; the conference terminals at conference site 3 and conference site 4 pick up the voices of the normal people, encode them, and send them to the MCU.
  • After decoding, the MCU obtains the first sign language information of conference site 1 and the second sign language information of conference site 2. The MCU records the first time value of each piece of sign language information and converts the sign language information into voice information: the first sign language information is converted into first voice information, and the second sign language information is converted into second voice information.
  • The MCU determines that the number of sites sending sign language information is 2 and compares it with the preset first threshold of 4, which indicates that the maximum number of mixing sites the MCU can bear is four.
  • Because the number of sites sending sign language information does not exceed the first threshold, the MCU performs speech synthesis processing on the first voice information and the second voice information according to the previously received user attribute information, generating first synthesized voice information and second synthesized voice information that match the user attribute information.
  • The MCU sorts by the first time values: the sign language information of conference site 1 was sent before that of conference site 2, so the MCU performs gain amplification on the first synthesized voice information and the second synthesized voice information according to preset gain coefficients, for example, a gain coefficient of 1 for the first synthesized voice information and 0.8 for the second synthesized voice information.
  • The MCU also calculates the energy values of the first audio information and the second audio information and sorts them by energy value; since the energy value of the first audio information is greater than that of the second audio information, the MCU performs gain amplification on them accordingly, for example, with a gain coefficient of 0.8 for the first audio information and 0.6 for the second audio information.
  • the MCU uses the mixing mode to perform mixing processing on the synthesized speech information and audio information generated after the gain processing.
  • the audio information after the mixing is sent to at least two sites.
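The gain coefficients of this worked example can be checked with a short calculation; the unit-amplitude sample used below is an illustrative assumption:

```python
# Worked check of the example's gain coefficients. Synthesized speech is
# ordered by send time (site 1 before site 2); plain audio by energy
# (the first audio information is louder than the second).
gains = {
    "first_synth": 1.0,    # first synthesized voice information
    "second_synth": 0.8,   # second synthesized voice information
    "first_audio": 0.8,    # first audio information (larger energy)
    "second_audio": 0.6,   # second audio information (smaller energy)
}
# Mixing a unit-amplitude sample from every source simply sums the gains:
mixed_sample = sum(gains.values())
```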
  • The MCU converts the first audio information and the second audio information participating in the mixing process into converted first sign language information and converted second sign language information. According to the ordering of the calculated energy values of the first and second audio information, the MCU scales the converted sign language information by a preset ratio; in this embodiment, since the energy value of the first audio information is larger, the converted first sign language information is enlarged and the converted second sign language information is reduced.
  • the MCU superimposes the scaled first sign language information and the second sign language information with the current image in the conference terminal for display in at least two conference sites.
  • The MCU may further convert the first sign language information and the second sign language information into first text information and second text information and, after the conversion, superimpose the first and second text information onto the current image in the conference terminal.
  • Specifically, the first sign language information and the second sign language information are converted into first text information and second text information in the form of subtitles; when the deaf-mute people's first and second sign language information is not played on the main screen of a conference site, the first text information and the second text information can be used to assist the communication of the meeting.
  • The second embodiment of the present invention further provides a device for processing video and audio information in a video conference, which is used to implement the method for processing video and audio information in the video conference of the first embodiment. As shown in FIG. 7, the processing device includes: a decoding unit 710, a conversion synthesizing unit 720, a mixing processing unit 730, and a sending unit 740.
  • the decoding unit 710 is configured to receive a data code stream sent by at least two conference terminals, and decode the data code stream to obtain at least two pieces of decoding information.
• the conversion synthesizing unit 720 is configured to: when sign language information is included in the at least two channels of decoded information in the decoding unit, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information;
  • the mixing processing unit 730 is configured to perform mixing processing on the synthesized voice information generated by the conversion combining unit and other decoded audio information.
  • the sending unit 740 is configured to send the audio information after the mixing in the mixing processing unit to the at least two conference sites.
  • the decoding unit 710 is further configured to: receive user attribute information that is input by the user or sent by the conference terminal;
  • the conversion synthesizing unit 720 is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, and generate synthesized speech information that matches the user attribute information.
• the device further includes: a determining unit 750, configured to determine whether the number of conference sites whose conference terminals send the sign language information exceeds a first threshold, and send the determination result to the conversion synthesizing unit;
• the conversion synthesizing unit 720 is specifically configured to: when receiving a determination result that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted voice information to generate synthesized voice information;
• the conversion synthesizing unit 720 is specifically configured to: when the determining unit determines that the number of conference sites exceeds the first threshold, perform speech synthesis processing only on the converted voice information within the first threshold, to generate synthesized voice information.
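The determining unit's threshold check can be sketched as below. Which excess streams are dropped is not specified by the source, so taking the first `first_threshold` sites is an assumption for illustration.

```python
# Hedged sketch of gating speech synthesis by the number of conference
# sites that sent sign language information (the "first threshold").

def select_streams_for_synthesis(sign_language_sites, first_threshold):
    """Return the site IDs whose converted voice information will undergo
    speech synthesis; sites beyond the threshold are excluded."""
    sites = list(sign_language_sites)
    if len(sites) <= first_threshold:
        return sites                    # not exceeded: synthesize all
    return sites[:first_threshold]      # exceeded: only up to threshold

print(select_streams_for_synthesis(["A", "B"], 3))
print(select_streams_for_synthesis(["A", "B", "C", "D"], 3))
```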
  • the conversion synthesizing unit 720 is further configured to record a first time value of each of the sign language information obtained after decoding;
• the mixing processing unit 730 is specifically configured to: sort the first time values from large to small, and, according to the ordering of the first time values, apply gain amplification to the generated synthesized voice information according to preset gain coefficients.
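The time-ordered gain step can be sketched as follows; the concrete gain coefficients are assumed values, since the source only says they are preset.

```python
# Illustrative sketch: sort the recorded first time values from large to
# small, then amplify each synthesized voice stream by the preset gain
# coefficient for its rank.

def apply_time_ordered_gains(first_time_values, samples,
                             gains=(2.0, 1.5, 1.0)):
    order = sorted(range(len(first_time_values)),
                   key=lambda i: first_time_values[i], reverse=True)
    out = [None] * len(samples)
    for rank, idx in enumerate(order):
        g = gains[min(rank, len(gains) - 1)]
        out[idx] = [s * g for s in samples[idx]]
    return out

# The stream whose first time value is largest receives the largest gain.
print(apply_time_ordered_gains([5.0, 9.0], [[1.0], [1.0]]))
```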
  • the device further includes: a sign language conversion unit 760, configured to convert the other decoded audio information participating in the mixing process into sign language information;
• the scaling processing unit 770 is configured to scale the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;
  • the superimposing unit 780 is configured to superimpose the scaled processed sign language information with the current image in the conference terminal for display in at least two conference sites.
• the device further includes: a text conversion unit 790, configured to convert the sign language information into text information, and superimpose the text information with the current image in the conference terminal for display in at least two conference sites.
• With the device for processing video and audio information in a video conference provided by this embodiment of the present invention, the processing device decodes the data code streams sent by the conference terminals; when decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information; it mixes the generated synthesized voice information with other decoded audio information; and it sends the mixed audio information to at least two conference sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
• the apparatus for processing video and audio information in the video conference may also be implemented as follows, to implement the method for processing video and audio information in the video conference in the first embodiment of the present invention. As shown in FIG. 8, the processing device includes: a network interface 810, a processor 820, and a memory 830.
  • System bus 840 is used to connect network interface 810, processor 820, and memory 830.
  • Network interface 810 is used to communicate with the conference terminal.
• Memory 830 may be persistent storage, such as a hard drive or flash memory, and stores software modules and device drivers.
• The software modules can execute the functions of the above-described method of the present invention; the device drivers may be network and interface drivers.
• When it is determined that sign language information exists in the at least two channels of decoded information, the sign language information is converted into voice information, and speech synthesis processing is performed on the converted voice information to generate synthesized voice information; the generated synthesized voice information is mixed with other decoded audio information; and the mixed audio information is sent to the at least two conference sites.
• By accessing the software components in the memory 830, the processor executes, before the instruction for receiving the data code streams sent by the at least two conference terminals and decoding them to obtain at least two channels of decoded information, instructions for the following procedure: receiving user attribute information input by a user or sent by the conference terminal; the performing speech synthesis processing on the converted voice information to generate synthesized voice information is then specifically: performing speech synthesis processing according to the user attribute information, to generate synthesized voice information that matches the user attribute information.
• By accessing the software components in the memory 830, the processor also executes specific instructions for the process of performing speech synthesis processing on the converted voice information to generate synthesized voice information: when the number of conference sites sending sign language information exceeds a first threshold, the converted voice information within the first threshold is subjected to speech synthesis processing to generate synthesized voice information.
• The processor likewise executes specific instructions for the process of mixing the generated synthesized voice information with other decoded audio information.
• With the processing device for video and audio information in a video conference provided by this embodiment of the present invention, the processing device decodes the data code streams sent by the conference terminals; when decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information; it mixes the generated synthesized voice information with other decoded audio information; and it sends the mixed audio information to at least two conference sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, a software module executed by a processor, or a combination of both.
• the software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Abstract

The embodiments of the present invention relate to a method, device, and system for processing video/audio information in a video conference. The method includes: receiving data code streams sent by at least two conference terminals, and decoding the data code streams to obtain at least two channels of decoded information; when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis on the converted voice information to generate synthesized voice information; mixing the generated synthesized voice information with other decoded audio information; and sending the mixed audio information to at least two conference sites.

Description

Method, Device and System for Processing Video and Audio Information in a Video Conference

This application claims priority to Chinese Patent Application No. 201210560387.7, entitled "Method, Device and System for Processing Video and Audio Information in a Video Conference", filed with the Chinese Patent Office on December 21, 2012, the entire contents of which are incorporated herein by reference.

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a method, device, and system for processing video and audio information in a video conference.

Background

With the development and progress of society, deaf-mute people, as a vulnerable group, are receiving increasing attention from society. In life and work, exchanges between deaf-mute people and hearing people have become more and more frequent. With the spread of modern education, sign language has become a common way for deaf-mute people to communicate with each other and with hearing people. However, sign language requires special training and study; apart from those with a special need, relatively few hearing people master sign language, which creates a communication barrier between hearing people and deaf-mute people.
At present, when deaf-mute people communicate with each other or with hearing people remotely, they generally rely on dedicated equipment or systems, such as terminals that convert between sign language, text, and speech, or a third party who interprets between sign language and speech. These solutions address face-to-face communication between deaf-mute people, or between deaf-mute people and hearing people.
As shown in FIG. 1, FIG. 1 is a schematic diagram of remote communication between deaf-mute people. Deaf-mute person A expresses in sign language; video communication terminal A (such as a video phone, a video conference terminal, or a desktop soft terminal) captures the sign language image and transmits it over the communication network to video communication terminal B. Deaf-mute person B sees the sign language image of deaf-mute person A presented on terminal B and understands what the other party expressed. The reverse holds as well, completing the communication process.
As shown in FIG. 2, FIG. 2 is a schematic diagram of remote communication between a deaf-mute person and a hearing person. The deaf-mute person's sign language image is captured by video communication terminal A and distributed by the multipoint control unit to video communication terminals B and C, where the hearing person and the interpreter are located. After the interpreter understands the image, the interpreter's spoken translation is picked up by the video communication terminal and distributed through the multipoint control unit to terminal B, so the hearing person understands the deaf-mute person's content through the interpreter's voice.
The hearing person's voice is picked up by video communication terminal B and distributed through the multipoint control unit to video communication terminal C, where the interpreter translates it into sign language. The sign language image is captured by the video communication terminal and distributed to the deaf-mute person through the multipoint control unit, so the deaf-mute person understands the hearing person's content through the interpreter's sign language image.
As communication between deaf-mute people and hearing people has gradually grown, the prior art has exposed the following drawbacks: 1) when deaf-mute people communicate with hearing people, every conference requires a third-party interpreter, which increases the labor cost of communication; 2) in a multi-party conference, if several deaf-mute people sign or several hearing people speak at the same time, the interpreter cannot handle the situation well enough to clearly convey the content of each speaker. The prior art therefore has limitations and does not solve the problems faced when deaf-mute people participate in multi-party conference exchanges.

Summary of the Invention

The object of the present invention is to solve the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference, by providing a method, device, and system for processing video and audio information in a video conference.
In a first aspect, an embodiment of the present invention provides a method for processing video and audio information in a video conference, where the video conference includes at least two conference sites and each conference site includes at least one conference terminal. The method includes:

receiving data code streams sent by at least two conference terminals, and decoding the data code streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis processing on the converted voice information to generate synthesized voice information; mixing the generated synthesized voice information with other decoded audio information; and sending the mixed audio information to the at least two conference sites.
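The claimed steps can be sketched end to end. The recognizer, synthesizer, and mixer below are trivial string stand-ins (real sign-language recognition and text-to-speech are outside this sketch), so every name and tag here is an assumption.

```python
# Minimal pipeline sketch of the claimed method: decode -> detect sign
# language -> convert to voice -> synthesize -> mix with other audio.

def process_streams(decoded):
    """decoded: list of {"kind": "sign_language" | "audio", "payload": ...}
    dicts representing the at least two channels of decoded information."""
    converted = [f"voice({d['payload']})" for d in decoded
                 if d["kind"] == "sign_language"]   # sign language -> voice
    synthesized = [f"tts({v})" for v in converted]  # speech synthesis step
    other_audio = [d["payload"] for d in decoded
                   if d["kind"] == "audio"]         # other decoded audio
    return synthesized + other_audio                # mixed output to send

print(process_streams([{"kind": "sign_language", "payload": "g1"},
                       {"kind": "audio", "payload": "a1"}]))
```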
In a first possible implementation, before the receiving data code streams sent by at least two conference terminals and decoding the data code streams to obtain at least two channels of decoded information, the method further includes:

receiving user attribute information input by a user or sent by the conference terminal;

where the performing speech synthesis processing on the converted voice information to generate synthesized voice information is specifically:

performing speech synthesis processing on the converted voice information according to the user attribute information, to generate synthesized voice information that matches the user attribute information.
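Matching the synthesized voice to the user attribute information might look like the sketch below; the attribute keys and voice profile values are invented for illustration, and a real MCU would feed these parameters into a speech synthesis engine.

```python
# Hypothetical mapping from user attribute information (e.g. gender and
# age group reported by the conference terminal) to synthesis parameters.

VOICE_PROFILES = {
    ("female", "adult"): {"pitch_hz": 210, "rate_wpm": 150},
    ("male", "adult"):   {"pitch_hz": 120, "rate_wpm": 150},
    ("female", "child"): {"pitch_hz": 300, "rate_wpm": 140},
}

def synthesis_params(user_attributes):
    """Pick voice parameters that match the user attribute information,
    falling back to a default profile when attributes are missing."""
    key = (user_attributes.get("gender", "female"),
           user_attributes.get("age_group", "adult"))
    return VOICE_PROFILES.get(key, VOICE_PROFILES[("female", "adult")])

print(synthesis_params({"gender": "male", "age_group": "adult"}))
```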
In a second possible implementation, the performing speech synthesis processing on the converted voice information to generate synthesized voice information is specifically:

determining whether the number of conference sites whose conference terminals send the sign language information exceeds a first threshold;

if the number does not exceed the first threshold, performing speech synthesis processing on the converted voice information to generate synthesized voice information;

if the number exceeds the first threshold, performing speech synthesis processing only on the converted voice information within the first threshold, to generate synthesized voice information.
In a third possible implementation, before the converting the sign language information into voice information, the method further includes: recording a first time value of each piece of sign language information obtained after decoding;

where the mixing the generated synthesized voice information with other decoded audio information is specifically:

sorting the first time values from large to small;

according to the ordering of the first time values, applying gain amplification to the generated synthesized voice information according to preset gain coefficients;

calculating energy values of the other decoded audio information, sorting them from large to small, and applying gain amplification to the gain coefficients of the other decoded audio information;

mixing the gain-processed synthesized voice information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
With reference to the first aspect or the third possible implementation of the first aspect, in a fourth possible implementation, the method further includes:

converting the other decoded audio information participating in the mixing into sign language information;

scaling the converted sign language information according to a preset ratio, based on the ordering of the energy values of the other decoded audio information;

superimposing the scaled sign language information onto the current image in the conference terminal for display in at least two conference sites.

In a fifth possible implementation, the method further includes: converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal for display in at least two conference sites.
In a second aspect, an embodiment of the present invention provides a device for processing video and audio information in a video conference, where the device includes:

a decoding unit, configured to receive data code streams sent by at least two conference terminals and decode the data code streams to obtain at least two channels of decoded information;

a conversion synthesizing unit, configured to: when it is determined that sign language information exists in the at least two channels of decoded information in the decoding unit, convert the sign language information into voice information, and perform speech synthesis processing on the converted voice information to generate synthesized voice information;

a mixing processing unit, configured to mix the synthesized voice information generated by the conversion synthesizing unit with other decoded audio information;

a sending unit, configured to send the audio information mixed by the mixing processing unit to the at least two conference sites.
In a first possible implementation, the decoding unit is further configured to receive user attribute information input by a user or sent by the conference terminal;

the conversion synthesizing unit is specifically configured to perform speech synthesis processing on the converted voice information according to the user attribute information, to generate synthesized voice information that matches the user attribute information.
In a second possible implementation, the device further includes:

a determining unit, configured to determine whether the number of conference sites whose conference terminals send the sign language information exceeds a first threshold, and send the determination result to the conversion synthesizing unit;

the conversion synthesizing unit is specifically configured to: when receiving a determination result that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted voice information to generate synthesized voice information;

the conversion synthesizing unit is specifically configured to: when receiving a determination result that the number of conference sites exceeds the first threshold, perform speech synthesis processing only on the converted voice information within the first threshold, to generate synthesized voice information.
In a third possible implementation, the conversion synthesizing unit is further configured to record a first time value of each piece of sign language information obtained after decoding;

the mixing processing unit is specifically configured to: sort the first time values from large to small; according to the ordering of the first time values, apply gain amplification to the generated synthesized voice information according to preset gain coefficients;

calculate energy values of the other decoded audio information, sort them from large to small, and apply gain amplification to the gain coefficients of the other decoded audio information;

mix the gain-processed synthesized voice information with the other decoded audio information, and send the mixed audio information to the at least two conference sites.
With reference to the second aspect or the third possible implementation of the second aspect, in a fourth possible implementation, the device further includes:

a sign language conversion unit, configured to convert the other decoded audio information participating in the mixing into sign language information;

a scaling processing unit, configured to scale the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;

a superimposing unit, configured to superimpose the scaled sign language information onto the current image in the conference terminal for display in at least two conference sites.

In a fifth possible implementation, the device further includes: a text conversion unit, configured to convert the sign language information into text information, and superimpose the text information onto the current image in the conference terminal for display in at least two conference sites.
In a third aspect, an embodiment of the present invention provides a system for processing video and audio information in a video conference, where the system includes: at least two conference sites, each conference site including at least one conference terminal, and the device for processing video and audio information in a video conference according to any one of claims 8 to 14.
Therefore, by applying the method, device, and system for processing video and audio information in a video conference provided by the embodiments of the present invention, the multipoint control unit decodes the data code streams sent by the conference terminals; when decoded data information is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized voice information; it mixes the generated synthesized voice information with other decoded audio information; and it sends the mixed audio information to at least two conference sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of remote communication between deaf-mute people in the prior art;

FIG. 2 is a schematic diagram of remote communication between a deaf-mute person and a hearing person in the prior art;

FIG. 3 is a flowchart of a method for processing video and audio information in a video conference according to Embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of image superimposition according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of image superimposition according to an embodiment of the present invention;

FIG. 7 is a structural diagram of a device for processing video and audio information in a video conference according to Embodiment 2 of the present invention;

FIG. 8 is a structural diagram of a device for processing video and audio information in a video conference according to Embodiment 3 of the present invention.

Detailed Description of the Embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The following describes the information processing method provided by the embodiments of the present invention with reference to FIG. 3. FIG. 3 is a flowchart of a method for processing video and audio information in a video conference according to Embodiment 1 of the present invention. In this embodiment of the present invention, the executing entity is a multipoint control server; the following description uses a multipoint control unit (MCU) as an example. As shown in FIG. 3, this embodiment includes the following steps:

Step 310: Receive data code streams sent by at least two conference terminals, and decode the data code streams to obtain at least two channels of decoded information.

Specifically, in a multi-party conference, as shown in FIG. 4 (a schematic diagram of a system for processing video and audio information in a video conference according to an embodiment of the present invention), each conference terminal resides in a conference site of the multi-party conference, and the video conference includes at least two conference sites. (As illustrated, FIG. 4 includes four conference sites, each with one conference terminal; it can be understood that practical applications are not limited to four sites.) Each conference site includes at least one conference terminal. The conference terminal collects and outputs audio and video information in the conference site and picks up original user information, which is specifically the user's sign language information, voice information, and so on. The conference terminal encodes the original user information to form a data code stream and sends the data code stream to the MCU; the MCU receives the data code streams sent by the conference terminals.
在本发明实施例中, 所述会议终端是指具有采集图像、 拾取声音、 接受外 部输入功能的设备, 并负责将获取的视频图像发送给显示器进行显示, 以及将 收到的音频信息发送给扬声器进行播放, 例如, 视讯终端。  In the embodiment of the present invention, the conference terminal refers to a device that has the functions of acquiring an image, picking up a sound, and accepting an external input function, and is responsible for transmitting the acquired video image to a display for display, and transmitting the received audio information to the speaker. Play, for example, a video terminal.
MCU接收到数据码流后, 对数据码流进行解码, 得到至少两路解码信息, 所述解码信息包括会议终端采集的手语信息, 拾取的音频信息等。  After receiving the data code stream, the MCU decodes the data code stream to obtain at least two pieces of decoding information, where the decoding information includes sign language information collected by the conference terminal, the picked up audio information, and the like.
Step 320: When it is determined that sign language information exists in the at least two channels of decoded information, convert the sign language information into speech information, perform speech synthesis processing on the converted speech information, and generate synthesized speech information.
Specifically, after decoding, when the MCU determines that sign language information exists in the at least two channels of decoded information, the MCU converts the sign language information into speech information.
The MCU determines whether sign language information exists in the at least two channels of decoded information as follows: after decoding, the MCU restores the decoded data. When the decoded data can be restored to sign language information, the MCU converts the sign language information into speech information; when the decoded data can be restored to audio information, the MCU performs mixing processing on the audio information, or converts the audio information into sign language information.
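For illustration only (this sketch is not part of the original disclosure), the dispatch rule above can be expressed as follows; the dictionary layout and the returned tags are assumptions:

```python
def route_decoded(info):
    """Decide how one channel of decoded information is handled."""
    if info["kind"] == "sign_language":
        # restored to sign language: convert to speech, then synthesize
        return ("convert_to_speech", info["payload"])
    if info["kind"] == "audio":
        # restored to audio: mix it, and/or convert it to sign language
        return ("mix_or_convert_to_sign", info["payload"])
    raise ValueError("decoded data could not be restored to a known type")
```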
Further, the sign language information is the gestures made by any deaf-mute user and captured by a conference terminal. When a deaf-mute user in a site needs to express an opinion, the user signs facing the conference terminal; the terminal collects the sign language information, encodes it, and sends it to the MCU. After decoding and obtaining the sign language information, the MCU converts it into speech information.
After converting the sign language information into speech information, the MCU performs speech synthesis on the converted speech information to generate synthesized speech information.
The MCU performs speech synthesis processing on the converted speech information to generate the synthesized speech information as follows:
The MCU determines whether the number of sites whose conference terminals sent sign language information exceeds a first threshold. If the number does not exceed the first threshold, the MCU performs speech synthesis processing on the converted speech information to generate synthesized speech information; if it does exceed the first threshold, the MCU performs speech synthesis processing only on converted speech information within the first threshold to generate synthesized speech information.
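For illustration only (not part of the original disclosure), the first-threshold check can be sketched as follows. The value 4 follows the four-party mixing described as the typical maximum; the stream representation and the use of the recorded first moment value for ordering are assumptions:

```python
FIRST_THRESHOLD = 4  # maximum number of sites whose audio the MCU can mix

def select_for_synthesis(sign_streams):
    """Return the converted speech streams to synthesize, capped at the
    first threshold, kept in order of their recorded first moment values
    (earliest first)."""
    ordered = sorted(sign_streams, key=lambda s: s["first_moment"])
    if len(ordered) <= FIRST_THRESHOLD:
        return ordered
    return ordered[:FIRST_THRESHOLD]
```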
In the embodiments of the present invention, the first threshold is the maximum number of sites whose audio the MCU can mix; typically the maximum is four-party mixing.
Before converting the sign language information into speech information, the MCU also records a first moment value for each piece of sign language information obtained by decoding. The recorded first moment values are used in the subsequent mixing processing to select the speech information that participates in the mixing.
Step 330: Perform mixing processing on the generated synthesized speech information together with the other decoded audio information.
Specifically, the MCU mixes the generated synthesized speech information with the other decoded audio information, so that every user in the multi-party conference receives a signal of satisfactory speech quality.
The MCU mixes the generated synthesized speech information with the other decoded audio information as follows:
The MCU sorts the first moment values recorded in step 320 in descending order and, according to that order, applies gain amplification to the generated synthesized speech information using preset gain coefficients. The MCU also computes the energy values of the other decoded audio information, sorts them in descending order, and applies gain amplification to the other decoded audio information accordingly. The MCU then performs mixing processing on the gain-processed synthesized speech information and the other decoded audio information.
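A minimal sketch of the gain-and-mix step, for illustration only: the gain tables, the stream layout, and the energy definition are assumptions; only the rank-based gain idea and the participation cap come from the text.

```python
def energy(samples):
    """Energy of a stream: sum of squared sample values."""
    return sum(x * x for x in samples)

def mix(synth_streams, audio_streams,
        synth_gains=(1.0, 0.8, 0.6, 0.4),
        audio_gains=(0.8, 0.6, 0.4, 0.2),
        first_threshold=4):
    """synth_streams: (first_moment_value, samples) tuples;
    audio_streams: plain sample lists."""
    # synthesized speech: earlier first moment value => larger gain
    ranked_synth = sorted(synth_streams, key=lambda s: s[0])
    scaled = [[x * g for x in s] for (_, s), g in zip(ranked_synth, synth_gains)]
    # other decoded audio: larger energy value => larger gain
    ranked_audio = sorted(audio_streams, key=energy, reverse=True)
    scaled += [[x * g for x in s] for s, g in zip(ranked_audio, audio_gains)]
    # no more than the first threshold of streams participate in the mix;
    # synthesized speech is listed first, so it is kept preferentially
    scaled = scaled[:first_threshold]
    # sample-wise sum of the gain-scaled streams
    return [sum(col) for col in zip(*scaled)]
```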
In the embodiments of the present invention, the number of pieces of audio information participating in the mixing does not exceed the first threshold. Further, when the decoded information includes no audio information, the MCU performs mixing processing on the multiple pieces of generated synthesized speech information, and the number of pieces participating in the mixing does not exceed the first threshold.
Still further, in the embodiments of the present invention, when the decoded information includes both sign language information and audio information, both deaf-mute users and hearing users are expressing themselves. The MCU then preferentially converts the sign language information into speech information and generates synthesized speech information; while the first threshold is not exceeded, the synthesized speech information is mixed with the other decoded audio information. When the synthesized speech information alone reaches the first threshold, only the synthesized speech information is mixed and the other decoded audio information is discarded, to ensure that the expression of deaf-mute users is processed with priority and to solve the communication problem between deaf-mute users and hearing users.
Step 340: Send the mixed audio information to the at least two sites.
Specifically, after mixing the generated synthesized speech information with the other decoded audio information, the MCU sends the mixed audio information to at least two sites, including sites that sent data code streams and sites that did not.
Therefore, by applying the method, apparatus, and system for processing video and audio information in a video conference provided by the embodiments of the present invention, the multipoint control unit decodes the data code streams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into speech information and processes the converted speech information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to at least two sites. This solves the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
Optionally, before step 310, the embodiments of the present invention further include a step in which the MCU receives user attribute information input by a user or sent by a conference terminal. By receiving the user attribute information, the MCU can generate, during speech synthesis processing, synthesized speech information that matches the user attribute information, so that listeners in the sites perceive the speech as natural, enhancing the sense of presence in the conversation.
The MCU receives user attribute information input by a user or sent by the conference terminal.
Specifically, before the multi-party conference starts, a deaf-mute user may input his or her attribute information into the MCU; the user attribute information includes gender, age, nationality, and the like. Alternatively, the user may input the attribute information into the conference terminal of his or her site, and the terminal forwards it to the MCU.
The MCU performs speech synthesis processing on the converted speech information according to the user attribute information, and generates synthesized speech information that matches the user attribute information.
Specifically, after converting the sign language information into speech information, the MCU obtains the user attribute information corresponding to the sign language information and, according to that information, performs speech synthesis processing on the converted speech information to generate matching synthesized speech information. For example, if the sign language information collected by a conference terminal comes from a middle-aged Chinese man, the MCU, after converting it into speech information, obtains the corresponding user attribute information and synthesizes speech that matches it. The MCU also adjusts the speaking rate and pitch of the synthesized speech according to the pace of the signing, so that hearing users in the other sites perceive the speech as natural, enhancing the sense of presence in the conversation.
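For illustration only (not part of the original disclosure), choosing synthesis parameters from the user attributes and the signing pace might look as follows; every field name, voice label, and the pace-to-rate mapping are assumptions:

```python
def synthesis_params(attrs, signs_per_second):
    """Pick an illustrative voice and speaking rate for the synthesizer."""
    voice = (attrs.get("nationality", "CN"),
             attrs.get("gender", "male"),
             "senior" if attrs.get("age", 40) >= 60 else "adult")
    # faster signing => faster synthesized speech, clamped to a sane range
    rate = max(0.5, min(2.0, signs_per_second / 2.0))
    return {"voice": voice, "rate": rate}
```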
Optionally, after step 340, the embodiments of the present invention further include steps of converting the other decoded audio information participating in the mixing into sign language information and processing the converted sign language information. By performing these steps, the MCU allows deaf-mute users in the sites to understand what hearing users are expressing while expressing their own opinions, so that deaf-mute users and hearing users can better communicate freely and effectively.
The MCU converts the other decoded audio information participating in the mixing processing into sign language information.
Specifically, the MCU converts only the other decoded audio information that participates in the mixing into sign language information; other decoded audio information that does not participate in the mixing is not converted.
The MCU scales the converted sign language information by preset ratios according to the computed energy values of the other decoded audio information.
Specifically, the MCU scales the converted sign language information by preset ratios according to the energy-value ordering obtained in step 330. Further, the image corresponding to converted sign language information with a large energy value is enlarged, and the image corresponding to converted sign language information with a small energy value is reduced.
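The energy-driven scaling rule can be sketched as follows, for illustration only: the highest-energy source is enlarged and the lowest-energy source is shrunk, with the concrete ratios (1.5x and 0.5x) being assumptions not taken from the text.

```python
def display_scales(energies, enlarge=1.5, shrink=0.5):
    """Assign a display scale to each converted sign-language image."""
    scales = [1.0] * len(energies)
    if len(energies) >= 2:
        order = sorted(range(len(energies)), key=energies.__getitem__)
        scales[order[-1]] = enlarge  # largest energy value: enlarge
        scales[order[0]] = shrink    # smallest energy value: shrink
    return scales
```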
Using an overlay mode or a multi-picture mode, the MCU superimposes the scaled sign language information on the current image in the conference terminal for display in at least two sites.
Specifically, using the overlay mode or the multi-picture mode, the MCU superimposes the scaled sign language information on the current image in the conference terminal for display in at least two sites.
The overlay mode superimposes multiple pieces of sign language information directly on the current image in the conference terminal and presents them on the terminal's display; this mode occludes part of the current image, as shown in FIG. 5. The multi-picture mode presents multiple pieces of sign language information together with the current image on the terminal's display; this mode does not occlude the current image, as shown in FIG. 6.
For example, in the overlay mode shown in FIG. 5, the image of each site is obtained by converting audio information. Because the audio information corresponding to site 2 has the largest energy value, its image is enlarged after conversion into sign language information; because the audio information corresponding to sites 3 and 4 has the smallest energy values, their images are reduced after conversion. The multi-picture mode of FIG. 6 works in the same way and is not described again here.
By enlarging or reducing the converted sign language information as described above, a deaf-mute user in a site can selectively watch what hearing users in multiple sites are expressing at the same time and, guided by the audio energy, focus on the most active speaker, so that deaf-mute users and hearing users can better communicate freely and effectively.
Preferably, after step 340, the embodiments of the present invention further include a step of converting the sign language information into text information. This step assists communication in the multi-party conference; in particular, when the sign language information is not played on the main screen of a site, the text information can support the communication.
The MCU converts the sign language information into text information and superimposes the text information on the current image in the conference terminal for display in at least two sites.
Specifically, the MCU converts the sign language information into text information and then superimposes the text information, in the form of subtitles, on the current image in the conference terminal. When a deaf-mute user's sign language information is not played on the main screen of a site, the text information can assist the communication of the conference.
To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The following describes the information processing method provided by the embodiments of the present invention using a specific example. In a multi-party conference, deaf-mute users at sites 1 and 2 express themselves in sign language, and hearing users at sites 3 and 4 express themselves by voice. The conference terminals of sites 1 and 2 collect the deaf-mute users' sign language information, encode it, and send it to the MCU; the terminals of sites 3 and 4 pick up the hearing users' voices, encode them, and send them to the MCU. After decoding, the MCU obtains first sign language information from site 1, second sign language information from site 2, first audio information from site 3, and second audio information from site 4. The MCU records the first moment value of each piece of sign language information and converts the sign language information into speech information: the first sign language information into first speech information and the second sign language information into second speech information.
The MCU determines that the number of sites sending sign language information is 2 and compares it with the first threshold, preset to 4, meaning the MCU can mix at most four sites. Because the number of sites sending sign language information does not exceed the first threshold, the MCU performs speech synthesis processing on the first and second speech information according to the previously received user attribute information, generating first and second synthesized speech information that match the user attribute information.
The MCU sorts the first moment values in descending order. In this embodiment the sign language information of site 1 was sent before that of site 2, so the MCU applies the preset gain coefficients to the first and second synthesized speech information accordingly, for example a gain coefficient of 1 for the first synthesized speech information and 0.8 for the second. The MCU also computes the energy values of the first and second audio information and sorts them in descending order; in this embodiment the energy value of the first audio information is greater than that of the second, so the MCU amplifies the first and second audio information accordingly, for example with a gain of 0.8 for the first audio information and 0.6 for the second. The MCU then uses the mixing mode to mix the gain-processed synthesized speech information and audio information.
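Applying the gain coefficients of this example numerically, for illustration only: the coefficients 1, 0.8, 0.8, and 0.6 come from the text, while the sample value 0.5 is invented.

```python
# gain coefficient per stream, as in the worked example
gains = {
    "site1_synthesized": 1.0,  # earliest first moment value
    "site2_synthesized": 0.8,
    "site3_audio": 0.8,        # larger energy value
    "site4_audio": 0.6,
}
sample = 0.5  # one invented sample value, identical across streams
mixed = sum(g * sample for g in gains.values())
print(round(mixed, 6))  # 1.6
```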
After completing the mixing processing, the MCU sends the mixed audio information to at least two sites. Meanwhile, the MCU converts the first and second audio information participating in the mixing into converted first sign language information and converted second sign language information, and scales the converted sign language information by preset ratios according to the energy-value ordering of the first and second audio information: in this embodiment the energy value of the first audio information is larger, so the converted first sign language information is enlarged and the converted second sign language information is reduced. Using the overlay mode or the multi-picture mode, the MCU superimposes the scaled first and second sign language information on the current image in the conference terminal for display in at least two sites.
Further, the MCU may also convert the first and second sign language information into first and second text information and superimpose them, in the form of subtitles, on the current image in the conference terminal. When the deaf-mute users' first and second sign language information is not played on the main screen of a site, the first and second text information can assist the communication of the conference.
Correspondingly, Embodiment 2 of the present invention further provides an apparatus for processing video and audio information in a video conference, configured to implement the processing method of Embodiment 1. As shown in FIG. 7, the information processing apparatus includes a decoding unit 710, a conversion and synthesis unit 720, a mixing processing unit 730, and a sending unit 740.
The decoding unit 710 is configured to receive data code streams sent by at least two conference terminals and decode the data code streams to obtain at least two channels of decoded information. The conversion and synthesis unit 720 is configured to, when sign language information exists in the at least two channels of decoded information in the decoding unit, convert the sign language information into speech information, perform speech synthesis processing on the converted speech information, and generate synthesized speech information.
The mixing processing unit 730 is configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information.
The sending unit 740 is configured to send the audio information mixed by the mixing processing unit to the at least two sites.
The decoding unit 710 is further configured to receive user attribute information input by a user or sent by the conference terminal.
The conversion and synthesis unit 720 is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, and generate synthesized speech information that matches the user attribute information.
The apparatus further includes a judging unit 750, configured to judge whether the number of sites whose conference terminals sent the sign language information exceeds the first threshold, and to send the judgment result to the conversion and synthesis unit.
The conversion and synthesis unit 720 is specifically configured to, upon receiving from the judging unit a judgment result that the number of sites does not exceed the first threshold, perform speech synthesis processing on the converted speech information to generate synthesized speech information.
The conversion and synthesis unit 720 is specifically configured to, upon receiving from the judging unit a judgment result that the number of sites exceeds the first threshold, perform speech synthesis processing only on converted speech information within the first threshold to generate synthesized speech information.
The conversion and synthesis unit 720 is further configured to record the first moment value of each piece of sign language information obtained by decoding.
The mixing processing unit 730 is specifically configured to: sort the first moment values in descending order; according to that order, apply gain amplification to the generated synthesized speech information using preset gain coefficients;
compute the energy values of the other decoded audio information, sort them in descending order, and amplify the other decoded audio information by the corresponding gain coefficients; and
mix the gain-processed generated synthesized speech information with the other decoded audio information, and send the mixed audio information to the at least two sites.
The apparatus further includes a sign language conversion unit 760, configured to convert the other decoded audio information participating in the mixing processing into sign language information, and a scaling processing unit 770, configured to scale the converted sign language information by preset ratios according to the computed energy values of the other decoded audio information.
The apparatus further includes a superimposing unit 780, configured to superimpose the scaled sign language information on the current image in the conference terminal for display in at least two sites.
The apparatus further includes a text conversion unit 790, configured to convert the sign language information into text information and superimpose the text information on the current image in the conference terminal for display in at least two sites.
By applying the apparatus for processing video and audio information in a video conference provided by the embodiments of the present invention, the processing apparatus decodes the data code streams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into speech information, processes the converted speech information to generate synthesized speech information, mixes the generated synthesized speech information with the other decoded audio information, and sends the mixed audio information to at least two sites, solving the prior-art problem that deaf-mute people cannot communicate freely and effectively when participating in a multi-party remote video conference.
另外, 本发明实施例二提供的视频会议中的视音频信息的处理装置还可以 采用实现方式如下, 用以实现本发明实施例一中的视频会议中的视音频信息的 处理方法, 如图 8所示, 所述信息的处理装置包括: 网络接口 810、 处理器 820 和存储器 830。系统总线 840用于连接网络接口 810、处理器 820和存储器 830。  In addition, the apparatus for processing video and audio information in the video conference provided by the second embodiment of the present invention may be implemented as follows to implement the method for processing video and audio information in the video conference in the first embodiment of the present invention, as shown in FIG. 8. As shown, the processing device of the information includes: a network interface 810, a processor 820, and a memory 830. System bus 840 is used to connect network interface 810, processor 820, and memory 830.
The network interface 810 is configured to communicate with the conference terminals.
The memory 830 may be persistent storage, such as a hard disk drive or flash memory, and holds software modules and device drivers. The software modules can execute the functional modules of the method of the present invention described above; the device drivers may be network and interface drivers.
At startup, these software components are loaded into the memory 830 and then accessed by the processor 820, which executes the following instructions:
receiving data streams sent by at least two conference terminals and decoding the data streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into speech information and performing speech synthesis processing on the converted speech information to generate synthesized speech information; mixing the generated synthesized speech information with the other decoded audio information; and sending the mixed audio information to the at least two conference sites.
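The embodiment does not prescribe concrete recognition or synthesis algorithms. As a rough illustration only, the overall flow could look like the following Python sketch, where `recognize_sign_language` and `text_to_speech` are hypothetical placeholders standing in for real recognizer and TTS components:

```python
# Illustrative sketch of the decode -> convert -> synthesize -> mix flow.
# The stream format, recognizer, and TTS engine are placeholders, not
# components specified by the patent.
def process_streams(decoded_streams):
    """decoded_streams: list of dicts like {"type": "sign"|"audio", "data": ...}."""
    synthesized = []
    audio = []
    for stream in decoded_streams:
        if stream["type"] == "sign":
            text = recognize_sign_language(stream["data"])  # placeholder recognizer
            synthesized.append(text_to_speech(text))        # placeholder TTS
        else:
            audio.append(stream["data"])
    return mix(synthesized + audio)  # mixed audio, sent to every conference site

def recognize_sign_language(frames):
    return "hello"  # stand-in for a real sign-language recognizer

def text_to_speech(text):
    return [0.5] * len(text)  # stand-in waveform, one sample per character

def mix(tracks):
    # sample-wise average over zero-padded tracks (simplified mixing)
    length = max(len(t) for t in tracks)
    padded = [t + [0.0] * (length - len(t)) for t in tracks]
    return [sum(samples) / len(tracks) for samples in zip(*padded)]
```

The per-rank gain weighting described later in this embodiment is omitted here for brevity; the sketch only shows the routing of sign-language streams through conversion before mixing.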
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process before executing the instruction of receiving the data streams sent by the at least two conference terminals and decoding the data streams to obtain at least two channels of decoded information:

receiving user attribute information input by a user or sent by the conference terminal;

the performing of speech synthesis processing on the converted speech information to generate synthesized speech information is specifically:

performing speech synthesis processing on the converted speech information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.
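The text does not define which attributes are used or how they select a voice. A minimal sketch of such attribute-to-voice matching, with illustrative attribute fields and voice names that are not from the patent, might be:

```python
# Hypothetical mapping from user attribute information to a synthesis voice.
# The attribute keys ("gender", "age") and voice identifiers are assumptions
# for illustration only.
def pick_voice(attrs):
    """attrs: e.g. {"gender": "female", "age": 30}; returns a voice identifier."""
    gender = attrs.get("gender", "neutral")
    age = attrs.get("age", 35)
    if gender == "female":
        return "female_young" if age < 40 else "female_mature"
    if gender == "male":
        return "male_young" if age < 40 else "male_mature"
    return "neutral"
```

The selected identifier would then be passed to whatever TTS engine the implementation uses, so the synthesized speech sounds like the signing participant rather than a generic voice.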
Further, after accessing the software components of the memory 830, the processor executes the following specific instructions for the process of performing speech synthesis processing on the converted speech information to generate synthesized speech information:

determining whether the number of conference sites at which the conference terminals sending the sign language information are located exceeds a first threshold;

if the first threshold is not exceeded, performing speech synthesis processing on the converted speech information to generate synthesized speech information;

if the first threshold is exceeded, performing speech synthesis processing on the converted speech information not exceeding the first threshold to generate synthesized speech information.
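On one reading of this threshold rule, when more signing sites are present than the first threshold allows, only the segments from the first sites up to the threshold are synthesized. A sketch under that assumption:

```python
# Assumed interpretation: cap the number of sign-language-derived speech
# segments that enter synthesis at the first threshold.
def select_for_synthesis(converted_segments, first_threshold):
    """converted_segments: speech segments converted from sign language,
    one per signing site, in order. Returns the segments to synthesize."""
    if len(converted_segments) <= first_threshold:
        return converted_segments
    return converted_segments[:first_threshold]
```

This keeps the mixed output intelligible by bounding how many synthesized voices can be active at once.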
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process before executing the instruction of converting the sign language information into speech information:

recording a first time value of each piece of sign language information obtained after decoding;

the processor executes the following specific instructions for the process of mixing the generated synthesized speech information with the other decoded audio information:

sorting the first time values from largest to smallest;

amplifying the generated synthesized speech information by preset gain coefficients according to the order of the first time values;

calculating energy values of the other decoded audio information, sorting the energy values from largest to smallest, and amplifying the gain coefficients of the other decoded audio information;

mixing the gain-processed generated synthesized speech information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
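The exact gain coefficients are left as presets by the text. A sketch of the rank-based gain scheme, assuming a hypothetical gain table indexed by rank (largest time value or largest energy gets the first, largest gain):

```python
# Rank-based gain mixing, assuming a preset gain table `gains` indexed by
# rank; the coefficients themselves are illustrative, not from the patent.
def gain_by_rank(tracks, keys, gains):
    """Amplify each track by the preset gain for its rank when `keys`
    are sorted from largest to smallest."""
    order = sorted(range(len(tracks)), key=lambda i: keys[i], reverse=True)
    out = [None] * len(tracks)
    for rank, i in enumerate(order):
        g = gains[min(rank, len(gains) - 1)]  # reuse last gain for low ranks
        out[i] = [s * g for s in tracks[i]]
    return out

def energy(track):
    return sum(s * s for s in track)

def mix_with_gain(synth, first_times, audio, gains):
    """synth ranked by recorded first time value, other audio ranked by
    energy; both gain-amplified, then summed sample-wise."""
    ranked = (gain_by_rank(synth, first_times, gains)
              + gain_by_rank(audio, [energy(t) for t in audio], gains))
    length = max(len(t) for t in ranked)
    padded = [t + [0.0] * (length - len(t)) for t in ranked]
    return [sum(col) for col in zip(*padded)]
```

Ranking synthesized speech by arrival time and natural audio by energy gives earlier signers and louder speakers more prominence in the mix.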
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process:

converting the other decoded audio information participating in the mixing into sign language information;

scaling the converted sign language information by a preset ratio according to the order of the energy values of the other decoded audio information; and superimposing the scaled sign language information onto the current image in the conference terminal for display at the at least two conference sites.
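The preset scaling ratios are not specified. Assuming a hypothetical ratio table indexed by energy rank, the mapping from audio-stream energy to overlay size could be sketched as:

```python
# Assumed scheme: the sign-language rendering converted from each audio
# stream is displayed at a preset scale chosen by that stream's energy
# rank, so louder speakers get larger overlays. Ratios are illustrative.
def overlay_scales(energies, scales):
    """energies: one energy value per audio stream.
    scales: preset display ratios, largest first.
    Returns the scale assigned to each stream, in input order."""
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    out = [0.0] * len(energies)
    for rank, i in enumerate(order):
        out[i] = scales[min(rank, len(scales) - 1)]
    return out
```

Each scaled rendering would then be composited onto the current conference image before it is sent to the sites.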
Further, after accessing the software components of the memory 830, the processor executes instructions for the following process:

converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal for display at the at least two conference sites.
By applying the apparatus for processing video and audio information in a video conference provided by this embodiment of the present invention, the processing apparatus decodes the data streams sent by the conference terminals; when the decoded data information is sign language information, it converts the sign language information into speech information and processes the converted speech information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to the at least two conference sites. This solves the prior-art problem that deaf-mute participants in a multi-party remote video conference cannot communicate freely and effectively.
A person skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further illustrate the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for processing video and audio information in a video conference, wherein the video conference includes at least two conference sites and each conference site includes at least one conference terminal, the method comprising: receiving data streams sent by at least two conference terminals and decoding the data streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into speech information and performing speech synthesis processing on the converted speech information to generate synthesized speech information; mixing the generated synthesized speech information with the other decoded audio information; and sending the mixed audio information to the at least two conference sites.
2. The method for processing video and audio information in a video conference according to claim 1, wherein before the receiving of the data streams sent by the at least two conference terminals and the decoding of the data streams to obtain at least two channels of decoded information, the method further comprises:

receiving user attribute information input by a user or sent by the conference terminal; and

the performing of speech synthesis processing on the converted speech information to generate synthesized speech information is specifically:

performing speech synthesis processing on the converted speech information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.
3. The method for processing video and audio information in a video conference according to claim 1, wherein the performing of speech synthesis processing on the converted speech information to generate synthesized speech information is specifically:

determining whether the number of conference sites at which the conference terminals sending the sign language information are located exceeds a first threshold;

if the first threshold is not exceeded, performing speech synthesis processing on the converted speech information to generate synthesized speech information;

if the first threshold is exceeded, performing speech synthesis processing on the converted speech information not exceeding the first threshold to generate synthesized speech information.
4. The method for processing video and audio information in a video conference according to claim 1, wherein before the converting of the sign language information into speech information, the method further comprises: recording a first time value of each piece of sign language information obtained after decoding; and

the mixing of the generated synthesized speech information with the other decoded audio information is specifically:

sorting the first time values from largest to smallest; amplifying the generated synthesized speech information by preset gain coefficients according to the order of the first time values;

calculating energy values of the other decoded audio information, sorting the energy values from largest to smallest, and amplifying the gain coefficients of the other decoded audio information;

mixing the gain-processed generated synthesized speech information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
5. The method for processing video and audio information in a video conference according to claim 4, wherein the method further comprises:

converting the other decoded audio information participating in the mixing into sign language information;

scaling the converted sign language information by a preset ratio according to the order of the energy values of the other decoded audio information; and

superimposing the scaled sign language information onto the current image in the conference terminal for display at at least two conference sites.
6. The method for processing video and audio information in a video conference according to claim 1, wherein the method further comprises:

converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal for display at at least two conference sites.
7. An apparatus for processing video and audio information in a video conference, wherein the apparatus comprises: a decoding unit, configured to receive data streams sent by at least two conference terminals and decode the data streams to obtain at least two channels of decoded information;

a conversion and synthesis unit, configured to, when it is determined that sign language information exists in the at least two channels of decoded information in the decoding unit, convert the sign language information into speech information and perform speech synthesis processing on the converted speech information to generate synthesized speech information;

a mixing processing unit, configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information; and

a sending unit, configured to send the audio information mixed by the mixing processing unit to the at least two conference sites.
8. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the decoding unit is further configured to receive user attribute information input by a user or sent by the conference terminal; and

the conversion and synthesis unit is specifically configured to perform speech synthesis processing on the converted speech information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.
9. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the apparatus further comprises:

a determining unit, configured to determine whether the number of conference sites at which the conference terminals sending the sign language information are located exceeds a first threshold, and send the determination result to the conversion and synthesis unit;

the conversion and synthesis unit is specifically configured to, upon receiving from the determining unit a determination result that the number of conference sites does not exceed the first threshold, perform speech synthesis processing on the converted speech information to generate synthesized speech information; and

the conversion and synthesis unit is specifically configured to, upon receiving from the determining unit a determination result that the number of conference sites exceeds the first threshold, perform speech synthesis processing on the converted speech information not exceeding the first threshold to generate synthesized speech information.
10. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the conversion and synthesis unit is further configured to record a first time value of each piece of sign language information obtained after decoding; and

the mixing processing unit is specifically configured to:

sort the first time values from largest to smallest;

amplify the generated synthesized speech information by preset gain coefficients according to the order of the first time values;

calculate energy values of the other decoded audio information, sort the energy values from largest to smallest, and amplify the gain coefficients of the other decoded audio information; and

mix the gain-processed generated synthesized speech information with the other decoded audio information, and send the mixed audio information to the at least two conference sites.
11. The apparatus for processing video and audio information in a video conference according to claim 10, wherein the apparatus further comprises:

a sign language conversion unit, configured to convert the other decoded audio information participating in the mixing into sign language information;

a scaling processing unit, configured to scale the converted sign language information by a preset ratio according to the calculated energy values of the other decoded audio information; and

a superimposing unit, configured to superimpose the scaled sign language information onto the current image in the conference terminal for display at at least two conference sites.
12. The apparatus for processing video and audio information in a video conference according to claim 7, wherein the apparatus further comprises:

a text conversion unit, configured to convert the sign language information into text information and superimpose the text information onto the current image in the conference terminal for display at at least two conference sites.
13. A video conference system, wherein the system comprises: at least two conference sites, each conference site including at least one conference terminal, and the apparatus for processing video and audio information in a video conference according to any one of claims 8 to 14.
PCT/CN2013/083170 2012-12-21 2013-09-10 Method, device and system for processing video/audio information in video conference WO2014094461A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210560387.7 2012-12-21
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, device and system for processing video and audio information in a video conference

Publications (1)

Publication Number Publication Date
WO2014094461A1 (en)

Family

ID=47858189

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/083170 WO2014094461A1 (en) 2012-12-21 2013-09-10 Method, device and system for processing video/audio information in video conference

Country Status (2)

Country Link
CN (1) CN102984496B (en)
WO (1) WO2014094461A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984496B (en) * 2012-12-21 2015-08-19 华为技术有限公司 Method, device and system for processing video and audio information in a video conference
CN103686059B (en) * 2013-09-23 2017-04-05 广东威创视讯科技股份有限公司 Distributed mixed audio processing method and system
CN107566863A (en) * 2016-06-30 2018-01-09 中兴通讯股份有限公司 A kind of exchange of information methods of exhibiting, device and equipment, set top box
CN108881783B (en) * 2017-05-09 2020-09-08 腾讯科技(深圳)有限公司 Method and device for realizing multi-person conversation, computer equipment and storage medium
US11064000B2 (en) * 2017-11-29 2021-07-13 Adobe Inc. Accessible audio switching for client devices in an online conference
CN110401623A (en) * 2018-04-25 2019-11-01 中国移动通信有限公司研究院 A kind of multiside calling method, platform, terminal, medium, equipment and system
CN110555329A (en) * 2018-05-31 2019-12-10 苏州欧力机器人有限公司 Sign language translation method, terminal and storage medium
CN110083250A (en) * 2019-05-14 2019-08-02 长沙手之声信息科技有限公司 A kind of accessible conference system for supporting sign language translation on line
KR102178175B1 (en) * 2019-12-09 2020-11-12 김경철 User device and method of controlling thereof
KR102178174B1 (en) * 2019-12-09 2020-11-12 김경철 User device, broadcasting device, broadcasting system and method of controlling thereof
CN113660449B (en) * 2021-10-20 2022-03-01 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device

Citations (6)

Publication number Priority date Publication date Assignee Title
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN101115088A (en) * 2007-08-07 2008-01-30 周运南 Mobile phone dedicated for deaf-mutes
CN101309390A (en) * 2007-05-17 2008-11-19 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
CN101594434A (en) * 2009-06-16 2009-12-02 中兴通讯股份有限公司 The sign language processing method and the sign language processing mobile terminal of portable terminal
CN102387338A (en) * 2010-09-03 2012-03-21 中兴通讯股份有限公司 Distributed type video processing method and video session system
CN102984496A (en) * 2012-12-21 2013-03-20 华为技术有限公司 Processing method, device and system of video and audio information in video conference

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN101179693B (en) * 2007-09-26 2011-02-02 深圳市迪威视讯股份有限公司 Mixed audio processing method of session television system
JP5284360B2 (en) * 2007-09-26 2013-09-11 フラウンホッファー−ゲゼルシャフト ツァ フェルダールング デァ アンゲヴァンテン フォアシュンク エー.ファオ Apparatus and method for extracting ambient signal in apparatus and method for obtaining weighting coefficient for extracting ambient signal, and computer program
CN101453611A (en) * 2007-12-07 2009-06-10 希姆通信息技术(上海)有限公司 Method for video communication between the deaf and the normal
KR20110100649A (en) * 2008-12-15 2011-09-14 코닌클리케 필립스 일렉트로닉스 엔.브이. Method and apparatus for synthesizing speech
US8447023B2 (en) * 2010-02-01 2013-05-21 Polycom, Inc. Automatic audio priority designation during conference

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
CN101309390A (en) * 2007-05-17 2008-11-19 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN101115088A (en) * 2007-08-07 2008-01-30 周运南 Mobile phone dedicated for deaf-mutes
CN101594434A (en) * 2009-06-16 2009-12-02 中兴通讯股份有限公司 The sign language processing method and the sign language processing mobile terminal of portable terminal
CN102387338A (en) * 2010-09-03 2012-03-21 中兴通讯股份有限公司 Distributed type video processing method and video session system
CN102984496A (en) * 2012-12-21 2013-03-20 华为技术有限公司 Processing method, device and system of video and audio information in video conference

Also Published As

Publication number Publication date
CN102984496B (en) 2015-08-19
CN102984496A (en) 2013-03-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 13865919; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 13865919; Country of ref document: EP; Kind code of ref document: A1)