CN102984496A - Processing method, device and system of video and audio information in video conference - Google Patents


Info

Publication number
CN102984496A
CN102984496A (application CN201210560387.7A; granted as CN102984496B)
Authority
CN
China
Prior art keywords
information
sign language
meeting
place
audio
Prior art date
Legal status
Granted
Application number
CN2012105603877A
Other languages
Chinese (zh)
Other versions
CN102984496B (en)
Inventor
倪为
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority application: CN201210560387.7A (granted as CN102984496B)
Publication of CN102984496A
PCT application: PCT/CN2013/083170 (WO2014094461A1)
Application granted
Publication of CN102984496B
Legal status: Active

Classifications

    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04N7/15 Conference systems

Abstract

Embodiments of the invention relate to a method, device and system for processing video and audio information in a video conference. The method comprises: receiving data bitstreams sent by at least two conference terminals and decoding them to obtain at least two channels of decoded information; upon determining that sign language information is present in the decoded information, converting the sign language information into speech information and performing speech synthesis on the converted speech information to generate synthesized speech information; mixing the generated synthesized speech information with the other decoded audio information; and sending the mixed audio information to at least two conference sites.

Description

Method, apparatus and system for processing audio and video information in a video conference
Technical field
The present invention relates to the field of communication technologies, and in particular to a method, apparatus and system for processing the audio and video information in a video conference.
Background technology
With the development and progress of society, hearing- and speech-impaired people, as a disadvantaged group, receive more and more attention and concern from society. In daily life and at work, communication between hearing-impaired people and hearing people is becoming increasingly common. With the spread of modern education, sign language has become a common way for deaf people to communicate with each other and with hearing people. However, sign language requires special training and study; apart from those with a particular need, relatively few hearing people master it, which creates a communication barrier between hearing and hearing-impaired people.
At present, remote communication between hearing-impaired people, or between a hearing-impaired person and a hearing person, is generally realized with dedicated equipment or systems, such as terminals that convert between sign language, text and speech, or by involving a third-party sign-language/speech interpreter, thereby addressing the problem of exchange between hearing-impaired people or between hearing-impaired and hearing people.
As shown in Fig. 1, which is a schematic diagram of remote communication between hearing-impaired people: deaf user A signs; after video communication terminal A (such as a videophone, video conference terminal or desktop software client) captures the sign-language image, it is transmitted over the communication network to video communication terminal B, where deaf user B sees the sign-language image of user A on terminal B's display and understands what the other party is expressing. The reverse direction works in the same way, completing the whole communication process.
As shown in Fig. 2, a schematic diagram of remote communication between a hearing-impaired person and a hearing person: the deaf user's sign-language image is captured by video communication terminal A and distributed through a multipoint control unit to video communication terminals B and C, where the hearing person and the interpreter are located, for display. After the interpreter understands it, he or she translates it into speech, which is picked up by the video communication terminal and distributed through the multipoint control unit to terminal B; the hearing person understands what the deaf user expressed through the interpreter's voice.
The hearing person's speech is picked up by terminal B and distributed through the multipoint control unit to terminal C, where the interpreter translates it into sign language; the sign-language image is captured by the video communication terminal and distributed through the multipoint control unit to the deaf user, who understands what the hearing person expressed through the interpreter's sign-language image.
As exchanges between hearing-impaired and hearing people gradually grow, the prior art exposes the following drawbacks: 1) when hearing-impaired and hearing people communicate, every meeting requires the participation of a third-party interpreter, which increases the labor cost of communication; 2) in a multi-party conference, if several deaf participants sign at the same time, or several hearing participants speak at the same time, the interpreter cannot handle the situation well enough to express each speaker's content clearly. The prior art is therefore limited and does not solve the problems faced when hearing-impaired people take part in multi-party conference exchanges.
Summary of the invention
The object of the invention is to solve the prior-art problem that hearing-impaired people cannot communicate freely and effectively when participating in a multi-party remote video conference, by providing a method, device and system for processing the audio and video information in a video conference.
In a first aspect, an embodiment of the invention provides a method for processing the audio and video information in a video conference, where the video conference comprises at least two conference sites, each site comprising at least one conference terminal, and the method comprises:
receiving data bitstreams sent by at least two conference terminals, decoding the bitstreams, and obtaining at least two channels of decoded information;
when it is determined that sign language information is present in the at least two channels of decoded information, converting the sign language information into speech information, performing speech synthesis on the converted speech information, and generating synthesized speech information;
mixing the generated synthesized speech information with the other decoded audio information;
sending the mixed audio information to the at least two conference sites.
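The four steps of the first aspect can be sketched as a minimal processing loop. Note this is a sketch, not the patented implementation: every stage function (decoding, sign-language detection, conversion, synthesis, mixing) is a hypothetical stand-in supplied by the caller, and only the control flow reflects the claimed method.

```python
def process_streams(bitstreams, decode, is_sign_language, sign_to_text,
                    synthesize, mix):
    """Decode each terminal's bitstream, turn any sign-language track
    into synthesized speech, then mix it with the remaining decoded
    audio and return the mix for distribution to the sites."""
    decoded = [decode(b) for b in bitstreams]
    speech = [synthesize(sign_to_text(d)) for d in decoded
              if is_sign_language(d)]
    audio = [d for d in decoded if not is_sign_language(d)]
    return mix(speech + audio)

# Toy stand-ins so the sketch runs: a "bitstream" is a tagged tuple.
mixed = process_streams(
    [("sign", "HELLO"), ("audio", [0.1, 0.2])],
    decode=lambda b: b,
    is_sign_language=lambda d: d[0] == "sign",
    sign_to_text=lambda d: d[1].lower(),
    synthesize=lambda text: ("speech", text),
    mix=lambda tracks: tracks,
)
```

In a real MCU each stand-in would be a codec, recognizer, TTS engine and audio mixer respectively; the point of the sketch is only the ordering of the stages.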
In a first possible implementation, before receiving the data bitstreams sent by the at least two conference terminals, decoding them and obtaining the at least two channels of decoded information, the method further comprises:
receiving user attribute information input by the user or sent by the conference terminal;
and performing speech synthesis on the converted speech information to generate synthesized speech information is specifically:
performing, according to the user attribute information, speech synthesis on the converted speech information to generate synthesized speech information matching the user attribute information.
In a second possible implementation, performing speech synthesis on the converted speech information to generate synthesized speech information is specifically:
judging whether the number of conference sites whose conference terminals send sign language information exceeds a first threshold;
if the first threshold is not exceeded, performing speech synthesis on the converted speech information to generate synthesized speech information;
if the first threshold is exceeded, performing speech synthesis on only as much of the converted speech information as does not exceed the first threshold, to generate synthesized speech information.
In a third possible implementation, before converting the sign language information into speech information, the method further comprises: recording the first-moment value of each piece of sign language information obtained after decoding;
and mixing the generated synthesized speech information with the other decoded audio information is specifically:
sorting the first-moment values from largest to smallest;
applying, according to the ordering of the first-moment values, gain amplification to the generated synthesized speech information using preset gain coefficients;
calculating the energy values of the other decoded audio information, sorting them from largest to smallest, and applying gain amplification to the other decoded audio information using its gain coefficients;
mixing the gain-processed synthesized speech information with the other decoded audio information, and sending the mixed audio information to the at least two conference sites.
With reference to the first aspect or the third possible implementation of the first aspect, in a fourth possible implementation the method further comprises:
converting the other decoded audio information participating in the mixing into sign language information;
scaling the converted sign language information by preset ratios according to the ordering of the energy values of the other decoded audio information;
superimposing the scaled sign language information onto the current image in the conference terminal, for display at the at least two conference sites.
In a fifth possible implementation, the method further comprises: converting the sign language information into text information, and superimposing the text information onto the current image in the conference terminal, for display at the at least two conference sites.
In a second aspect, an embodiment of the invention provides a device for processing the audio and video information in a video conference, the device comprising:
a decoding unit, configured to receive the data bitstreams sent by at least two conference terminals, decode them, and obtain at least two channels of decoded information;
a conversion and synthesis unit, configured to convert sign language information into speech information when the decoding unit determines that sign language information is present in the at least two channels of decoded information, perform speech synthesis on the converted speech information, and generate synthesized speech information;
a mixing unit, configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information;
a sending unit, configured to send the audio information mixed by the mixing unit to the at least two conference sites.
In a first possible implementation, the decoding unit is further configured to receive user attribute information input by the user or sent by the conference terminal;
and the conversion and synthesis unit is specifically configured to perform, according to the user attribute information, speech synthesis on the converted speech information and generate synthesized speech information matching the user attribute information.
In a second possible implementation, the device further comprises:
a judging unit, configured to judge whether the number of conference sites whose conference terminals send sign language information exceeds a first threshold, and send the judgment result to the conversion and synthesis unit;
the conversion and synthesis unit is specifically configured to, upon receiving a judgment result that the number of sites does not exceed the first threshold, perform speech synthesis on the converted speech information and generate synthesized speech information;
and to, upon receiving a judgment result that the number of sites exceeds the first threshold, perform speech synthesis on only as much of the converted speech information as does not exceed the first threshold, and generate synthesized speech information.
In a third possible implementation, the conversion and synthesis unit is further configured to record the first-moment value of each piece of sign language information obtained after decoding;
the mixing unit is specifically configured to: sort the first-moment values from largest to smallest;
apply, according to the ordering of the first-moment values, gain amplification to the generated synthesized speech information using preset gain coefficients;
calculate the energy values of the other decoded audio information, sort them from largest to smallest, and apply gain amplification to the other decoded audio information using its gain coefficients;
and mix the gain-processed synthesized speech information with the other decoded audio information and send the mixed audio information to the at least two conference sites.
With reference to the second aspect or the third possible implementation of the second aspect, in a fourth possible implementation the device further comprises:
a sign language conversion unit, configured to convert the other decoded audio information participating in the mixing into sign language information;
a scaling unit, configured to scale the converted sign language information by preset ratios according to the calculated energy values of the other decoded audio information;
a superimposing unit, configured to superimpose the scaled sign language information onto the current image in the conference terminal, for display at the at least two conference sites.
In a fifth possible implementation, the device further comprises: a text conversion unit, configured to convert the sign language information into text information and superimpose the text information onto the current image in the conference terminal, for display at the at least two conference sites.
In a third aspect, an embodiment of the invention provides a system for processing the audio and video information in a video conference, the system comprising: at least two conference sites, each site comprising at least one conference terminal, and the device for processing the audio and video information in a video conference according to any one of claims 8 to 14.
Therefore, by applying the method, device and system for processing the audio and video information in a video conference provided by the embodiments of the invention, after the multipoint control unit decodes the data bitstreams sent by the conference terminals, when the decoded data is sign language information the unit converts the sign language information into speech information and processes the converted speech information to generate synthesized speech information; the generated synthesized speech information is mixed with the other decoded audio information; and the mixed audio information is sent to at least two conference sites. This solves the prior-art problem that hearing-impaired people cannot communicate freely and effectively when participating in a multi-party remote video conference.
Brief description of the drawings
Fig. 1 is a schematic diagram of remote communication between hearing-impaired people in the prior art;
Fig. 2 is a schematic diagram of remote communication between a hearing-impaired person and a hearing person in the prior art;
Fig. 3 is a flowchart of processing the audio and video information in a video conference according to Embodiment 1 of the invention;
Fig. 4 is a schematic diagram of a system for processing the audio and video information in a video conference according to an embodiment of the invention;
Fig. 5 is a schematic diagram of image superimposition according to an embodiment of the invention;
Fig. 6 is a schematic diagram of image superimposition according to an embodiment of the invention;
Fig. 7 is a structural diagram of a device for processing the audio and video information in a video conference according to Embodiment 2 of the invention;
Fig. 8 is a structural diagram of a device for processing the audio and video information in a video conference according to Embodiment 3 of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.
The information processing method provided by the embodiments of the invention is illustrated below taking Fig. 3 as an example; Fig. 3 is a flowchart of processing the audio and video information in a video conference according to Embodiment 1. In the embodiments of the invention the executing entity is a multipoint control server; the description below takes a multipoint control unit (Multipoint Control Unit, MCU for short) as an example. As shown in Fig. 3, this embodiment comprises the following steps:
Step 310: receive the data bitstreams sent by at least two conference terminals, decode them, and obtain at least two channels of decoded information.
Specifically, in a multi-party conference, as shown in Fig. 4 (a schematic diagram of a system for processing the audio and video information in a video conference according to an embodiment of the invention), a conference terminal is present at each site of the multi-party conference. The video conference comprises at least two sites (Fig. 4 shows four sites, each with one conference terminal; it will be appreciated that practical applications are not limited to four sites), and each site comprises at least one conference terminal. The conference terminal is used to capture and output the audio and video information of the site and to pick up original user information, which is specifically the user's sign language information, speech information and so on. The conference terminal encodes the original user information into a data bitstream and sends the bitstream to the MCU, which receives it.
In the embodiments of the invention, a conference terminal is a device that can capture images, pick up sound and accept external input, and that is responsible for sending the obtained video images to a display and the received audio information to a loudspeaker for playback, for example a video terminal.
After receiving the data bitstreams, the MCU decodes them and obtains at least two channels of decoded information, which comprises the sign language information captured and the audio information picked up by the conference terminals.
Step 320: when it is determined that sign language information is present in the at least two channels of decoded information, convert the sign language information into speech information, perform speech synthesis on the converted speech information, and generate synthesized speech information.
Specifically, after decoding, when the MCU determines that sign language information is present in the at least two channels of decoded information, the MCU converts the sign language information into speech information.
Determining that sign language information is present in the at least two channels of decoded information is specifically: after decoding, the MCU restores the decoded data; when the decoded data can be restored to sign language information, the MCU converts the sign language information into speech information; when the decoded data can be restored to audio information, the audio information is mixed, or converted into sign language information, and so on.
Further, the sign language information is the gesture motions, captured by the conference terminal, made by any deaf participant. When a deaf participant at a site needs to express an opinion, the participant signs facing the conference terminal; the terminal captures the sign language information, encodes it and sends it to the MCU, which decodes it, obtains the sign language information, and then converts it into speech information.
After converting the sign language information into speech information, the MCU performs speech synthesis on the converted speech information and generates synthesized speech information.
The MCU performing speech synthesis on the converted speech information and generating synthesized speech information is specifically:
the MCU judges whether the number of conference sites whose conference terminals send sign language information exceeds a first threshold; if the first threshold is not exceeded, the converted speech information is synthesized to generate synthesized speech information; if the first threshold is exceeded, only as much of the converted speech information as does not exceed the first threshold is synthesized to generate synthesized speech information.
In the embodiments of the invention, the first threshold is the maximum number of sites whose audio the MCU can mix; the maximum is generally four-party mixing.
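The threshold check above can be sketched as follows. The patent only says that streams beyond the first threshold are not synthesized; keeping the earliest streams (by their recorded first-moment values) is an assumption made here for concreteness, and `MAX_MIX_SITES = 4` reflects the four-party mixing limit mentioned in the text.

```python
MAX_MIX_SITES = 4  # the "first threshold": most sites the MCU can mix at once

def select_for_synthesis(sign_streams):
    """sign_streams: list of (site_id, first_moment) pairs, where
    first_moment is the time the sign-language track was first decoded.
    Returns the site ids whose streams will be synthesized; when over
    the threshold, the earliest streams win (an assumption)."""
    if len(sign_streams) <= MAX_MIX_SITES:
        return [site for site, _ in sign_streams]
    ordered = sorted(sign_streams, key=lambda s: s[1])
    return [site for site, _ in ordered[:MAX_MIX_SITES]]
```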
Before converting the sign language information into speech information, the MCU also records the first-moment value of each piece of sign language information obtained after decoding. The MCU records the first-moment values so that, in the subsequent mixing process, it can select the speech information that participates in the mixing according to the recorded first-moment value of each piece of sign language information.
Step 330: mix the generated synthesized speech information with the other decoded audio information.
Specifically, the MCU mixes the generated synthesized speech information with the other decoded audio information, so that the users in the multi-party conference receive a signal of satisfactory voice quality.
The MCU mixing the generated synthesized speech information with the other decoded audio information is specifically:
the MCU sorts the first-moment values of step 320 from largest to smallest; according to that ordering, the MCU applies gain amplification to the generated synthesized speech information using preset gain coefficients; it also calculates the energy values of the other decoded audio information, sorts them from largest to smallest, and applies gain amplification to the other decoded audio information using its gain coefficients; the MCU then mixes the gain-processed synthesized speech information with the other decoded audio information.
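A minimal sketch of this gain-then-mix step, assuming tracks are lists of float samples. The gain coefficients, the per-rank gain assignment, and the simple additive mix are illustrative assumptions; the patent only specifies that synthesized speech is ordered by first-moment value, other audio by energy, and that preset gain coefficients are applied before mixing.

```python
def mix_tracks(synth_tracks, other_tracks, gains=(1.0, 0.8, 0.6, 0.5)):
    """synth_tracks: list of (first_moment, samples) for synthesized
    speech; other_tracks: raw sample lists of other decoded audio.
    Synthesized speech is ranked by first-moment value (largest first)
    and the other audio by energy (largest first); each rank gets a
    preset gain before a simple additive mix."""
    def energy(samples):
        return sum(s * s for s in samples)

    synth = [t for _, t in sorted(synth_tracks, key=lambda x: x[0],
                                  reverse=True)]
    others = sorted(other_tracks, key=energy, reverse=True)
    ranked = synth + others
    n = max(len(t) for t in ranked)
    mixed = [0.0] * n
    for rank, track in enumerate(ranked):
        g = gains[min(rank, len(gains) - 1)]
        for i, s in enumerate(track):
            mixed[i] += g * s
    return mixed
```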
In the embodiments of the invention, the number of audio channels participating in the mixing does not exceed the first threshold.
Further, when the decoded information contains no audio information, the MCU mixes the multiple pieces of generated synthesized speech information, and the number of pieces of synthesized speech information participating in the mixing does not exceed the first threshold.
Further, in the embodiments of the invention, when the decoded information contains both sign language information and audio information, both deaf and hearing participants are expressing themselves. The MCU then preferentially converts the sign language information into speech information and generates synthesized speech information; when the first threshold is not exceeded, the synthesized speech information is mixed with the other decoded audio information; when the synthesized speech information alone exceeds the first threshold, only the synthesized speech information is mixed and the other decoded audio information is discarded. This ensures that the expression of deaf participants is processed with priority and addresses the communication problem between deaf and hearing participants.
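The priority rule just described can be sketched as a small selection function. The inputs are opaque stream handles; the only logic taken from the text is that sign-derived synthesized speech fills the mixing slots first and other audio only fills whatever slots remain.

```python
MAX_MIX = 4  # the first threshold on mixed channels

def choose_mix_inputs(synth_speech, other_audio):
    """Speech synthesized from sign language is mixed preferentially:
    if it alone reaches the threshold, the other decoded audio is
    discarded for this mixing round; otherwise the other audio fills
    the remaining slots."""
    if len(synth_speech) >= MAX_MIX:
        return synth_speech[:MAX_MIX]
    return synth_speech + other_audio[:MAX_MIX - len(synth_speech)]
```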
Step 340: send the mixed audio information to the at least two conference sites.
Specifically, after mixing the generated synthesized speech information with the other decoded audio information, the MCU sends the mixed audio information to at least two conference sites, which comprise both the sites that sent data bitstreams and the sites that did not.
Therefore, by applying the method, device and system provided by the embodiments of the invention, after the multipoint control unit decodes the data bitstreams sent by the conference terminals, when the decoded data is sign language information the unit converts the sign language information into speech information and processes the converted speech information to generate synthesized speech information; the generated synthesized speech information is mixed with the other decoded audio information; and the mixed audio information is sent to at least two conference sites, thereby solving the prior-art problem that hearing-impaired people cannot communicate freely and effectively when participating in a multi-party remote video conference.
Optionally, before step 310, the embodiment of the invention further comprises a step in which the MCU receives user attribute information input by the user or sent by the conference terminal. By receiving the user attribute information, the MCU generates, when performing speech synthesis, synthesized speech information matching the user attribute information, so that listeners at the sites feel the speech is real and the sense of presence in the exchange is enhanced.
The MCU receives user attribute information input by the user or sent by the conference terminal.
Specifically, before the multi-party conference begins, a deaf participant can input his or her own attribute information into the MCU, the user attribute information comprising sex, age, nationality and the like; alternatively, the participant can input the attribute information into the conference terminal of his or her site, which sends it to the MCU.
According to the user attribute information, the MCU performs speech synthesis on the converted speech information and generates synthesized speech information matching the user attribute information.
Specifically, after converting the sign language information into speech information, the MCU obtains the user attribute information corresponding to the sign language information and, according to that attribute information, performs speech synthesis on the converted speech information to generate matching synthesized speech information. For example, if the sign language information captured by a conference terminal was made by a middle-aged Chinese man, then after converting it into speech information the MCU obtains the corresponding user attribute information, performs speech synthesis according to it and generates matching synthesized speech information; at the same time the MCU also adjusts the speaking rate, tone and so on of the synthesized speech information according to the speed of the signing, so that hearing participants at the other sites feel the speech is more real when listening, enhancing the sense of presence in the exchange.
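Attribute-matched synthesis can be sketched as choosing the closest voice profile before handing text to a TTS engine. The scoring weights, the profile fields and the voice inventory below are all hypothetical; the patent only states that the synthesized voice should match the user's sex, age, nationality and similar attributes.

```python
def pick_voice(attrs, voices):
    """Score each candidate voice profile against the user's attribute
    information and return the best match. A real MCU would pass the
    chosen profile to a TTS engine."""
    def score(v):
        s = 0.0
        if v["sex"] == attrs["sex"]:
            s += 2.0
        if v["nationality"] == attrs["nationality"]:
            s += 2.0
        s -= abs(v["age"] - attrs["age"]) / 100.0  # mild age-distance penalty
        return s
    return max(voices, key=score)

# Hypothetical voice inventory and a middle-aged Chinese male signer.
voices = [
    {"id": "cn_male_40", "sex": "male", "age": 40, "nationality": "CN"},
    {"id": "cn_female_30", "sex": "female", "age": 30, "nationality": "CN"},
    {"id": "us_male_45", "sex": "male", "age": 45, "nationality": "US"},
]
chosen = pick_voice({"sex": "male", "age": 45, "nationality": "CN"}, voices)
```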
Optionally, after step 340, the embodiment of the invention further comprises a step of converting the other decoded audio information that participated in the audio mixing into sign language information and processing the converted sign language information. With this step, a deaf-mute participant can not only express his or her own intent but also clearly understand the intent expressed by the hearing participants, so that free and effective communication between deaf-mute and hearing participants is better achieved.

The MCU converts the other decoded audio information that participated in the audio mixing into sign language information.

Specifically, the MCU converts into sign language information only the other decoded audio information that participated in the mixing; decoded audio information that did not participate in the mixing is not converted.

The MCU scales the converted sign language information by preset ratios according to the calculated energy values of the other decoded audio information.

Specifically, the MCU scales the converted sign language information by preset ratios according to the energy-value ordering of the other decoded audio information obtained in step 330. Further, the image corresponding to sign language information converted from audio with a larger energy value is enlarged, and the image corresponding to sign language information converted from audio with a smaller energy value is reduced.
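The energy-ranked scaling can be sketched as follows. The ratio table is an assumption for illustration; the patent only says the ratios are preset and that larger-energy sites get larger images.

```python
def scale_factors(energies, ratios=(1.5, 1.0, 0.6)):
    """Assign a hypothetical scale ratio to each converted sign-language image.

    energies: per-site audio energy values (e.g. sum of squared samples).
    ratios: preset scale ratios, largest-energy site first; sites beyond the
    table reuse the smallest ratio. Returns one ratio per site, input order.
    """
    # Rank sites by energy, largest first, as in step 330's ordering.
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    out = [0.0] * len(energies)
    for rank, idx in enumerate(order):
        out[idx] = ratios[min(rank, len(ratios) - 1)]
    return out
```

Each returned ratio would then be applied to the rendered sign-language image for that site before superimposition.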
Using an overlay mode or a multi-picture mode, the MCU superimposes the scaled sign language information onto the current image in the conference terminals, for display at the at least two sites.

Specifically, the MCU uses either an overlay mode or a multi-picture mode to superimpose the scaled sign language information onto the current image in the conference terminals, so that it can be displayed at the at least two sites.

The overlay mode superimposes multiple items of sign language information directly on the current image in the conference terminal and displays the result on the terminal's screen; this mode may occlude part of the current image, as shown in Fig. 5. The multi-picture mode displays the multiple items of sign language information alongside the current image on the terminal's screen, so the current image is not occluded, as shown in Fig. 6.

For example, in the overlay mode of Fig. 5, the image for each site is obtained by converting that site's audio information. Because the audio information of site 2 has the largest energy value, its image is enlarged after conversion to sign language information; because the audio information of sites 3 and 4 has the smallest energy values, their images are reduced after conversion. The multi-picture mode of Fig. 6 works in the same way and is not repeated here.
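One way to realize a non-occluding multi-picture layout like Fig. 6 is a simple near-square tiling of the screen. This is a sketch under assumptions (integer pixel grid, equal-sized windows); the patent does not specify the layout geometry.

```python
import math

def multi_picture_layout(n, width, height):
    """Compute sub-window rectangles for a hypothetical multi-picture mode.

    n images (the current image plus the converted sign-language images)
    are tiled on a screen of the given size so that, unlike the overlay
    mode of Fig. 5, no window occludes another. Returns (x, y, w, h) tuples.
    """
    cols = math.ceil(math.sqrt(n))      # near-square grid
    rows = math.ceil(n / cols)
    w, h = width // cols, height // rows
    return [((i % cols) * w, (i // cols) * h, w, h) for i in range(n)]
```

The overlay mode of Fig. 5, by contrast, would keep the current image full-screen and draw the sign-language windows on top of it at their scaled sizes.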
By enlarging or reducing the converted sign language information as described above, a deaf-mute participant watching the intent expressed by the hearing participants can selectively watch several sites at once and, guided by the audio energy of each hearing participant, focus on the most active speaker, so that free and effective communication between deaf-mute and hearing participants is better achieved.

Preferably, after step 340, the embodiment of the invention further comprises a step of converting the sign language information into text information. This step assists communication in the multi-party conference; in particular, when the sign language information is not being played on a site's main screen, the text information can still support the exchange.

The MCU converts the sign language information into text information and superimposes the text information on the current image in the conference terminals, for display at the at least two sites.

Specifically, the MCU converts the sign language information into text information and then superimposes the text information on the current image in the conference terminals. The sign language information is converted into subtitle-form text, so that when a deaf-mute participant's signing is not being played on a site's main screen, the subtitles can still assist the exchange.
To make the objectives, technical solutions, and advantages of the present invention clearer, the specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The information processing method provided by the embodiment of the invention is illustrated below by way of a specific example.

For example, in a multi-party conference, the deaf-mute participants at sites 1 and 2 express themselves in sign language, while the hearing participants at sites 3 and 4 express themselves by speech. The conference terminals at sites 1 and 2 capture the deaf-mute participants' sign language information, encode it, and send it to the MCU; the terminals at sites 3 and 4 pick up the hearing participants' speech, encode it, and send it to the MCU. After decoding, the MCU obtains the first sign language information of site 1 and the second sign language information of site 2, together with the first audio information of site 3 and the second audio information of site 4. The MCU records the first time value at which each item of sign language information was obtained, and converts the sign language information into voice information: the first sign language information into first voice information, and the second sign language information into second voice information.

The MCU determines that the number of sites sending sign language information is 2 and compares it with the first threshold, which is preset here to 4, meaning that the MCU can bear at most 4 sites in the audio mixing. Since the number of sites sending sign language information does not exceed the first threshold, the MCU performs speech synthesis on the first voice information and the second voice information according to the previously received user attribute information, generating first synthesized speech information and second synthesized speech information matching those attributes.

The MCU sorts the first time values in descending order. In this embodiment the sign language information from site 1 was sent before that from site 2, so the MCU amplifies the first and second synthesized speech information by preset gain coefficients, for example a gain coefficient of 1 for the first synthesized speech information and 0.8 for the second. The MCU also calculates the energy values of the first and second audio information and sorts them in descending order; in this embodiment the energy value of the first audio information is greater than that of the second, so the MCU amplifies the two by their gain coefficients, for example amplifying the first audio information by a gain of 0.8 and the second by 0.6. The MCU then mixes the gain-adjusted synthesized speech information and audio information using the mixing mode.
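The gain-then-mix step of this worked example can be sketched directly. The function below is an illustration under assumptions (mono float frames of equal length, earlier signer receiving the larger preset gain); the gain tables (1.0, 0.8) and (0.8, 0.6) are the example's own values.

```python
def mix_conference_audio(sign_times, synth_frames, audio_frames):
    """Sketch of the gain-and-mix step from the patent's worked example.

    sign_times: first time value per synthesized-speech stream.
    synth_frames / audio_frames: lists of equal-length sample lists.
    """
    SYNTH_GAINS = (1.0, 0.8)   # preset gains by send-time rank (example values)
    AUDIO_GAINS = (0.8, 0.6)   # preset gains by energy rank (example values)

    def energy(frame):
        return sum(s * s for s in frame)

    # Rank synthesized speech by send time (earliest first, per the example).
    t_order = sorted(range(len(synth_frames)), key=lambda i: sign_times[i])
    # Rank natural audio by energy value (largest first).
    e_order = sorted(range(len(audio_frames)),
                     key=lambda i: energy(audio_frames[i]), reverse=True)

    n = len(synth_frames[0])
    mixed = [0.0] * n
    for rank, idx in enumerate(t_order):
        g = SYNTH_GAINS[min(rank, len(SYNTH_GAINS) - 1)]
        for k in range(n):
            mixed[k] += g * synth_frames[idx][k]
    for rank, idx in enumerate(e_order):
        g = AUDIO_GAINS[min(rank, len(AUDIO_GAINS) - 1)]
        for k in range(n):
            mixed[k] += g * audio_frames[idx][k]
    return mixed
```

A production mixer would also clip or normalize the sum; that detail is omitted here.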
After the mixing is complete, the MCU sends the mixed audio information to the at least two sites.

At the same time, the MCU also converts the first audio information and the second audio information that participated in the mixing into converted first sign language information and converted second sign language information. According to the energy-value ordering of the first and second audio information, the MCU scales the converted sign language information by preset ratios; for example, in this embodiment the energy value of the first audio information is larger, so the converted first sign language information is enlarged and the converted second sign language information is reduced. Using the overlay mode or the multi-picture mode, the MCU superimposes the scaled first and second sign language information on the current image in the conference terminals, for display at the at least two sites.

Further, the MCU may also convert the first and second sign language information into first and second text information, and after conversion superimpose the first and second text information on the current image in the conference terminals. The first and second sign language information are converted into subtitle-form first and second text information, so that when a deaf-mute participant's first or second sign language information is not being played on a site's main screen, the subtitles can assist the exchange.
Correspondingly, Embodiment 2 of the invention also provides an apparatus for processing audio and video information in a video conference, implementing the processing method of Embodiment 1. As shown in Fig. 7, the apparatus comprises: a decoding unit 710, a conversion-and-synthesis unit 720, an audio mixing unit 730, and a sending unit 740.

The decoding unit 710 is configured to receive the data streams sent by at least two conference terminals, decode the data streams, and obtain at least two channels of decoded information;

the conversion-and-synthesis unit 720 is configured to, when it is determined that sign language information exists in the at least two channels of decoded information obtained by the decoding unit, convert the sign language information into voice information and perform speech synthesis on the converted voice information to generate synthesized speech information;

the audio mixing unit 730 is configured to mix the synthesized speech information generated by the conversion-and-synthesis unit with the other decoded audio information;

the sending unit 740 is configured to send the mixed audio information from the audio mixing unit to the at least two sites.

The decoding unit 710 is further configured to receive user attribute information entered by a user or sent by the conference terminal;

the conversion-and-synthesis unit 720 is specifically configured to perform speech synthesis on the converted voice information according to the user attribute information, generating synthesized speech information matching the user attribute information.
The apparatus further comprises a judging unit 750, configured to judge whether the number of sites whose conference terminals send sign language information exceeds a first threshold, and to send the judgment result to the conversion-and-synthesis unit;

the conversion-and-synthesis unit 720 is specifically configured to, upon receiving a judgment result from the judging unit that the number of sites does not exceed the first threshold, perform speech synthesis on the converted voice information to generate synthesized speech information;

the conversion-and-synthesis unit 720 is specifically configured to, upon receiving a judgment result from the judging unit that the number of sites exceeds the first threshold, perform speech synthesis only on the converted voice information from sites within the first threshold, generating synthesized speech information.

The conversion-and-synthesis unit 720 is further configured to record the first time value of each item of sign language information obtained after decoding;

the audio mixing unit 730 is specifically configured to: sort the first time values in descending order;

amplify the generated synthesized speech information by preset gain coefficients according to the ordering of the first time values;

calculate the energy values of the other decoded audio information, sort the energy values in descending order, and amplify the other decoded audio information by its gain coefficients;

mix the gain-adjusted synthesized speech information with the other decoded audio information, and send the mixed audio information to the at least two sites.
The apparatus further comprises a sign language conversion unit 760, configured to convert the other decoded audio information that participated in the mixing into sign language information;

a scaling unit 770, configured to scale the converted sign language information by preset ratios according to the calculated energy values of the other decoded audio information;

and a superimposing unit 780, configured to superimpose the scaled sign language information on the current image in the conference terminals, for display at the at least two sites.

The apparatus further comprises a text conversion unit 790, configured to convert the sign language information into text information and superimpose the text information on the current image in the conference terminals, for display at the at least two sites.

With the apparatus for processing audio and video information in a video conference provided by the embodiment of the invention, after the apparatus decodes the data streams sent by the conference terminals, when the decoded data information is sign language information, the apparatus converts the sign language information into voice information and processes the converted voice information to generate synthesized speech information; mixes the generated synthesized speech information with the other decoded audio information; and sends the mixed audio information to the at least two sites, thereby solving the prior-art problem that deaf-mute participants cannot communicate freely and effectively when taking part in a multi-party remote video conference.
In addition, the apparatus for processing audio and video information in a video conference provided by Embodiment 2 of the invention may also adopt the following implementation to realize the processing method of Embodiment 1. As shown in Fig. 8, the apparatus comprises: a network interface 810, a processor 820, and a memory 830. A system bus 840 interconnects the network interface 810, the processor 820, and the memory 830.

The network interface 810 is used to communicate with the conference terminals.

The memory 830 may be permanent storage, for example a hard disk drive or flash memory, and holds software modules and device drivers. The software modules carry out the functional modules of the above method of the invention; the device drivers may be network and interface drivers.

At startup, these software components are loaded into the memory 830 and are then accessed and executed by the processor 820 to carry out the following instructions:

receive the data streams sent by at least two conference terminals, decode the data streams, and obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, convert the sign language information into voice information, and perform speech synthesis on the converted voice information to generate synthesized speech information;

mix the generated synthesized speech information with the other decoded audio information;

send the mixed audio information to the at least two sites.
Further, after accessing the software components in the memory 830, and before the instruction to receive the data streams sent by at least two conference terminals, decode them, and obtain at least two channels of decoded information, the processor executes instructions for the following process:

receive user attribute information entered by a user or sent by the conference terminal;

the performing of speech synthesis on the converted voice information to generate synthesized speech information is then specifically:

performing speech synthesis on the converted voice information according to the user attribute information, generating synthesized speech information matching the user attribute information.

Further, after accessing the software components in the memory 830, the processor executes the following specific instructions for performing speech synthesis on the converted voice information to generate synthesized speech information:

judge whether the number of sites whose conference terminals send the sign language information exceeds a first threshold;

if it does not exceed the first threshold, perform speech synthesis on the converted voice information to generate synthesized speech information;

if it exceeds the first threshold, perform speech synthesis only on the converted voice information from sites within the first threshold, generating synthesized speech information.
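The first-threshold check above can be sketched as a small selection routine. This is a hypothetical illustration; the patent does not specify which sites are kept when the threshold is exceeded, so the sketch assumes the earliest-received sites are retained (Python dicts preserve insertion order).

```python
def select_for_synthesis(site_voice_info, first_threshold=4):
    """Apply the first-threshold rule to converted voice information.

    site_voice_info: dict mapping site id -> converted voice information,
    in the order the sign language information was received (an assumption).
    Returns the subset of entries that undergo speech synthesis.
    """
    items = list(site_voice_info.items())
    if len(items) <= first_threshold:
        return dict(items)                    # synthesize for every site
    return dict(items[:first_threshold])      # only sites within the threshold
```

In the worked example above, 2 sending sites against a threshold of 4 means every site's voice information is synthesized.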
Further, after accessing the software components in the memory 830, and before the instruction to convert the sign language information into voice information, the processor executes instructions for the following process:

record the first time value of each item of sign language information obtained after decoding;

the mixing of the generated synthesized speech information with the other decoded audio information then specifically comprises the instructions:

sort the first time values in descending order;

amplify the generated synthesized speech information by preset gain coefficients according to the ordering of the first time values;

calculate the energy values of the other decoded audio information, sort the energy values in descending order, and amplify the other decoded audio information by its gain coefficients;

mix the gain-adjusted synthesized speech information with the other decoded audio information, and send the mixed audio information to the at least two sites.
Further, after accessing the software components in the memory 830, the processor executes instructions for the following process:

convert the other decoded audio information that participated in the mixing into sign language information;

scale the converted sign language information by preset ratios according to the energy-value ordering of the other decoded audio information;

superimpose the scaled sign language information on the current image in the conference terminals, for display at the at least two sites.

Further, after accessing the software components in the memory 830, the processor executes instructions for the following process:

convert the sign language information into text information, and superimpose the text information on the current image in the conference terminals, for display at the at least two sites.

With the apparatus for processing audio and video information in a video conference provided by the embodiment of the invention, after the apparatus decodes the data streams sent by the conference terminals, when the decoded data information is sign language information, the apparatus converts the sign language information into voice information and processes the converted voice information to generate synthesized speech information; mixes the generated synthesized speech information with the other decoded audio information; and sends the mixed audio information to the at least two sites, thereby solving the prior-art problem that deaf-mute participants cannot communicate freely and effectively when taking part in a multi-party remote video conference.
Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above embodiments further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (13)

1. A method for processing audio and video information in a video conference, wherein the video conference comprises at least two sites and each site comprises at least one conference terminal, the method comprising:

receiving data streams sent by at least two conference terminals, and decoding the data streams to obtain at least two channels of decoded information;

when it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis on the converted voice information to generate synthesized speech information;

mixing the generated synthesized speech information with other decoded audio information;

sending the mixed audio information to the at least two sites.

2. The method for processing audio and video information in a video conference according to claim 1, wherein before the receiving of the data streams sent by the at least two conference terminals and the decoding of the data streams to obtain at least two channels of decoded information, the method further comprises:

receiving user attribute information entered by a user or sent by the conference terminal;

wherein the performing of speech synthesis on the converted voice information to generate synthesized speech information is specifically:

performing speech synthesis on the converted voice information according to the user attribute information, to generate synthesized speech information matching the user attribute information.
3. The method for processing audio and video information in a video conference according to claim 1, wherein the performing of speech synthesis on the converted voice information to generate synthesized speech information is specifically:

judging whether the number of sites whose conference terminals send the sign language information exceeds a first threshold;

if it does not exceed the first threshold, performing speech synthesis on the converted voice information to generate synthesized speech information;

if it exceeds the first threshold, performing speech synthesis only on the converted voice information from sites within the first threshold, to generate synthesized speech information.

4. The method for processing audio and video information in a video conference according to claim 1, wherein before the converting of the sign language information into voice information the method further comprises: recording a first time value of each item of sign language information obtained after decoding;

wherein the mixing of the generated synthesized speech information with the other decoded audio information is specifically:

sorting the first time values in descending order;

amplifying the generated synthesized speech information by preset gain coefficients according to the ordering of the first time values;

calculating energy values of the other decoded audio information, sorting the energy values in descending order, and amplifying the other decoded audio information by its gain coefficients;

mixing the gain-adjusted synthesized speech information with the other decoded audio information, and sending the mixed audio information to the at least two sites.
5. The method for processing audio and video information in a video conference according to claim 4, further comprising:

converting the other decoded audio information that participated in the mixing into sign language information;

scaling the converted sign language information by preset ratios according to the energy-value ordering of the other decoded audio information;

superimposing the scaled sign language information on a current image in the conference terminals, for display at the at least two sites.

6. The method for processing audio and video information in a video conference according to claim 1, further comprising:

converting the sign language information into text information, and superimposing the text information on a current image in the conference terminals, for display at the at least two sites.
7. An apparatus for processing audio and video information in a video conference, comprising:

a decoding unit, configured to receive data streams sent by at least two conference terminals and decode the data streams to obtain at least two channels of decoded information;

a conversion-and-synthesis unit, configured to, when it is determined that sign language information exists in the at least two channels of decoded information obtained by the decoding unit, convert the sign language information into voice information and perform speech synthesis on the converted voice information to generate synthesized speech information;

an audio mixing unit, configured to mix the synthesized speech information generated by the conversion-and-synthesis unit with other decoded audio information;

a sending unit, configured to send the mixed audio information from the audio mixing unit to the at least two sites.

8. The apparatus for processing audio and video information in a video conference according to claim 7, wherein the decoding unit is further configured to receive user attribute information entered by a user or sent by the conference terminal;

and the conversion-and-synthesis unit is specifically configured to perform speech synthesis on the converted voice information according to the user attribute information, to generate synthesized speech information matching the user attribute information.
9. The apparatus for processing audio and video information in a video conference according to claim 7, wherein the apparatus further comprises:
a judging unit, configured to judge whether the number of meeting places whose conference terminals send the sign language information exceeds a first threshold, and send the judgment result to the conversion and synthesis unit;
wherein the conversion and synthesis unit is specifically configured to: when the judgment result received from the judging unit indicates that the number of meeting places does not exceed the first threshold, perform speech synthesis processing on the converted voice information and generate synthesized voice information;
and when the judgment result received from the judging unit indicates that the number of meeting places exceeds the first threshold, perform speech synthesis processing only on the converted voice information from a number of meeting places not exceeding the first threshold, and generate synthesized voice information.
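The threshold rule of claim 9 reduces to a simple selection step. In this sketch the selection order is plain list order, which is an assumption; the claim does not say which meeting places are kept when the threshold is exceeded.

```python
# Sketch of the claim-9 threshold rule: if more meeting places are sending
# sign language than FIRST_THRESHOLD allows, only a FIRST_THRESHOLD-sized
# subset of them has its converted voice synthesized.

def select_for_synthesis(sign_sources, first_threshold):
    if len(sign_sources) <= first_threshold:
        # Below the threshold: synthesize speech for every signing source.
        return list(sign_sources)
    # Above the threshold: keep only the first first_threshold sources
    # (list order is an illustrative choice, not part of the claim).
    return list(sign_sources)[:first_threshold]

print(select_for_synthesis(["A", "B"], 3))            # ['A', 'B']
print(select_for_synthesis(["A", "B", "C", "D"], 3))  # ['A', 'B', 'C']
```

The threshold bounds how many simultaneous sign-to-speech conversions feed the mixer, which keeps the mixed output intelligible when many meeting places sign at once.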
10. The apparatus for processing audio and video information in a video conference according to claim 7, wherein the conversion and synthesis unit is further configured to record the first time value carried in each channel of the sign language information obtained after decoding;
and the audio mixing unit is specifically configured to:
sort the first time values in descending order;
apply gain amplification to the generated synthesized voice information using preset gain coefficients, according to the ordering of the first time values;
calculate energy values of the other decoded audio information, sort the energy values in descending order, and apply gain amplification to the other decoded audio information using the corresponding gain coefficients;
and mix the gain-processed synthesized voice information with the other decoded audio information, and send the mixed audio information to the at least two meeting places.
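The gain strategy of claim 10 — rank channels, then amplify by rank — can be sketched numerically. The gain coefficients, the two-sample "signals", and the sum-based mixer are all illustrative assumptions; only the descending-order ranking by time value (for synthesized speech) and by energy (for other audio) comes from the claim.

```python
# Sketch of the claim-10 gain strategy: synthesized speech channels are
# ranked by their recorded first time values, other audio channels by
# energy, and each rank gets a preset gain coefficient (values made up).

GAINS = [2.0, 1.5, 1.2, 1.0]  # illustrative per-rank gain coefficients

def rank_gains(keyed_channels):
    # keyed_channels: list of (sort_key, samples); larger key -> larger gain.
    ranked = sorted(keyed_channels, key=lambda kc: kc[0], reverse=True)
    out = []
    for rank, (_, samples) in enumerate(ranked):
        g = GAINS[min(rank, len(GAINS) - 1)]
        out.append([s * g for s in samples])
    return out

def energy(samples):
    # Simple energy measure: sum of squared samples.
    return sum(s * s for s in samples)

def mix(channels):
    # Mixing modeled as a per-sample sum across channels.
    return [sum(col) for col in zip(*channels)]

speech = [(1002, [1.0, 1.0]), (1001, [1.0, 2.0])]  # (time value, samples)
others = [[0.5, 0.5], [2.0, 0.0]]
gained = rank_gains(speech) + rank_gains([(energy(c), c) for c in others])
print(mix(gained))  # [8.25, 5.75]
```

Ranking by the recorded time values lets more recently signed speech stand out in the mix, while energy ranking preserves the usual loudest-speaker emphasis for ordinary audio.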
11. The apparatus for processing audio and video information in a video conference according to claim 10, wherein the apparatus further comprises:
a sign language conversion unit, configured to convert the other decoded audio information participating in the audio mixing into sign language information;
a scaling unit, configured to scale the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;
and a superimposing unit, configured to superimpose the scaled sign language information onto the current image of the conference terminal, so that it is displayed in at least two meeting places.
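The scaling rule of claim 11 can be sketched as a size computation: the sign-language picture rendered for each audio channel is scaled according to that channel's energy, so louder speakers get a larger overlay. The base size and the particular ratio formula below are invented for illustration; the claim only fixes that scaling follows the energy values by a preset ratio.

```python
# Sketch of the claim-11 scaling rule: overlay size grows with a channel's
# share of the total audio energy. Base size and formula are made up.

def overlay_sizes(energies, base=(160, 120)):
    total = sum(energies) or 1.0  # avoid division by zero
    sizes = []
    for e in energies:
        ratio = e / total
        # Each picture is scaled between 0.5x and 1.5x of the base size.
        scale = 0.5 + ratio
        sizes.append((round(base[0] * scale), round(base[1] * scale)))
    return sizes

print(overlay_sizes([3.0, 1.0]))  # [(200, 150), (120, 90)]
```

The superimposing unit would then composite each scaled sign-language picture onto the current conference image before it is sent to the meeting places.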
12. The apparatus for processing audio and video information in a video conference according to claim 7, wherein the apparatus further comprises:
a text conversion unit, configured to convert the sign language information into text information, and superimpose the text information onto the current image of the conference terminal, so that it is displayed in at least two meeting places.
13. A video conferencing system, wherein the system comprises at least two meeting places, each meeting place comprising at least one conference terminal, and an apparatus for processing audio and video information in a video conference according to any one of claims 8 to 14.
CN201210560387.7A 2012-12-21 2012-12-21 Method, device and system for processing video/audio information in video conference Active CN102984496B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, device and system for processing video/audio information in video conference
PCT/CN2013/083170 WO2014094461A1 (en) 2012-12-21 2013-09-10 Method, device and system for processing video/audio information in video conference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, device and system for processing video/audio information in video conference

Publications (2)

Publication Number Publication Date
CN102984496A true CN102984496A (en) 2013-03-20
CN102984496B CN102984496B (en) 2015-08-19

Family

ID=47858189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210560387.7A Active CN102984496B (en) 2012-12-21 2012-12-21 Method, device and system for processing video/audio information in video conference

Country Status (2)

Country Link
CN (1) CN102984496B (en)
WO (1) WO2014094461A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101179693A (en) * 2007-09-26 2008-05-14 深圳市丽视视讯科技有限公司 Audio mixing processing method for a videoconferencing system
CN101453611A (en) * 2007-12-07 2009-06-10 希姆通信息技术(上海)有限公司 Method for video communication between deaf-mute and hearing users
CN101816191A (en) * 2007-09-26 2010-08-25 弗劳恩霍夫应用研究促进协会 Apparatus and method for obtaining weighting coefficients for extracting an ambient signal, and apparatus, method and computer program for extracting an ambient signal
US20110187814A1 (en) * 2010-02-01 2011-08-04 Polycom, Inc. Automatic Audio Priority Designation During Conference
CN102246225A (en) * 2008-12-15 2011-11-16 皇家飞利浦电子股份有限公司 Method and apparatus for synthesizing speech

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN101309390B (en) * 2007-05-17 2012-05-23 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN101115088A (en) * 2007-08-07 2008-01-30 周运南 Mobile phone dedicated for deaf-mutes
CN101594434A (en) * 2009-06-16 2009-12-02 中兴通讯股份有限公司 Sign language processing method for a mobile terminal and sign language processing mobile terminal
CN102387338B (en) * 2010-09-03 2017-04-12 中兴通讯股份有限公司 Distributed video processing method and video session system
CN102984496B (en) * 2012-12-21 2015-08-19 华为技术有限公司 Method, device and system for processing video/audio information in video conference

Cited By (17)

Publication number Priority date Publication date Assignee Title
WO2014094461A1 (en) * 2012-12-21 2014-06-26 华为技术有限公司 Method, device and system for processing video/audio information in video conference
CN103686059B (en) * 2013-09-23 2017-04-05 广东威创视讯科技股份有限公司 Distributed mixed audio processing method and system
CN103686059A (en) * 2013-09-23 2014-03-26 广东威创视讯科技股份有限公司 Distributed audio mixing processing method and system
WO2018001088A1 (en) * 2016-06-30 2018-01-04 中兴通讯股份有限公司 Method and apparatus for presenting communication information, device and set-top box
CN107566863A (en) * 2016-06-30 2018-01-09 中兴通讯股份有限公司 Interaction information presentation method, apparatus, device, and set-top box
CN108881783B (en) * 2017-05-09 2020-09-08 腾讯科技(深圳)有限公司 Method and device for realizing multi-person conversation, computer equipment and storage medium
CN108881783A (en) * 2017-05-09 2018-11-23 腾讯科技(深圳)有限公司 Method and apparatus for implementing multi-person conversation, computer device and storage medium
CN109862302A (en) * 2017-11-29 2019-06-07 奥多比公司 Accessible audio switching for client devices in an online meeting
CN109862302B (en) * 2017-11-29 2022-05-31 奥多比公司 Method and system for switching accessible audio of client equipment in online conference
CN110401623A (en) * 2018-04-25 2019-11-01 中国移动通信有限公司研究院 Multi-party calling method, platform, terminal, medium, device and system
CN110555329A (en) * 2018-05-31 2019-12-10 苏州欧力机器人有限公司 Sign language translation method, terminal and storage medium
CN110083250A (en) * 2019-05-14 2019-08-02 长沙手之声信息科技有限公司 Barrier-free conference system supporting online sign language translation
CN115066908A (en) * 2019-12-09 2022-09-16 金京喆 User terminal and control method thereof
CN115066907A (en) * 2019-12-09 2022-09-16 金京喆 User terminal, broadcasting apparatus, broadcasting system including the same, and control method thereof
CN113660449A (en) * 2021-10-20 2021-11-16 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device
CN113660449B (en) * 2021-10-20 2022-03-01 中兴通讯股份有限公司 Gesture communication method and device, storage medium and electronic device
WO2023066023A1 (en) * 2021-10-20 2023-04-27 中兴通讯股份有限公司 Gesture-based communication method and apparatus, storage medium, and electronic apparatus

Also Published As

Publication number Publication date
WO2014094461A1 (en) 2014-06-26
CN102984496B (en) 2015-08-19

Similar Documents

Publication Publication Date Title
CN102984496B (en) Method, device and system for processing video/audio information in video conference
CN104038722B (en) Content interaction method and system for a video conference
CN102460487B (en) The system and method for mixing course teaching
EP2566194A1 (en) Method and device for processing audio in video communication
CN102572356A (en) Conference recording method and conference system
CN101729850A (en) Video communication method based on handwriting characteristic data flow and processing system thereof
US9584761B2 (en) Videoconference terminal, secondary-stream data accessing method, and computer storage medium
CN108235052A (en) iOS-based method for selectable multi-channel hardware audio mixing, capture and playback
CN203588489U (en) A situational teaching device
Shivappa et al. Efficient, compelling, and immersive vr audio experience using scene based audio/higher order ambisonics
JP2008228262A (en) Virtual camera system and instant communication method
CN104135484A (en) Embedded system capable of integrating interactive white board and video session as well as calling method
CN206402344U (en) Video conference system based on 3D scenes
CN113012500A (en) Remote teaching system
CN105450970A (en) Information processing method and electronic equipment
CN103294193A (en) Multi-terminal interaction method, device and system
CN216017005U (en) Intelligent all-digital conference system
CN112071132B (en) Audio and video teaching equipment and intelligent teaching system
CN115079413A (en) VR image display system and display method for museum exhibition
CN210804824U (en) Remote interactive teaching system with synchronous blackboard writing and live broadcasting functions
US20200184973A1 (en) Transcription of communications
CN113766165A (en) Interactive mode, device, terminal and storage medium for realizing barrier-free video chat
CN207099275U (en) Multi-screen interactive device for chat
CN113012501B (en) Remote teaching method
KR102546532B1 (en) Method for providing speech video and computing device for executing the method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant