CN102984496B - Method, apparatus and system for processing audiovisual information in a video conference

Info

Publication number
CN102984496B
Authority
CN
China
Application number
CN201210560387.7A
Other languages
Chinese (zh)
Other versions
CN102984496A (en
Inventor
倪为
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN201210560387.7A
Publication of CN102984496A
Application granted
Publication of CN102984496B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Abstract

Embodiments of the present invention relate to a method, apparatus and system for processing audiovisual information in a video conference. The method comprises: receiving data bitstreams sent by at least two conference terminals, and decoding the data bitstreams to obtain at least two channels of decoded information; when it is determined that sign language information is present in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis on the converted voice information to generate synthesized speech information; mixing the generated synthesized speech information with the other decoded audio information; and sending the mixed audio information to the at least two conference sites.

Description

Method, apparatus and system for processing audiovisual information in a video conference

Technical field

The present invention relates to the field of communication technologies, and in particular to a method, apparatus and system for processing audiovisual information in a video conference.

Background

With the development and progress of society, deaf-mute persons, as a disadvantaged group, are receiving more and more attention and care. In daily life and at work, communication between deaf-mute persons and hearing persons is also becoming more frequent. With the spread of modern education, sign language has become a common means of communication among deaf-mute persons and between deaf-mute persons and hearing persons. However, sign language requires dedicated training and study; apart from those with a special need, relatively few hearing persons master it, which creates a communication barrier between hearing persons and deaf-mute persons.

At present, remote communication between deaf-mute persons, or between a deaf-mute person and a hearing person, is generally realized with dedicated equipment or systems, for example a terminal that converts between sign language, text and speech, or by having a third party interpret between sign language and speech. This addresses the problem that the parties cannot communicate face to face.

Fig. 1 is a schematic diagram of remote communication between deaf-mute persons. Deaf-mute person A expresses himself or herself in sign language. Video communication terminal A (for example a videophone, a video conference terminal or a desktop software terminal) captures the sign language image and transmits it over the communication network to video communication terminal B. Deaf-mute person B sees the sign language image of deaf-mute person A as presented by video communication terminal B and understands what the other party means. The same applies in the opposite direction, which completes the communication process.

Fig. 2 is a schematic diagram of remote communication between a deaf-mute person and a hearing person. The sign language image of the deaf-mute person is captured by video communication terminal A and distributed by a multipoint control unit to video communication terminals B and C, where the hearing person and the interpreter are located. The interpreter understands the sign language and translates it into speech, which is picked up by the interpreter's video communication terminal and distributed by the multipoint control unit to video communication terminal B at the hearing person's site. From the interpreter's speech, the hearing person understands what the deaf-mute person has expressed.

The speech of the hearing person is picked up by video communication terminal B and distributed by the multipoint control unit to video communication terminal C at the interpreter's site. The interpreter translates it into sign language, the sign language image is captured by the video communication terminal and distributed by the multipoint control unit to the deaf-mute person, and from the interpreter's sign language image the deaf-mute person understands what the hearing person has expressed.

As communication between deaf-mute persons and hearing persons grows, the prior art shows the following drawbacks: 1) whenever a deaf-mute person communicates with a hearing person, every meeting requires a third-party interpreter, which increases the labor cost of communication; 2) in a multi-party conference, if several deaf-mute persons sign at the same time or several hearing persons speak at the same time, the interpreter cannot handle the situation well enough to convey clearly what each speaker has expressed. The prior art is therefore limited and offers no solution to the problems that arise when deaf-mute persons take part in a multi-party conference.

Summary of the invention

The object of the present invention is to solve the problem in the prior art that deaf-mute persons cannot communicate freely and effectively when participating in a multi-party remote video conference, by providing a method, device and system for processing audiovisual information in a video conference.

In a first aspect, an embodiment of the present invention provides a method for processing audiovisual information in a video conference, where the video conference comprises at least two conference sites, each conference site comprises at least one conference terminal, and the method comprises:

receiving data bitstreams sent by at least two conference terminals, and decoding the data bitstreams to obtain at least two channels of decoded information;

when it is determined that sign language information is present in the at least two channels of decoded information, converting the sign language information into voice information, and performing speech synthesis on the converted voice information to generate synthesized speech information;

mixing the generated synthesized speech information with the other decoded audio information; and

sending the mixed audio information to the at least two conference sites.

In a first possible implementation, before receiving the data bitstreams sent by the at least two conference terminals and decoding the data bitstreams to obtain the at least two channels of decoded information, the method further comprises:

receiving user attribute information that is input by a user or sent by the conference terminal;

and performing speech synthesis on the converted voice information to generate synthesized speech information is specifically:

performing speech synthesis on the converted voice information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.

In a second possible implementation, performing speech synthesis on the converted voice information to generate synthesized speech information is specifically:

judging whether the number of conference sites in which the conference terminals sending the sign language information are located exceeds a first threshold;

if the first threshold is not exceeded, performing speech synthesis on the converted voice information to generate synthesized speech information;

if the first threshold is exceeded, performing speech synthesis only on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.

In a third possible implementation, before converting the sign language information into voice information, the method further comprises: recording a first time value of each piece of sign language information obtained after decoding;

and mixing the generated synthesized speech information with the other decoded audio information is specifically:

sorting the first time values in descending order;

according to the order of the first time values, applying gain amplification to the generated synthesized speech information using preset gain coefficients;

calculating the energy values of the other decoded audio information, and, according to the descending order of the energy values, applying gain amplification to the other decoded audio information using its gain coefficients;

mixing the gain-processed synthesized speech information with the gain-processed other decoded audio information, and sending the mixed audio information to the at least two conference sites.

With reference to the first aspect or the third possible implementation of the first aspect, in a fourth possible implementation, the method further comprises:

converting the other decoded audio information that participates in the mixing into sign language information;

scaling the converted sign language information by a preset ratio according to the order of the energy values of the other decoded audio information;

superimposing the scaled sign language information on the current image in the conference terminal, for display in the at least two conference sites.

In a fifth possible implementation, the method further comprises: converting the sign language information into text information, and superimposing the text information on the current image in the conference terminal, for display in the at least two conference sites.

In a second aspect, an embodiment of the present invention provides an apparatus for processing audiovisual information in a video conference, the apparatus comprising:

a decoding unit, configured to receive data bitstreams sent by at least two conference terminals and decode the data bitstreams to obtain at least two channels of decoded information;

a conversion and synthesis unit, configured to, when it is determined that sign language information is present in the at least two channels of decoded information obtained by the decoding unit, convert the sign language information into voice information and perform speech synthesis on the converted voice information to generate synthesized speech information;

a mixing unit, configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information;

a sending unit, configured to send the audio information mixed by the mixing unit to the at least two conference sites.

In a first possible implementation, the decoding unit is further configured to receive user attribute information that is input by a user or sent by the conference terminal;

the conversion and synthesis unit is specifically configured to perform speech synthesis on the converted voice information according to the user attribute information, to generate synthesized speech information that matches the user attribute information.

In a second possible implementation, the apparatus further comprises:

a judging unit, configured to judge whether the number of conference sites in which the conference terminals sending the sign language information are located exceeds a first threshold, and send the judgment result to the conversion and synthesis unit;

the conversion and synthesis unit is specifically configured to, on receiving from the judging unit a judgment result that the number of conference sites does not exceed the first threshold, perform speech synthesis on the converted voice information to generate synthesized speech information;

the conversion and synthesis unit is specifically configured to, on receiving from the judging unit a judgment result that the number of conference sites exceeds the first threshold, perform speech synthesis only on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.

In a third possible implementation, the conversion and synthesis unit is further configured to record a first time value of each piece of sign language information obtained after decoding;

the mixing unit is specifically configured to: sort the first time values in descending order;

according to the order of the first time values, apply gain amplification to the generated synthesized speech information using preset gain coefficients;

calculate the energy values of the other decoded audio information, and, according to the descending order of the energy values, apply gain amplification to the other decoded audio information using its gain coefficients;

mix the gain-processed synthesized speech information with the gain-processed other decoded audio information, and send the mixed audio information to the at least two conference sites.

With reference to the second aspect or the third possible implementation of the second aspect, in a fourth possible implementation, the apparatus further comprises:

a sign language conversion unit, configured to convert the other decoded audio information that participates in the mixing into sign language information;

a scaling unit, configured to scale the converted sign language information by a preset ratio according to the calculated energy values of the other decoded audio information;

a superimposing unit, configured to superimpose the scaled sign language information on the current image in the conference terminal, for display in the at least two conference sites.

In a fifth possible implementation, the apparatus further comprises: a text conversion unit, configured to convert the sign language information into text information and superimpose the text information on the current image in the conference terminal, for display in the at least two conference sites.

In a third aspect, an embodiment of the present invention provides a system for processing audiovisual information in a video conference, the system comprising: at least two conference sites, each comprising at least one conference terminal, and the apparatus for processing audiovisual information in a video conference according to any one of claims 8 to 14.

Therefore, with the method, device and system for processing audiovisual information in a video conference provided by the embodiments of the present invention, the multipoint control unit decodes the data bitstreams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to the at least two conference sites. This solves the problem in the prior art that deaf-mute persons cannot communicate freely and effectively when participating in a multi-party remote video conference.

Brief description of the drawings

Fig. 1 is a schematic diagram of remote communication between deaf-mute persons in the prior art;

Fig. 2 is a schematic diagram of remote communication between a deaf-mute person and a hearing person in the prior art;

Fig. 3 is a flowchart of the method for processing audiovisual information in a video conference according to Embodiment 1 of the present invention;

Fig. 4 is a schematic diagram of the system for processing audiovisual information in a video conference according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of image superimposition according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of image superimposition according to an embodiment of the present invention;

Fig. 7 is a structural diagram of the apparatus for processing audiovisual information in a video conference according to Embodiment 2 of the present invention;

Fig. 8 is a structural diagram of the apparatus for processing audiovisual information in a video conference according to Embodiment 3 of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The information processing method provided by the embodiments of the present invention is described below with reference to Fig. 3, which is a flowchart of the method for processing audiovisual information in a video conference according to Embodiment 1. In the embodiments of the present invention the executing entity is a multipoint control server; a multipoint control unit (Multipoint Control Unit, MCU) is used as the example below. As shown in Fig. 3, this embodiment comprises the following steps:

Step 310: receive data bitstreams sent by at least two conference terminals, and decode the data bitstreams to obtain at least two channels of decoded information.

Specifically, Fig. 4 is a schematic diagram of the system for processing audiovisual information in a video conference according to an embodiment of the present invention. In a multi-party conference as shown in Fig. 4, the conference terminals are located in the respective conference sites. The video conference comprises at least two conference sites (Fig. 4 shows four sites with one conference terminal each; it will be understood that practical applications are not limited to four sites), and each site comprises at least one conference terminal. A conference terminal captures and outputs the audio and video information of its site and picks up original user information, namely the user's sign language information, voice information and the like. The conference terminal encodes the original user information into a data bitstream and sends it to the MCU, which receives the data bitstreams sent by the conference terminals.

In the embodiments of the present invention, a conference terminal is a device that captures images, picks up sound and accepts external input, and that is responsible for sending received video images to a display for presentation and received audio information to a loudspeaker for playback, for example a video terminal.

After receiving the data bitstreams, the MCU decodes them to obtain at least two channels of decoded information. The decoded information includes the sign language information captured by the conference terminals, the audio information they pick up, and the like.

Step 320: when it is determined that sign language information is present in the at least two channels of decoded information, convert the sign language information into voice information, and perform speech synthesis on the converted voice information to generate synthesized speech information.

Specifically, after decoding, when the MCU determines that sign language information is present in the at least two channels of decoded information, the MCU converts the sign language information into voice information.

The MCU determines whether sign language information is present as follows: after decoding, the MCU reconstructs the decoded data. When the decoded data can be reconstructed into sign language information, the MCU converts the sign language information into voice information; when the decoded data can be reconstructed into audio information, the MCU mixes the audio information, or converts the audio information into sign language information, and so on.
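
The routing decision described above can be sketched as follows. This is only an illustrative sketch: the `DecodedStream` type, its `kind` tag and the `route` function are hypothetical names introduced here, and the text does not specify how the MCU actually recognizes that decoded data is sign language.

```python
from dataclasses import dataclass


@dataclass
class DecodedStream:
    site_id: str
    kind: str       # hypothetical tag set after decoding: "sign" or "audio"
    payload: bytes  # decoded image or audio data


def route(streams):
    """Split decoded streams into sign-language streams and audio streams."""
    routed = {"sign": [], "audio": []}
    for s in streams:
        if s.kind not in routed:
            raise ValueError(f"unknown stream kind: {s.kind!r}")
        routed[s.kind].append(s)
    return routed["sign"], routed["audio"]
```

Sign-language streams would then go to the sign-to-voice conversion, while audio streams go directly to mixing or to audio-to-sign conversion.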

Further, the sign language information is the gestures made by any deaf-mute person as captured by a conference terminal. When a deaf-mute person in a conference site wants to express an opinion, the person signs facing the conference terminal. The conference terminal captures the sign language information, encodes it and sends it to the MCU. After decoding, the MCU obtains the sign language information and converts it into voice information.

After converting the sign language information into voice information, the MCU performs speech synthesis on the converted voice information to generate synthesized speech information.

The MCU performs speech synthesis on the converted voice information to generate synthesized speech information specifically as follows:

The MCU judges whether the number of conference sites in which the conference terminals sending sign language information are located exceeds a first threshold. If the first threshold is not exceeded, the MCU performs speech synthesis on the converted voice information to generate synthesized speech information. If the first threshold is exceeded, the MCU performs speech synthesis only on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.

In the embodiments of the present invention, the first threshold is the maximum number of conference sites that the MCU can mix; in general, the maximum is four-party mixing.

Before converting the sign language information into voice information, the MCU also records a first time value of each piece of sign language information obtained after decoding. The MCU records the first time values so that, in the subsequent mixing, the voice information that participates in the mixing can be selected according to the recorded first time value of each piece of sign language information.
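
A minimal sketch of the threshold check follows. The text only states that at most the first threshold's worth of converted voice information is synthesized and that the first time values are used for selection; keeping the entries that come first in descending time order is an assumption made here, and `select_for_synthesis` is a hypothetical name.

```python
def select_for_synthesis(sign_streams, first_threshold):
    """Pick the sign-language streams whose converted voice will be synthesized.

    sign_streams: list of (site_id, first_time_value) tuples.
    Assumption: when there are too many streams, keep those that come first
    when the first time values are sorted in descending order.
    """
    if len(sign_streams) <= first_threshold:
        return list(sign_streams)
    ranked = sorted(sign_streams, key=lambda s: s[1], reverse=True)
    return ranked[:first_threshold]
```

With a first threshold of 4 (four-party mixing), a conference in which three sites sign at once passes through unchanged, while six signing sites are cut down to four.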

Step 330: mix the generated synthesized speech information with the other decoded audio information.

Specifically, the MCU mixes the generated synthesized speech information with the other decoded audio information. The purpose of the mixing is to ensure that every user in the multi-party conference receives a signal of satisfactory voice quality.

The MCU mixes the generated synthesized speech information with the other decoded audio information specifically as follows:

The MCU sorts the first time values from step 320 in descending order. According to the order of the first time values, the MCU applies gain amplification to the generated synthesized speech information using preset gain coefficients. It also calculates the energy values of the other decoded audio information and, according to the descending order of the energy values, applies gain amplification to the other decoded audio information using its gain coefficients. The MCU then mixes the gain-processed synthesized speech information with the gain-processed other decoded audio information.
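
The ranking, gain and mixing sequence above might look like the sketch below. Several details are assumptions rather than part of the described method: samples are floats in [-1, 1], energy is the sum of squared samples, mixing is sample-wise addition with hard clipping, and the function names and gain tables are hypothetical.

```python
def energy(samples):
    """Energy of a frame, taken here as the sum of squared samples."""
    return sum(x * x for x in samples)


def apply_ranked_gains(streams, keys, gains):
    """Rank streams by key in descending order and scale the i-th ranked
    stream by gains[i] (the last gain is reused if there are more streams)."""
    order = sorted(range(len(streams)), key=lambda i: keys[i], reverse=True)
    out = []
    for rank, i in enumerate(order):
        g = gains[min(rank, len(gains) - 1)]
        out.append([g * x for x in streams[i]])
    return out


def mix(streams, limit=1.0):
    """Sum the streams sample by sample and clip the result to [-limit, limit]."""
    length = max((len(s) for s in streams), default=0)
    mixed = []
    for n in range(length):
        total = sum(s[n] for s in streams if n < len(s))
        mixed.append(max(-limit, min(limit, total)))
    return mixed
```

A caller would rank the synthesized speech by its first time values, rank the other audio by `energy`, and pass the concatenation of both results to `mix`.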

In the embodiments of the present invention, the number of pieces of audio information that participate in the mixing does not exceed the first threshold.

Further, when the decoded information does not include audio information, the MCU mixes the multiple pieces of generated synthesized speech information, and the number of pieces of synthesized speech information that participate in the mixing does not exceed the first threshold.

Further, in the embodiments of the present invention, when the decoded information includes both sign language information and audio information, both deaf-mute persons and hearing persons are expressing themselves. The MCU then gives priority to converting the sign language information into voice information and generating synthesized speech information. When the first threshold is not exceeded, the synthesized speech information is mixed with the other decoded audio information. When the synthesized speech information exceeds the first threshold, only the synthesized speech information is mixed and the other decoded audio information is discarded. This ensures that the expression of deaf-mute persons is handled first and solves the communication problem between deaf-mute persons and hearing persons.
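
The priority rule can be illustrated with the sketch below. Filling the remaining slots with other audio when the synthesized speech alone stays within the first threshold is an assumption; the text only states that the total does not exceed the threshold and that other audio is discarded when synthesized speech alone exceeds it. `choose_mix_inputs` is a hypothetical name, and both input lists are assumed to be already ranked.

```python
def choose_mix_inputs(synth, other_audio, first_threshold):
    """Select mixing inputs, giving synthesized speech priority.

    synth and other_audio are assumed to be already ranked (by first time
    value and by energy respectively). At most first_threshold inputs are kept.
    """
    chosen_synth = synth[:first_threshold]
    remaining = first_threshold - len(chosen_synth)
    chosen_other = other_audio[:remaining]
    return chosen_synth, chosen_other
```

For example, with a threshold of 4, two synthesized streams leave room for two ordinary audio streams, whereas five synthesized streams fill all four slots and no ordinary audio is mixed.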

Step 340: send the mixed audio information to the at least two conference sites.

Specifically, after mixing the generated synthesized speech information with the other decoded audio information, the MCU sends the mixed audio information to the at least two conference sites, which include both the sites that sent data bitstreams and the sites that did not.

Therefore, with the method, device and system for processing audiovisual information in a video conference provided by the embodiments of the present invention, the multipoint control unit decodes the data bitstreams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to the at least two conference sites. This solves the problem in the prior art that deaf-mute persons cannot communicate freely and effectively when participating in a multi-party remote video conference.

Optionally, before step 310, the embodiment of the present invention further comprises a step in which the MCU receives user attribute information that is input by a user or sent by a conference terminal. By receiving the user attribute information, the MCU can generate synthesized speech information that matches the user attribute information during speech synthesis, so that listeners in the conference sites perceive the speech as more lifelike and the sense of presence in the communication is enhanced.

The MCU receives user attribute information that is input by a user or sent by the conference terminal.

Specifically, before the multi-party conference starts, a deaf-mute person may input his or her own attribute information into the MCU. The user attribute information includes sex, age, nationality and the like. Alternatively, the deaf-mute person may input the attribute information into the conference terminal of his or her site, and the conference terminal sends it to the MCU.

According to the user attribute information, the MCU performs speech synthesis on the converted voice information to generate synthesized speech information that matches the user attribute information.

Specifically, after converting the sign language information into voice information, the MCU obtains the user attribute information corresponding to the sign language information and, according to that user attribute information, performs speech synthesis on the converted voice information to generate synthesized speech information that matches it. For example, if the sign language information captured by a conference terminal was made by a middle-aged Chinese man, then after converting the sign language information into voice information the MCU obtains the user attribute information corresponding to this sign language information and synthesizes matching speech. At the same time, the MCU also adjusts the speaking rate, tone and the like of the synthesized speech information according to the speed of the signing, so that hearing persons in the other conference sites perceive the speech as more lifelike and the sense of presence in the communication is enhanced.
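
One way to picture the attribute matching is a lookup from user attributes to synthesis parameters, as sketched below. Everything specific in it is hypothetical: the `UserAttributes` and `VoiceProfile` types, the age bands, the nationality-to-language table and the clamping range for the speaking rate are illustrative assumptions, not values given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAttributes:
    sex: str          # e.g. "male" or "female"
    age: int
    nationality: str  # e.g. "CN"


@dataclass(frozen=True)
class VoiceProfile:
    voice: str     # hypothetical voice identifier for a speech synthesizer
    language: str
    rate: float    # relative speaking rate, 1.0 = normal


def build_voice_profile(attrs, signing_speed=1.0):
    """Derive synthesis parameters from user attributes and signing speed."""
    if attrs.age < 13:
        age_band = "child"
    elif attrs.age < 60:
        age_band = "adult"
    else:
        age_band = "senior"
    # Illustrative table only; unknown nationalities fall back to English.
    language = {"CN": "zh", "US": "en"}.get(attrs.nationality, "en")
    # Follow the signer's speed, but keep the rate in an intelligible range.
    rate = max(0.5, min(2.0, signing_speed))
    return VoiceProfile(voice=f"{attrs.sex}-{age_band}", language=language, rate=rate)
```

For the example in the text, a middle-aged Chinese man signing slightly faster than usual would map to an adult male Chinese voice with a slightly raised speaking rate.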

Optionally, after step 340, the embodiment of the present invention further includes the steps of converting the other decoded audio information that participates in the audio mixing into sign language information and processing the converted sign language information. By performing these steps, the MCU lets a deaf or mute participant at a site understand what the hearing participants express while also expressing his or her own meaning, so that deaf or mute participants and hearing participants can communicate freely and effectively.

The MCU converts the other decoded audio information that participates in the audio mixing into sign language information.

Specifically, the MCU converts into sign language information the other decoded audio information that participates in the audio mixing, and does not convert the other decoded audio information that does not participate in the audio mixing.

According to the calculated energy values of the other decoded audio information, the MCU scales the converted sign language information according to a preset ratio.

Specifically, according to the ordering of the energy values of the other decoded audio information obtained in step 330, the MCU scales the converted sign language information according to a preset ratio. Further, the image corresponding to converted sign language information with a large energy value is enlarged, and the image corresponding to converted sign language information with a small energy value is reduced.
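As a minimal sketch of this ranking-based scaling, the function below assigns a preset scale ratio to each site according to the descending order of its audio energy. The function name, the dictionary layout, and the example ratios are assumptions for illustration; the patent states only that a preset ratio is applied by energy order.

```python
def scale_factors_by_energy(energies, ratios):
    """Map each site to a preset scale ratio according to its energy rank.

    energies: dict of site name -> audio energy value.
    ratios:   preset ratios, one per rank, largest-energy rank first;
              the last ratio is reused for any remaining sites.
    """
    if not ratios:
        raise ValueError("at least one preset ratio is required")
    ranked = sorted(energies, key=energies.get, reverse=True)
    last = len(ratios) - 1
    return {site: ratios[min(rank, last)] for rank, site in enumerate(ranked)}
```

For example, with energies for site 2, site 3, and site 4 and preset ratios `[1.5, 0.75]`, the loudest site's sign language image is enlarged by 1.5 and the others are reduced to 0.75.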

Using an overlay mode or a multi-picture mode, the MCU superimposes the scaled sign language information on the current image in the conference terminal for display at the at least two sites.

Specifically, using the overlay mode or the multi-picture mode, the MCU superimposes the scaled sign language information on the current image in the conference terminal for display at the at least two sites.

The overlay mode means that multiple pieces of sign language information are superimposed together on the current image in the conference terminal and shown on the display screen of the conference terminal; this mode can block part of the current image in the conference terminal, as shown in Figure 5. The multi-picture mode means that the multiple pieces of sign language information and the current image in the conference terminal are shown together on the display screen of the conference terminal; this mode does not block the current image in the conference terminal, as shown in Figure 6.
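The difference between the two display modes can be illustrated with a small layout sketch. The geometry is an assumption made only for illustration (sign language windows placed in a strip along the bottom of the screen); Figures 5 and 6 of the patent may arrange the windows differently.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def overlaps(a, b):
    """Return True if the two rectangles share any area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def layout(mode, screen_w, screen_h, n_sign, strip_h=180):
    """Return (main_image_rect, sign_window_rects) for the given display mode."""
    if n_sign <= 0:
        raise ValueError("n_sign must be positive")
    if mode == "overlay":
        # Current image keeps the full screen; sign windows are drawn on top of it.
        main = Rect(0, 0, screen_w, screen_h)
    elif mode == "multi_picture":
        # Current image is shrunk so the sign windows sit beside it, not over it.
        main = Rect(0, 0, screen_w, screen_h - strip_h)
    else:
        raise ValueError(f"unknown mode: {mode}")
    tile_w = screen_w // n_sign
    y = screen_h - strip_h
    signs = [Rect(i * tile_w, y, tile_w, strip_h) for i in range(n_sign)]
    return main, signs
```

In the overlay layout the sign windows intersect the main image rectangle, which corresponds to the blocking noted above; in the multi-picture layout they do not.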

For example, in the overlay mode shown in Figure 5, the image for each site is obtained by converting audio information. Because the audio information corresponding to site 2 has the largest energy value, its image is enlarged after conversion into sign language information; because the audio information corresponding to site 3 and site 4 has the smallest energy values, their images are reduced after conversion into sign language information. The multi-picture mode shown in Figure 6 works in the same way and is not described again here.

By enlarging or reducing the converted sign language information as described above, a deaf or mute participant at a site can selectively watch, at the same time, what hearing participants at several sites express, and can focus on particular hearing participants according to their audio information, so that deaf or mute participants and hearing participants can communicate freely and effectively.

Preferably, after step 340, the embodiment of the present invention further includes the step of converting the sign language information into text information. Performing this step can assist communication in a multi-party conference; in particular, when the sign language information is not played on the main screen at a site, the text information can assist the exchange in the conference.

The MCU converts the sign language information into text information and superimposes the text information on the current image in the conference terminal for display at the at least two sites.

Specifically, the MCU converts the sign language information into text information and, after the conversion, superimposes the text information on the current image in the conference terminal. The sign language information is converted into text information in caption form, so that when the sign language information of a deaf or mute participant is not played on the main screen at a site, the text information can assist the exchange in the conference.

To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The information processing method provided by the embodiment of the present invention is described below using a specific example.

For example, in a multi-party conference, deaf or mute participants at site 1 and site 2 express themselves in sign language, and hearing participants at site 3 and site 4 express themselves by voice. The conference terminals at site 1 and site 2 capture the sign language information, encode it, and send it to the MCU; the user terminals at site 3 and site 4 pick up the speech of the hearing participants, encode it, and send it to the MCU. After decoding, the MCU obtains the first sign language information from site 1 and the second sign language information from site 2, as well as the first audio information from site 3 and the second audio information from site 4. The MCU records the first time value at which each piece of sign language information was obtained and converts the sign language information into voice information: the first sign language information is converted into first voice information, and the second sign language information into second voice information.

The MCU determines that the number of sites sending sign language information is 2 and compares it with the first threshold. The preset first threshold is 4, meaning that the maximum number of sites the MCU can mix is 4. Because the number of sites sending sign language information does not exceed the first threshold, the MCU performs speech synthesis on the first voice information and the second voice information according to the previously received user attribute information, generating first synthesized speech information and second synthesized speech information that match the user attribute information.
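The threshold check can be sketched as below. One point is an assumption: when the number of sign language streams exceeds the first threshold, this sketch keeps the streams with the earliest first time values; the text here says only that no more than the threshold number of streams are synthesized.

```python
def select_for_synthesis(streams, first_threshold):
    """Choose which converted voice streams are passed to speech synthesis.

    streams: list of (site, first_time_value) pairs, one per site that sent
             sign language information.
    Returns at most first_threshold pairs, earliest first (assumed policy).
    """
    if first_threshold < 0:
        raise ValueError("first_threshold must be non-negative")
    ordered = sorted(streams, key=lambda item: item[1])
    return ordered[:first_threshold]
```

With the values from the example above (2 sites, threshold 4), both streams are kept and synthesized.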

The MCU sorts the first time values in descending order. In this embodiment, the sign language information from site 1 was sent before that from site 2, and the MCU applies gain amplification to the first synthesized speech information and the second synthesized speech information according to preset gain coefficients; for example, the gain coefficient of the first synthesized speech information is 1 and that of the second synthesized speech information is 0.8. The MCU also calculates the energy values of the first audio information and the second audio information and sorts them in descending order. In this embodiment, the energy value of the first audio information is greater than that of the second audio information, and gain amplification is applied to the first and second audio information according to gain coefficients; for example, the gain coefficient of the first audio information is 0.8 and that of the second audio information is 0.6. The MCU then uses a mixing mode to mix the gain-processed synthesized speech information and audio information.

After the mixing is complete, the MCU sends the mixed audio information to the at least two sites.

At the same time, the MCU also converts the first audio information and the second audio information that participate in the audio mixing into converted first sign language information and converted second sign language information. According to the ordering of the energy values of the first audio information and the second audio information, the MCU scales the converted sign language information according to a preset ratio. In this embodiment, because the energy value of the first audio information is larger, the converted first sign language information is enlarged and the converted second sign language information is reduced. Using the overlay mode or the multi-picture mode, the MCU superimposes the scaled first and second sign language information on the current image in the conference terminal for display at the at least two sites.

Further, the MCU can also convert the first sign language information and the second sign language information into text information and, after the conversion, superimpose the first text information and the second text information on the current image in the conference terminal. The first and second sign language information are converted into first and second text information in caption form, so that when the first and second sign language information of the deaf or mute participants are not played on the main screen at a site, the first and second text information can assist the exchange in the conference.
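The gain assignment and mixing in this example can be sketched as follows, using the coefficients given above (1 and 0.8 for the synthesized speech, 0.8 and 0.6 for the audio). Several details are assumptions: the sketch gives the larger coefficient to the sign language stream that was sent earlier, as in the worked example; energy is taken as the mean of squared samples; and the mix is a gain-weighted sum clipped to [-1, 1]. The patent does not fix any of these choices.

```python
def energy(samples):
    """Mean squared amplitude of a block of samples (assumed energy measure)."""
    return sum(s * s for s in samples) / len(samples)


def _by_rank(ordered_keys, coefficients):
    # Reuse the last coefficient if there are more streams than coefficients.
    last = len(coefficients) - 1
    return {k: coefficients[min(i, last)] for i, k in enumerate(ordered_keys)}


def assign_gains(sign_times, audio, sign_coeffs=(1.0, 0.8), audio_coeffs=(0.8, 0.6)):
    """sign_times: stream name -> first time value; audio: stream name -> samples."""
    sign_order = sorted(sign_times, key=sign_times.get)  # earlier first (assumption)
    audio_order = sorted(audio, key=lambda k: energy(audio[k]), reverse=True)
    gains = _by_rank(sign_order, sign_coeffs)
    gains.update(_by_rank(audio_order, audio_coeffs))
    return gains


def mix(streams, gains):
    """Gain-weighted sum of all streams, clipped to the range [-1.0, 1.0]."""
    length = min(len(s) for s in streams.values())
    mixed = []
    for i in range(length):
        total = sum(gains[name] * samples[i] for name, samples in streams.items())
        mixed.append(max(-1.0, min(1.0, total)))
    return mixed
```

Clipping is the simplest way to keep the sum in range; a production mixer would more likely normalize or apply a limiter.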

Correspondingly, Embodiment 2 of the present invention further provides an apparatus for processing audio-visual information in a video conference, which implements the method for processing audio-visual information in a video conference of Embodiment 1. As shown in Figure 7, the apparatus includes: a decoding unit 710, a conversion and synthesis unit 720, an audio mixing unit 730, and a sending unit 740.

The decoding unit 710 is configured to receive the data bitstreams sent by at least two conference terminals and decode the data bitstreams to obtain at least two channels of decoded information;

The conversion and synthesis unit 720 is configured to, when it is determined that sign language information exists in the at least two channels of decoded information from the decoding unit, convert the sign language information into voice information and perform speech synthesis on the converted voice information to generate synthesized speech information;

The audio mixing unit 730 is configured to mix the synthesized speech information generated by the conversion and synthesis unit with the other decoded audio information;

The sending unit 740 is configured to send the audio information mixed by the audio mixing unit to the at least two sites.
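The way the four units hand data to one another can be sketched as a small pipeline. This is a structural sketch only: the `Decoded` record, its `kind` field, and the callables passed in are hypothetical stand-ins for the decoding, conversion and synthesis, mixing, and sending units, whose internals the patent does not specify at this level.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Decoded:
    site: str     # site the stream came from
    kind: str     # "sign" or "audio" (hypothetical labels)
    payload: Any  # decoded sign language or audio data


def process_conference(bitstreams, decode, sign_to_voice, synthesize, mix, send):
    """Run decode -> convert and synthesize -> mix -> send for one round."""
    decoded = [decode(b) for b in bitstreams]           # decoding unit
    if len(decoded) < 2:
        raise ValueError("at least two conference terminals are required")
    synthesized = [synthesize(sign_to_voice(d.payload))  # conversion and synthesis unit
                   for d in decoded if d.kind == "sign"]
    other_audio = [d.payload for d in decoded if d.kind == "audio"]
    mixed = mix(synthesized + other_audio)               # audio mixing unit
    for site in sorted({d.site for d in decoded}):       # sending unit
        send(site, mixed)
    return mixed
```

Passing the stages in as functions keeps the sketch independent of any particular codec, sign language recognizer, or synthesizer.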

The decoding unit 710 is further configured to receive user attribute information that is input by a user or sent by the conference terminal;

The conversion and synthesis unit 720 is specifically configured to perform speech synthesis on the converted voice information according to the user attribute information, generating synthesized speech information that matches the user attribute information.

The apparatus further includes: a judging unit 750, configured to judge whether the number of sites where the conference terminals sending the sign language information are located exceeds a first threshold, and to send the judgment result to the conversion and synthesis unit;

The conversion and synthesis unit 720 is specifically configured to, on receiving from the judging unit a judgment result that the number of sites does not exceed the first threshold, perform speech synthesis on the converted voice information to generate synthesized speech information;

The conversion and synthesis unit 720 is specifically configured to, on receiving from the judging unit a judgment result that the number of sites exceeds the first threshold, perform speech synthesis on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.

The conversion and synthesis unit 720 is further configured to record the first time value of each piece of sign language information obtained after decoding;

The audio mixing unit 730 is specifically configured to: sort the first time values in descending order;

According to the ordering of the first time values, apply gain amplification to the generated synthesized speech information according to preset gain coefficients;

Calculate the energy values of the other decoded audio information and, according to the descending order of the energy values, apply gain amplification to the other decoded audio information according to gain coefficients;

Mix the gain-processed synthesized speech information with the gain-processed other decoded audio information, and send the mixed audio information to the at least two sites.

The apparatus further includes: a sign language conversion unit 760, configured to convert the other decoded audio information that participates in the audio mixing into sign language information;

A scaling unit 770, configured to scale the converted sign language information according to a preset ratio, based on the calculated energy values of the other decoded audio information;

A superimposing unit 780, configured to superimpose the scaled sign language information on the current image in the conference terminal for display at the at least two sites.

The apparatus further includes: a text conversion unit 790, configured to convert the sign language information into text information and superimpose the text information on the current image in the conference terminal for display at the at least two sites.

With the apparatus for processing audio-visual information in a video conference provided by this embodiment of the present invention, the apparatus decodes the data bitstreams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to the at least two sites. This solves the problem in the prior art that deaf or mute people cannot communicate freely and effectively when taking part in a multi-party remote video conference.

In addition, the apparatus for processing audio-visual information in a video conference provided by Embodiment 2 of the present invention can also be implemented as follows, to carry out the method for processing audio-visual information in a video conference of Embodiment 1. As shown in Figure 8, the apparatus includes: a network interface 810, a processor 820, and a memory 830. A system bus 840 interconnects the network interface 810, the processor 820, and the memory 830.

The network interface 810 is used to communicate with the conference terminals.

The memory 830 can be permanent storage, such as a hard disk drive or flash memory, and holds software modules and device drivers. The software modules can carry out the functional modules of the above method of the present invention; the device drivers can be network and interface drivers.

At startup, these software components are loaded into the memory 830 and then accessed by the processor 820, which executes the following instructions:

Receiving the data bitstreams sent by at least two conference terminals and decoding the data bitstreams to obtain at least two channels of decoded information;

When it is determined that sign language information exists in the at least two channels of decoded information, converting the sign language information into voice information and performing speech synthesis on the converted voice information to generate synthesized speech information;

Mixing the generated synthesized speech information with the other decoded audio information;

Sending the mixed audio information to the at least two sites.

Further, after accessing the software components in the memory 830, and before executing the instruction to receive the data bitstreams sent by at least two conference terminals and decode them to obtain at least two channels of decoded information, the processor executes the following instruction:

Receiving user attribute information that is input by a user or sent by the conference terminal;

The performing of speech synthesis on the converted voice information to generate synthesized speech information is specifically:

According to the user attribute information, performing speech synthesis on the converted voice information to generate synthesized speech information that matches the user attribute information.

Further, after accessing the software components in the memory 830, the processor executes the following specific instructions for performing speech synthesis on the converted voice information to generate synthesized speech information:

Judging whether the number of sites where the conference terminals sending the sign language information are located exceeds a first threshold;

If the first threshold is not exceeded, performing speech synthesis on the converted voice information to generate synthesized speech information;

If the first threshold is exceeded, performing speech synthesis on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.

Further, after accessing the software components in the memory 830, and before executing the instruction to convert the sign language information into voice information, the processor executes the following instruction:

Recording the first time value of each piece of sign language information obtained after decoding;

The specific instructions for mixing the generated synthesized speech information with the other decoded audio information are:

Sorting the first time values in descending order;

According to the ordering of the first time values, applying gain amplification to the generated synthesized speech information according to preset gain coefficients;

Calculating the energy values of the other decoded audio information and, according to the descending order of the energy values, applying gain amplification to the other decoded audio information according to gain coefficients;

Mixing the gain-processed synthesized speech information with the gain-processed other decoded audio information, and sending the mixed audio information to the at least two sites.

Further, after accessing the software components in the memory 830, the processor executes the following instructions:

Converting the other decoded audio information that participates in the audio mixing into sign language information;

According to the ordering of the energy values of the other decoded audio information, scaling the converted sign language information according to a preset ratio;

Superimposing the scaled sign language information on the current image in the conference terminal for display at the at least two sites.

Further, after accessing the software components in the memory 830, the processor executes the following instruction:

Converting the sign language information into text information and superimposing the text information on the current image in the conference terminal for display at the at least two sites.

With the apparatus for processing audio-visual information in a video conference provided by this embodiment of the present invention, the apparatus decodes the data bitstreams sent by the conference terminals; when the decoded data is sign language information, it converts the sign language information into voice information and processes the converted voice information to generate synthesized speech information; it mixes the generated synthesized speech information with the other decoded audio information; and it sends the mixed audio information to the at least two sites. This solves the problem in the prior art that deaf or mute people cannot communicate freely and effectively when taking part in a multi-party remote video conference.

Those skilled in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is only a description of specific embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (11)

1. A method for processing audio-visual information in a video conference, characterized in that the video conference comprises at least two sites, each site comprises at least one conference terminal, and the method comprises:
receiving the data bitstreams sent by at least two conference terminals and decoding the data bitstreams to obtain at least two channels of decoded information;
when it is determined that sign language information exists in the at least two channels of decoded information, recording the first time value of each piece of sign language information obtained after decoding, converting the sign language information into voice information, and performing speech synthesis on the converted voice information to generate synthesized speech information;
sorting the first time values in descending order;
according to the ordering of the first time values, applying gain amplification to the synthesized speech information according to preset gain coefficients;
calculating the energy values of the other decoded audio information, other than the sign language information, in the at least two channels of decoded information and, according to the descending order of the energy values, applying gain amplification to the other decoded audio information according to gain coefficients;
mixing the gain-processed synthesized speech information with the gain-processed other decoded audio information, and sending the mixed audio information to the at least two sites.
2. The method for processing audio-visual information in a video conference according to claim 1, characterized in that, before the receiving of the data bitstreams sent by at least two conference terminals and the decoding of the data bitstreams to obtain at least two channels of decoded information, the method further comprises:
receiving user attribute information that is input by a user or sent by the conference terminal;
wherein the performing of speech synthesis on the converted voice information to generate synthesized speech information is specifically:
according to the user attribute information, performing speech synthesis on the converted voice information to generate synthesized speech information that matches the user attribute information.
3. The method for processing audio-visual information in a video conference according to claim 1, characterized in that the performing of speech synthesis on the converted voice information to generate synthesized speech information is specifically:
judging whether the number of sites where the conference terminals sending the sign language information are located exceeds a first threshold;
if the first threshold is not exceeded, performing speech synthesis on the converted voice information to generate synthesized speech information;
if the first threshold is exceeded, performing speech synthesis on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.
4. The method for processing audio-visual information in a video conference according to claim 1, characterized in that the method further comprises:
converting the other decoded audio information that participates in the audio mixing into sign language information;
according to the ordering of the energy values of the other decoded audio information, scaling the converted sign language information according to a preset ratio;
superimposing the scaled sign language information on the current image in the conference terminal for display at the at least two sites.
5. The method for processing audio-visual information in a video conference according to claim 1, characterized in that the method further comprises:
converting the sign language information into text information and superimposing the text information on the current image in the conference terminal for display at the at least two sites.
6. An apparatus for processing audio-visual information in a video conference, characterized in that the apparatus comprises:
a decoding unit, configured to receive the data bitstreams sent by at least two conference terminals and decode the data bitstreams to obtain at least two channels of decoded information;
a conversion and synthesis unit, configured to, when it is determined that sign language information exists in the at least two channels of decoded information from the decoding unit, record the first time value of each piece of sign language information obtained after decoding, convert the sign language information into voice information, and perform speech synthesis on the converted voice information to generate synthesized speech information;
an audio mixing unit, configured to sort the first time values in descending order; according to the ordering of the first time values, apply gain amplification to the synthesized speech information according to preset gain coefficients; calculate the energy values of the other decoded audio information, other than the sign language information, in the at least two channels of decoded information and, according to the descending order of the energy values, apply gain amplification to the other decoded audio information according to gain coefficients; and mix the gain-processed synthesized speech information with the gain-processed other decoded audio information;
a sending unit, configured to send the audio information mixed by the audio mixing unit to the at least two sites.
7. The apparatus for processing audio-visual information in a video conference according to claim 6, characterized in that the decoding unit is further configured to receive user attribute information that is input by a user or sent by the conference terminal;
the conversion and synthesis unit is specifically configured to perform speech synthesis on the converted voice information according to the user attribute information, generating synthesized speech information that matches the user attribute information.
8. The apparatus for processing audio-visual information in a video conference according to claim 6, characterized in that the apparatus further comprises:
a judging unit, configured to judge whether the number of sites where the conference terminals sending the sign language information are located exceeds a first threshold, and to send the judgment result to the conversion and synthesis unit;
the conversion and synthesis unit is specifically configured to, on receiving from the judging unit a judgment result that the number of sites does not exceed the first threshold, perform speech synthesis on the converted voice information to generate synthesized speech information;
the conversion and synthesis unit is specifically configured to, on receiving from the judging unit a judgment result that the number of sites exceeds the first threshold, perform speech synthesis on converted voice information not exceeding the first threshold in number, to generate synthesized speech information.
9. The apparatus for processing audio and video information in a video conference according to claim 6, characterized in that the apparatus further comprises:
A sign language conversion unit, configured to convert the other decoded audio information that participates in the audio mixing into sign language information;
A scaling unit, configured to scale the converted sign language information by a preset ratio according to the calculated energy value of the other decoded audio information;
A superimposing unit, configured to superimpose the scaled sign language information on the current image of the conference terminal, for display at the at least two conference sites.
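One way to picture the scaling step is sketched below. It assumes that the energy value is the mean square of the audio samples and that louder streams receive the larger preset ratio; the claim fixes neither choice, and the names `energy` and `scale_sign_images` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def energy(samples: Sequence[float]) -> float:
    """Mean-square energy of one block of audio samples (assumed definition)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def scale_sign_images(
    audio: Dict[str, Sequence[float]],
    base_size: Tuple[int, int],
    preset_ratios: Sequence[float],
) -> Dict[str, Tuple[int, int]]:
    """Assign a display size to each site's sign language image.

    Sites are ranked by audio energy, highest first; the site at rank i
    gets preset_ratios[i], and the last ratio is reused for lower ranks.
    """
    ranked = sorted(audio, key=lambda site: energy(audio[site]), reverse=True)
    sizes = {}
    for rank, site in enumerate(ranked):
        ratio = preset_ratios[min(rank, len(preset_ratios) - 1)]
        sizes[site] = (int(base_size[0] * ratio), int(base_size[1] * ratio))
    return sizes


sizes = scale_sign_images(
    {"site_a": [0.1, -0.1], "site_b": [0.5, -0.5]},
    base_size=(320, 240),
    preset_ratios=[1.0, 0.5],
)
```

In this example `site_b` has the higher energy, so its sign language image keeps the full base size while the image for `site_a` is halved before being superimposed on the current picture.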
10. The apparatus for processing audio and video information in a video conference according to claim 6, characterized in that the apparatus further comprises:
A text conversion unit, configured to convert the sign language information into text information and to superimpose the text information on the current image of the conference terminal, for display at the at least two conference sites.
11. A video conferencing system, characterized in that the system comprises: at least two conference sites, each of which comprises at least one conference terminal, and the apparatus for processing audio and video information in a video conference according to any one of claims 6 to 10.
CN201210560387.7A 2012-12-21 2012-12-21 Method, apparatus and system for processing audio and video information in a video conference CN102984496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, apparatus and system for processing audio and video information in a video conference

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, apparatus and system for processing audio and video information in a video conference
PCT/CN2013/083170 WO2014094461A1 (en) 2012-12-21 2013-09-10 Method, device and system for processing video/audio information in video conference

Publications (2)

Publication Number Publication Date
CN102984496A CN102984496A (en) 2013-03-20
CN102984496B CN102984496B (en) 2015-08-19

Family

ID=47858189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210560387.7A CN102984496B (en) 2012-12-21 2012-12-21 Method, apparatus and system for processing audio and video information in a video conference

Country Status (2)

Country Link
CN (1) CN102984496B (en)
WO (1) WO2014094461A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984496B (en) * 2012-12-21 2015-08-19 华为技术有限公司 Method, apparatus and system for processing audio and video information in a video conference
CN103686059B (en) * 2013-09-23 2017-04-05 广东威创视讯科技股份有限公司 Distributed mixed audio processing method and system
CN107566863A (en) * 2016-06-30 2018-01-09 中兴通讯股份有限公司 Information interaction display method, apparatus and device, and set-top box
CN108881783A (en) * 2017-05-09 2018-11-23 腾讯科技(深圳)有限公司 Method and apparatus for implementing multi-conference, computer device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101179693A (en) * 2007-09-26 2008-05-14 深圳市丽视视讯科技有限公司 Mixed audio processing method of session television system
CN101453611A (en) * 2007-12-07 2009-06-10 希姆通信息技术(上海)有限公司 Method for video communication between the deaf and the normal
CN101816191A (en) * 2007-09-26 2010-08-25 弗劳恩霍夫应用研究促进协会 Apparatus and method for obtaining weight coefficients for extracting an ambient signal, apparatus and method for extracting an ambient signal, and computer program
CN102246225A (en) * 2008-12-15 2011-11-16 皇家飞利浦电子股份有限公司 Method and apparatus for synthesizing speech

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101309390B (en) * 2007-05-17 2012-05-23 华为技术有限公司 Visual communication system, apparatus and subtitle displaying method
CN101080000A (en) * 2007-07-17 2007-11-28 华为技术有限公司 Method, system, server and terminal for displaying speaker in video conference
CN101115088A (en) * 2007-08-07 2008-01-30 周运南 Mobile phone dedicated for deaf-mutes
CN101594434A (en) * 2009-06-16 2009-12-02 中兴通讯股份有限公司 Sign language processing method for a mobile terminal and sign language processing mobile terminal
US8447023B2 (en) * 2010-02-01 2013-05-21 Polycom, Inc. Automatic audio priority designation during conference
CN102387338B (en) * 2010-09-03 2017-04-12 中兴通讯股份有限公司 Distributed type video processing method and video session system
CN102984496B (en) * 2012-12-21 2015-08-19 华为技术有限公司 Method, apparatus and system for processing audio and video information in a video conference

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101179693A (en) * 2007-09-26 2008-05-14 深圳市丽视视讯科技有限公司 Mixed audio processing method of session television system
CN101816191A (en) * 2007-09-26 2010-08-25 弗劳恩霍夫应用研究促进协会 Apparatus and method for obtaining weight coefficients for extracting an ambient signal, apparatus and method for extracting an ambient signal, and computer program
CN101453611A (en) * 2007-12-07 2009-06-10 希姆通信息技术(上海)有限公司 Method for video communication between the deaf and the normal
CN102246225A (en) * 2008-12-15 2011-11-16 皇家飞利浦电子股份有限公司 Method and apparatus for synthesizing speech

Also Published As

Publication number Publication date
WO2014094461A1 (en) 2014-06-26
CN102984496A (en) 2013-03-20

Similar Documents

Publication Publication Date Title
US20160321936A1 (en) System and method for hybrid course instruction
US9977574B2 (en) Accelerated instant replay for co-present and distributed meetings
CN105681920B (en) A kind of Network teaching method and system with speech identifying function
US8750678B2 (en) Conference recording method and conference system
CN103703719B (en) The method and apparatus for making the participant in communication session mute
US9560319B1 (en) Audiovisual information processing in videoconferencing
US8416715B2 (en) Interest determination for auditory enhancement
WO2017190434A1 (en) Method for generating statistical information, and server
JP6101973B2 (en) Voice Link system
TWI221746B (en) Preview file generating method applicable on multiple systems and device thereof
US9621851B2 (en) Augmenting web conferences via text extracted from audio content
US6801663B2 (en) Method and apparatus for producing communication data, method and apparatus for reproducing communication data, and program storage medium
CN106471802A (en) Real-time video conversion in video conference
US20100253689A1 (en) Providing descriptions of non-verbal communications to video telephony participants who are not video-enabled
US20110246172A1 (en) Method and System for Adding Translation in a Videoconference
US20090012788A1 (en) Sign language translation system
CN102939791A (en) Hand-held communication aid for individuals with auditory, speech and visual impairments
CN102065265B (en) Method, device and system for realizing sound mixing
US8447065B2 (en) Method of facial image reproduction and related device
CN101228582B (en) Audio reproduction method and apparatus supporting audio thumbnail function
US20070020603A1 (en) Synchronous communications systems and methods for distance education
TW200407807A (en) Remote education system, and method and program for attendance confirmation
JP2008500573A (en) Method and system for changing messages
CN106023983A (en) Multi-user voice interaction method and device based on virtual reality scene
CN103051864B (en) Mobile video session method

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
GR01 Patent grant
C14 Grant of patent or utility model