CN1997161A

CN1997161A - A video terminal and audio code stream processing method

Info

Publication number: CN1997161A
Application number: CN 200610064656
Authority: CN
Inventors: 詹五洲
Original assignee: Huawei Technologies Co Ltd
Current assignee: FUGUE ACOUSTICS TECHNOLOGY CO., LTD.
Priority date: 2006-12-30
Filing date: 2006-12-30
Publication date: 2007-07-11
Anticipated expiration: 2026-12-30
Also published as: CN100556151C

Abstract

The invention discloses an audio code stream processing method comprising the following steps: decoding a compressed video stream to obtain an image containing a sound source, and detecting the position information of the sound source in the image; decoding the compressed audio stream corresponding to the compressed video stream to obtain voice information; and processing the voice information according to the position information so that the direction of the played-back sound matches the position of the sound source. The invention also discloses a video terminal.

Description

A video terminal and an audio code stream processing method

Technical field

The present invention relates to communication technology, and in particular to a video terminal and an audio code stream processing method.

Background technology

With the spread of broadband, video communication occupies an increasingly important place in social life, and the era of video communication has begun. However, television screens keep getting larger, and some video communication systems display on projectors or video walls, so a participant's position can shift considerably across the picture. The sound of current multimedia communication systems does not change with the speaker's position; that is, the sound carries no direction information, so video communication lacks realism.

The prior art discloses a solution to this problem: a bar-shaped device containing several microphones, several loudspeakers, and a camera is placed on top of the television set. Processing the signals collected by the microphones yields one voice signal together with the speaker's direction relative to the bar-shaped device. The sending end of the video communication system transmits the voice signal and the speaker's direction information to the receiving end over the network, and the receiving end selects one or more loudspeakers for playback according to the received direction information, thereby reproducing the speaker's direction at the receiving end.

In this scheme, the speaker's direction information collected at the sending end is relative to the bar-shaped device, not to the camera lens. When the lens is rotated, a speaker directly in front of the bar-shaped device appears at the edge of the picture or even outside it, yet the collected sound direction is still straight ahead. The speaker's position in the picture and the collected sound direction therefore do not match.

In addition, the sending end must transmit the direction information to the receiving end over the network. If the sending end and the receiving end are equipment from different manufacturers, interoperability problems arise; that is, the receiving end may be unable to process the sending end's direction information correctly.

Summary of the invention

Embodiments of the invention provide a video terminal and an audio code stream processing method, so that the sending end need not transmit sound source position information to the receiving end over the network, while the played-back sound can still be accurately matched to the position of the sound source.

An audio code stream processing method, characterized in that the method comprises:

decoding a compressed video stream to obtain an image containing a sound source, and detecting position information of the sound source in the image;

decoding the compressed audio stream corresponding to the compressed video stream to obtain voice information;

processing the voice information according to the position information of the sound source, so that the direction of the played-back sound matches the position of the sound source.

A video terminal, characterized in that it comprises:

a video decoding module, configured to decode a received compressed video stream and output the decoded image;

an audio decoding module, configured to decode the received compressed audio stream corresponding to the compressed video stream and output the decoded voice information;

a sound source position detection module, configured to receive the image sent by the video decoding module and extract features of the sound source, thereby detecting position information of the sound source;

a sound direction processing module, configured to receive the voice information sent by the audio decoding module and the sound source position information sent by the sound source position detection module, and to match the sound direction to the position of the sound source.

By detecting the position of the sound source in the image and processing the played-back sound accordingly, embodiments of the invention make the direction of the sound played back through the loudspeakers match the position of the sound source in the image; at the same time, the receiving end does not depend on the sending end to provide sound source position information.

Description of drawings

Fig. 1 is a flow chart of the method of an embodiment of the invention;

Fig. 2 shows an application scenario of an embodiment of the invention;

Fig. 3 is a flow chart of lip-movement detection in an embodiment of the invention;

Fig. 4 is a structural diagram of the video terminal in an embodiment of the invention.

Embodiment

An embodiment of the invention provides an audio code stream processing method. As shown in Fig. 1, the method consists of the steps described below.

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The invention is described in detail below using a video conference as an application scenario of the embodiment. This application scenario does not limit the invention.

Fig. 2 is a schematic diagram of a video communication system. In Fig. 2, 10 is the sending-end conference site, 11 is the receiving-end conference site, and 12 is the communication network, which may be an IP network, a PSTN network, a wireless network, or the like.

In site 10, 101 is a camera, 102 is a video communication terminal, 103 is a television set, 104 is a participant, and 105 and 106 are loudspeakers. A microphone is built into terminal 102; alternatively, it may be placed externally as a separate unit and connected to terminal 102 by a transmission line. In site 11, 111 is a camera, 112 is a video communication terminal, 113 is a television set, 104a is the image of participant 104, and 115 and 116 are loudspeakers. A microphone is built into terminal 112; alternatively, it may be placed externally as a separate unit and connected to terminal 112 by a transmission line.

After camera 101 in the sending-end site 10 captures an image, the image is sent to terminal 102. Terminal 102 encodes and otherwise processes the image and transmits it to terminal 112 over network 12. Terminal 112 decodes the received image stream and sends the decoded image to television set 113 for display. After the microphone in site 10 captures the voice signal, the signal is passed to terminal 102, which performs audio encoding and transmits the encoded audio stream to terminal 112 over network 12. Terminal 112 decodes the received audio stream and sends it to loudspeakers 115 and 116 for playback.

In site 11 of Fig. 2, for the sound to convey a sense of presence, the sound played back by loudspeakers 115 and 116 must match the position of speaker 104a.

The method of the invention is described below taking as an example a video conference in which the sound source is a person speaking in the meeting:

Step 1: Video-decode the compressed video stream sent by the sending end to obtain the image of the sending end, then detect the position information of the speaker in the image.

Video-decoding the compressed video stream yields a sequence of frames; the images in the frame sequence are then examined to obtain the speaker's position information.

There are many ways to detect the speaker's position. For example, image recognition can be used, taking some characteristic of the speaker as the feature for locating the speaker in the image; usable features include the face, eyes, lips, and so on. The following takes the speaker's lips as the feature and explains how to determine the speaker's position information by detecting the speaker's lip-movement position.

Refer to the lip-movement detection flow in Fig. 3.

S11: Detect the lip-movement position in the current frame. If the current frame contains lip movement, go to step S12; otherwise go to step S14.

S12: Determine whether there are multiple lip-movement positions. If there are, either select one of them or calculate the center of the multiple lip-movement positions and take that center as the lip-movement position, then go to step S13; otherwise go directly to step S13.

S13: Output the lip-movement position.

S14: Do not output a lip-movement position.
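The decision flow of S11 to S14 can be sketched as a small function. This is only an illustrative sketch: the name `output_lip_position`, the `(x, y)` pixel representation, and the choice of the center option in S12 (rather than selecting one candidate) are assumptions, not details given in the text. The list of moving lips is assumed to come from some detector that is not shown.

```python
from typing import List, Optional, Tuple

# (x, y) in pixels; an assumed representation of a lip-movement position.
Position = Tuple[float, float]


def output_lip_position(moving_lips: List[Position]) -> Optional[Position]:
    """Apply steps S11-S14 to the moving lips found in the current frame."""
    # S11 -> S14: no lip movement in this frame, so nothing is output.
    if not moving_lips:
        return None
    # S12: several candidates; use their center (one of the two options in S12).
    if len(moving_lips) > 1:
        cx = sum(x for x, _ in moving_lips) / len(moving_lips)
        cy = sum(y for _, y in moving_lips) / len(moving_lips)
        return (cx, cy)
    # S13: exactly one candidate; output it unchanged.
    return moving_lips[0]
```

Returning `None` for S14 lets the caller keep whatever sound direction it used previously; that handling is a design choice of this sketch, not something the text prescribes.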

The lip-movement position is the position of the speaker's lips. It can be detected with existing methods. A simple and effective method uses lip color; the search for lip color can be carried out in the YIQ or YUV color space. For example, in the YIQ space, statistics and experiments give the following optimal thresholds for the lip-color components: Y ∈ [80, 220], I ∈ [12, 78], Q ∈ [7, 25]. With these thresholds the lip position can be found fairly easily. A search based on lip color alone inevitably produces some false detections, so after a candidate lip position is found it can be checked further against the skin color around the lips. Skin color also falls within a relatively concentrated threshold range, which can be used to judge whether the color surrounding the lips is skin color: if it is, the lip position is taken as correct; otherwise it is rejected. Other usable features include eye features.
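The lip-color test with the quoted YIQ thresholds can be sketched per pixel as follows. The text does not say which RGB-to-YIQ matrix or value scale was used, so this sketch assumes 8-bit RGB input (0 to 255) and the commonly quoted NTSC conversion coefficients; the function names are hypothetical. The skin-color check around the lips is omitted because the text gives no thresholds for it.

```python
from typing import Tuple

# Thresholds quoted in the text for the YIQ components of lip color.
Y_RANGE = (80.0, 220.0)
I_RANGE = (12.0, 78.0)
Q_RANGE = (7.0, 25.0)


def rgb_to_yiq(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Convert 0-255 RGB to YIQ using common NTSC coefficients (an assumption)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.211 * r - 0.523 * g + 0.312 * b
    return y, i, q


def is_lip_color(r: float, g: float, b: float) -> bool:
    """Return True if the pixel falls inside all three lip-color ranges."""
    y, i, q = rgb_to_yiq(r, g, b)
    return (
        Y_RANGE[0] <= y <= Y_RANGE[1]
        and I_RANGE[0] <= i <= I_RANGE[1]
        and Q_RANGE[0] <= q <= Q_RANGE[1]
    )
```

A reddish-pink pixel such as `(200, 100, 110)` passes the test, while a neutral gray pixel fails because its I and Q components are close to zero.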

After the lip position is determined, it is also necessary to judge whether the lips are moving. This can be judged readily from the size of the lips at the same position in several preceding and following frames and from how fast that size changes. Because the lip-movement position is continuous over time, it is not necessary to search the whole image in every frame. Specifically, if a lip-movement position was detected in the previous frame, the current frame is first checked for a lip near that position. If no lip is found there, the lip-movement position is searched for over the whole image. If a lip is found, it is further judged whether the lip is moving. If it is moving, the position of the moving lip is taken as the lip-movement position. Otherwise, a predetermined number of frames is set, and the lip-movement position is held unchanged for that many frames after the current frame; if the lip still has not moved once the predetermined number of frames is exceeded, the search for a lip-movement position restarts over the whole image. This method greatly reduces the amount of computation and preserves the continuity of the sound direction.
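The local-search-then-hold logic described above can be sketched as a small state machine. The class name `LipTracker`, the default of 10 hold frames, and the two detector callbacks are hypothetical: `find_near(prev)` stands in for a search near the previous position and returns `(position, is_moving)` or `None`, and `find_full()` stands in for a whole-image search returning a lip-movement position or `None`.

```python
from typing import Callable, Optional, Tuple

Position = Tuple[int, int]
NearResult = Optional[Tuple[Position, bool]]  # (lip position, is it moving?)


class LipTracker:
    """Keeps the lip-movement position across frames (illustrative sketch)."""

    def __init__(self, hold_frames: int = 10) -> None:
        self.hold_frames = hold_frames      # the "predetermined number of frames"
        self.position: Optional[Position] = None
        self.still_count = 0                # consecutive frames without motion

    def update(
        self,
        find_near: Callable[[Position], NearResult],
        find_full: Callable[[], Optional[Position]],
    ) -> Optional[Position]:
        # No previous position: search the whole image.
        if self.position is None:
            return self._full_search(find_full)
        near = find_near(self.position)
        # No lip near the previous position: search the whole image.
        if near is None:
            return self._full_search(find_full)
        lip_position, moving = near
        if moving:
            # A moving lip was found nearby: it becomes the new position.
            self.position = lip_position
            self.still_count = 0
            return self.position
        # Lip present but still: hold the old position for a limited time.
        self.still_count += 1
        if self.still_count > self.hold_frames:
            return self._full_search(find_full)
        return self.position

    def _full_search(self, find_full: Callable[[], Optional[Position]]) -> Optional[Position]:
        self.position = find_full()
        self.still_count = 0
        return self.position
```

Because the whole-image search runs only when tracking is lost or the hold period expires, most frames need only the cheaper local check, which is the saving the paragraph describes.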

In video communication, and especially in video conferencing, one site may hold several participants. Because someone may yawn or speak in a low voice, several lip-movement positions may be detected, and a suitable one must be selected from among them. As noted above, if the previous frame has a lip-movement position, detection is confined to the neighborhood of that position; multiple lip-movement positions are therefore detected only when the whole image is searched. Several strategies are available for selecting one lip-movement position from several: for example, select frontal lip-movement positions and filter out those seen from the side, or select the lip-movement position nearest the middle of the picture and filter out those at its edges. Sometimes several people in the site speak at once. If none of the above strategies yields a suitable lip-movement position, the center of these speakers' lip-movement positions can be calculated and output as the lip-movement position.
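One of the selection strategies mentioned above, preferring the candidate nearest the middle of the picture and falling back to the center of all candidates, might look like the sketch below. The function name and the `keep_fraction` parameter (how far from the center line a candidate may lie before it counts as being at the edge) are assumptions introduced for illustration; the text does not define when a candidate is "at the edge".

```python
from typing import List, Tuple

Position = Tuple[float, float]  # (x, y) in pixels, x measured from the left edge


def select_lip_position(
    candidates: List[Position],
    picture_width: float,
    keep_fraction: float = 0.5,
) -> Position:
    """Pick one lip-movement position from several candidates."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    center_x = picture_width / 2.0
    limit = keep_fraction * center_x
    # Filter out candidates that lie too close to the picture edges.
    central = [p for p in candidates if abs(p[0] - center_x) <= limit]
    if central:
        # Prefer the candidate nearest the vertical center line.
        return min(central, key=lambda p: abs(p[0] - center_x))
    # Fallback: no candidate is suitable, so output the center of all of them.
    n = len(candidates)
    return (sum(p[0] for p in candidates) / n, sum(p[1] for p in candidates) / n)
```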

Step 2: Decode the compressed audio stream sent by the sending end to obtain voice information.

The decoding of the compressed audio stream and of the compressed video stream in Step 1 and Step 2 can be performed simultaneously or separately, in no particular order.

Step 3: Process the received voice information according to the speaker's position information, so that the direction of the speaker's voice matches the speaker's position.

Processing the voice information according to the speaker's position can be done with existing methods. An example follows. For the application scenario of Fig. 2, if playback uses two loudspeakers placed to the left and right of the television set, one processing scheme is to adjust the amplitudes of the left and right channels so that the horizontal direction of the sound matches the speaker's position in the picture. The adjustment can be described by the following two formulas:

D = (g1 - g2) / (g1 + g2)

C = g1*g1 + g2*g2

In these two formulas, C is a fixed value, g1 is the amplitude gain of the left channel, g2 is the amplitude gain of the right channel, and D is the relative horizontal position of the speaker in the picture, calculated from the lip-movement position information. Let D' be the distance of the lip-movement position from the vertical center line of the picture (positive when the lip-movement position is in the left half of the picture, negative in the right half), and let W be the horizontal width of the television picture. Then D is calculated as:

D = D' / (W/2)
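The two gain formulas can be solved in closed form. Writing g1 = k(1 + D) and g2 = k(1 - D) satisfies the first formula for any k, and substituting into the second gives k = sqrt(C / (2(1 + D*D))). The sketch below applies this; the function names, the default C = 1, and the convention that the lip x-coordinate is measured in pixels from the left edge of the picture are assumptions made for illustration.

```python
import math
from typing import Tuple


def relative_offset(lip_x: float, width: float) -> float:
    """D = D' / (W/2), with D' positive when the lip is left of the center line."""
    d_prime = width / 2.0 - lip_x  # lip_x measured from the left edge (assumed)
    return d_prime / (width / 2.0)


def pan_gains(d: float, c: float = 1.0) -> Tuple[float, float]:
    """Solve D = (g1-g2)/(g1+g2) and C = g1*g1 + g2*g2 for g1 (left), g2 (right)."""
    if not -1.0 <= d <= 1.0:
        raise ValueError("D must lie in [-1, 1]")
    k = math.sqrt(c / (2.0 * (1.0 + d * d)))
    return k * (1.0 + d), k * (1.0 - d)


def apply_gains(samples, g1: float, g2: float):
    """Scale one mono signal into left and right channel signals."""
    return [g1 * s for s in samples], [g2 * s for s in samples]
```

With the speaker in the middle of the picture (D = 0) both gains are equal; with the speaker at the left edge (D = 1) the whole signal goes to the left channel. Keeping C fixed holds the sum of the squared gains constant as the speaker moves across the picture.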

Sound can also be processed according to sound source position information using HRTFs (Head-Related Transfer Functions). Techniques for synthesizing a virtual sound source with HRTFs are disclosed in the existing technical literature and are not described in detail here.

In the method provided by embodiments of the invention, the speaker's position information is obtained by detection at the playback site, so the receiving end does not depend on the sending end to provide it. Once obtained, the position information is used to process the voice information being played back, so that the played-back sound is accurately matched to the speaker's position in the image.

It should be noted that the audio code stream processing method provided by the invention is not limited to audio streams received from a sending end; it applies equally to video and audio streams stored locally.

Embodiments of the invention also provide a video terminal. As shown in Fig. 4, the video communication terminal contains a video decoding module, an audio decoding module, a sound source position detection module, and a sound direction processing module. After being decoded by the video decoding module, the compressed video stream is output both to the television set for display and to the sound source position detection module. The sound source position detection module receives the image output by the video decoding module, examines it, and extracts features of the sound source to obtain sound source position information, which it outputs to the sound direction processing module. The compressed audio stream is decoded by the audio decoding module and output to the sound direction processing module. The sound direction processing module processes the received audio according to the sound source position information so that the direction of the processed sound is consistent with the position of the sound source, and produces left and right audio outputs that are fed to the left and right loudspeakers for playback. For better playback, the video communication terminal can be connected to three or more external loudspeakers, in which case the sound direction processing module outputs three or more audio channels accordingly.
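The data flow between the four modules of Fig. 4 can be sketched as a simple wiring of callables. Everything named here (`VideoTerminal`, `handle`, the callable signatures, and the use of a relative offset as the position information) is a hypothetical illustration of the structure, not an interface defined by the text; the actual decoders and detectors are passed in from outside and are not shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Samples = List[float]


@dataclass
class VideoTerminal:
    """Wires the four modules of Fig. 4 together (illustrative sketch)."""

    decode_video: Callable[[bytes], Any]             # video decoding module
    decode_audio: Callable[[bytes], Samples]         # audio decoding module
    detect_source: Callable[[Any], Optional[float]]  # sound source position detection
    process_direction: Callable[                     # sound direction processing
        [Samples, Optional[float]], Tuple[Samples, Samples]
    ]

    def handle(self, video_packet: bytes, audio_packet: bytes):
        # The decoded image goes both to the display and to position detection.
        image = self.decode_video(video_packet)
        position = self.detect_source(image)
        # The decoded audio goes to direction processing with the position.
        samples = self.decode_audio(audio_packet)
        left, right = self.process_direction(samples, position)
        return image, left, right
```

Passing the modules in as callables keeps the sketch independent of any particular codec or detection method, and a terminal with three or more loudspeakers would only need a `process_direction` that returns more channels.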

The purpose of the sound source position detection module is to examine the image output by the video decoding module and obtain the position information of the sound source in it. When the sound source is a speaker, position detection can be done by extracting the speaker's lip features or by detecting other features such as the speaker's face; any approach is acceptable as long as the module can detect the speaker's position in the image output by the video decoding module.

If the speaker's lips are used as the feature for detecting the speaker's position, the sound source position detection module comprises:

a first receiving module, configured to receive the image containing the speaker sent by the video decoding module;

a feature extraction module, configured to extract the lip features of the speaker from the image received by the first receiving module;

a position detection module, configured to determine the speaker's position from the lip features extracted by the feature extraction module.

The lip-movement position can be detected with the lip-movement detection method described above.

The sound direction processing module comprises:

a second receiving module, configured to receive the voice information sent by the audio decoding module and the speaker's position information sent by the position detection module;

a matching module, configured to match the direction of the played-back sound to the speaker's position according to the voice information and the speaker's position information received by the second receiving module.

In summary, the above are only preferred embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. An audio code stream processing method, characterized by comprising:

2. The method of claim 1, characterized in that, when the sound source is a speaker, detecting the position information of the sound source in the image is specifically:

extracting the speaker's lip features from the image and detecting the lip-movement position from the lip features, thereby determining the speaker's position information.

3. The method of claim 2, wherein, if a lip-movement position was detected in the previous frame obtained by decoding the compressed video stream, the current frame is checked for a lip near the lip-movement position of the previous frame.

4. The method of claim 2, characterized in that, when the voice is played back through at least two loudspeakers, processing the voice information according to the position information of the sound source is specifically:

adjusting the amplitudes of the left and right channels of the loudspeakers so that the horizontal direction of the sound matches the speaker's position.

5. The method of claim 2, characterized in that detecting the position information of the sound source in the image further comprises:

when the image contains multiple lip-movement positions, calculating the center of the multiple lip-movement positions and taking that center as the output position of the speaker.

6. The method of claim 2, characterized in that the lip features include the color of the lips.

7. The method of claim 6, characterized in that, after the lip position is determined from the color of the lips, it is further judged whether the color around the lips is skin color.

8. The method of claim 6 or 7, wherein, after the lip position is detected, it is further judged whether the lip is moving; if it is moving, the position of the moving lip is taken as the lip-movement position; otherwise, a predetermined number of frames is set and the lip-movement position is held unchanged for that many frames after the current frame, and if the lip has not moved once the predetermined number of frames is exceeded, the search for a lip-movement position restarts over the whole image.

9. A video terminal, characterized in that it comprises:

10. The device of claim 9, characterized in that the sound source position detection module comprises:

a position detection module, configured to determine the speaker's position from the lip features extracted by the feature extraction module.

11. The device of claim 10, characterized in that the sound direction processing module comprises: