WO2011120407A1 - Realization method and apparatus for video communication - Google Patents

Realization method and apparatus for video communication

Info

Publication number
WO2011120407A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
remote user
user
remote
playback
Prior art date
Application number
PCT/CN2011/072198
Other languages
French (fr)
Chinese (zh)
Inventor
岳中辉
Original Assignee
华为终端有限公司 (Huawei Device Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司 (Huawei Device Co., Ltd.)
Publication of WO2011120407A1 publication Critical patent/WO2011120407A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Definitions

  • The present invention relates to the field of communications, and in particular, to a method and an apparatus for implementing video communication.
  • The video conferencing service uses multimedia communication technology to hold conferences over audio and video input and output devices and communication networks, and can simultaneously realize image, voice, and data interaction between two or more sites.
  • The method for implementing video communication provided by the prior art is: receive the image and sound data sent by the video conference terminal of another site that communicates with the video conference terminal of the local site, and process the sound data with a two-channel stereo coding and decoding scheme. The local site obtains the left-channel sound data sent by the other site and plays it from the speaker on the left side of the local site, and obtains the right-channel sound data sent by the other site and plays it from the speaker on the right side of the local site.
  • The prior art solution uses a two-channel stereo codec to process the sound data: the sound picked up by the left channel is played from the left speaker and the sound picked up by the right channel is played from the right speaker, forming a two-channel listening area. The central sound image of the two channels is unstable, sometimes drifting left or right, and can deviate considerably from the image, so the user can only roughly distinguish the left, middle, and right directions; the sound localization is difficult to make accurate and fine.
  • Embodiments of the present invention provide a method and an apparatus for implementing video communication, which enable the orientation from which the local user in a video communication hears the remote user's voice to be basically consistent with the orientation of the remote user's image seen by the local user, enhancing the user's sense of presence.
  • An embodiment of the present invention provides a method for implementing video communication, where the method includes:
  • After the local device establishes a connection with the remote device, obtain the head position information of the remote user; determine, according to the head position information of the remote user, the speaker playback mode corresponding to the remote user; and, when a remote user speaks, perform playback according to the speaker playback mode corresponding to the speaker.
  • The embodiment of the present invention further provides an apparatus for implementing video communication, where the apparatus includes: an acquiring unit, configured to acquire, after the local device establishes a connection with the remote device, the head position information of the remote user; and a playback control unit, configured to determine, according to the head position information of the remote user, the speaker playback mode corresponding to the remote user and, when the remote user speaks, to perform playback according to the speaker playback mode corresponding to the speaker.
  • The present invention further provides a system for implementing video communication, the system comprising a remote device, a local device, and a media server. The remote device is configured to collect the video and audio data of the remote user and send them to the media server; the media server is configured to exchange the video and audio data of the remote device and the local device; and the local device is configured to, after the local user establishes a connection with the remote user, determine the speaker playback mode corresponding to the remote user according to the acquired head position information of the remote user and, when the remote user speaks, perform playback according to the speaker playback mode corresponding to the speaker.
  • The present invention further provides a video communication system, the system comprising a remote device, a local device, and a multipoint control unit media server. The remote device is configured to collect the video and audio data of the remote user and send them to the media server. The media server is configured to exchange the video and audio data of the remote device and the local device and, after the local user establishes a connection with the remote user, to determine the speaker playback mode corresponding to the remote user according to the acquired head position information of the remote user; when the remote user speaks, it sends a playback command to the local device according to the speaker playback mode corresponding to the speaker. The local device is configured to control the local playback apparatus to play according to the playback command.
  • The technical solutions of the embodiments of the present invention obtain the head information of the remote user after the local device establishes a connection with the remote device, establish the corresponding speaker playback mode according to that head information, and control the playback of the speakers with that playback mode, so that the orientation from which the local user hears the remote user's voice is basically consistent with the orientation of the remote user's image seen by the local user, enhancing the user's sense of presence.
  • FIG. 1 is a flowchart of a method for implementing video communication according to the present invention
  • FIG. 2 is a diagram of a flat panel speaker array according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a flat panel speaker array according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for implementing video communication according to an embodiment of the present invention
  • FIG. 5 is a flowchart of a method for implementing video communication according to another embodiment of the present invention;
  • FIG. 6 is a flowchart of a method for implementing video communication according to yet another embodiment of the present invention;
  • FIG. 7 is a structural diagram of an apparatus for implementing video communication according to the present invention;
  • FIG. 8 is a structural diagram of a system for implementing video communication according to the present invention.
  • FIG. 9 is a technical scenario diagram of a method according to Embodiment 1 of the present invention.
  • FIG. 10 is a schematic diagram of the upper and lower arrangement of the speaker provided by the present invention.
  • FIG. 11 is a schematic diagram of the left-and-right arrangement of the speakers provided by the present invention.
  • An embodiment of the present invention provides a method for implementing video communication. The method, as shown in FIG. 1, includes the following steps:
  • S11. After the local device establishes a connection with the remote device, obtain the head position information of the remote user.
  • The video communication device of the local site and the video communication device of the remote site establish a connection through the network. The local site is referred to as the "local end" and the remote site is referred to as the "remote end".
  • The head position information of the remote user may be obtained by an image processing method, for example face recognition technology, or it may be obtained manually, that is, by assigning a fixed position to each far-end participant, so that the area information of the head position is itself determined.
  • S12. Determine, according to the head position information of the remote user, the speaker playback mode corresponding to the remote user.
  • S13. When the remote user speaks, play the sound according to the speaker playback mode corresponding to the speaker.
  • Optionally, the speaking remote user may be determined, for example, by applying face recognition technology to the image of the remote users to identify the speaker, or by having the media server (a Multipoint Control Unit (MCU) is taken as an example) determine the speaker among the remote users from the audio streams transmitted by the remote microphones.
  • The specific method for the media server to determine the speaker among the remote users from the audio streams transmitted by the remote microphones may be as follows. Take three remote users as an example (in practice the number of users may be different). The remote site assigns one microphone to each of the three participants, for example microphone 1 to user A, microphone 2 to user B, and microphone 3 to user C. If the media server receives the audio stream transmitted by microphone 1, it confirms that user A is speaking; similarly, when it receives the stream of microphone 2 it confirms that user B is speaking, and when it receives the stream of microphone 3 it confirms that user C is speaking. The speaker is thus determined through the correspondence between the microphones and the participants.
  • the manner of confirming the user's speech in the above example is only an example for implementing the present invention. In practical applications, the present invention does not limit the specific method of confirming the user's speech, as long as it can confirm the user's speech.
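As an illustration only, the microphone-to-participant correspondence described above can be kept as a simple lookup on the media server side. The sketch below is a hypothetical example (the names MIC_TO_USER and identify_speaker are not taken from the patent) of confirming the speaking user from the identifier of the microphone whose audio stream is currently being received.

```python
# Minimal sketch of the microphone-to-participant correspondence described above.
# MIC_TO_USER and identify_speaker are illustrative names, not part of the patent.

MIC_TO_USER = {1: "A", 2: "B", 3: "C"}  # microphone id -> remote participant

def identify_speaker(active_mic_id: int) -> str:
    """Return the remote participant assigned to the microphone whose audio
    stream the media server is currently receiving."""
    try:
        return MIC_TO_USER[active_mic_id]
    except KeyError:
        raise ValueError(f"no participant is assigned to microphone {active_mic_id}")

# Example: an audio stream arrives from microphone 2, so user B is the speaker.
assert identify_speaker(2) == "B"
```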
  • Optionally, playback according to the speaker playback mode corresponding to the speaker may be implemented as follows: the local device controls the playback device corresponding to the speaker to play the sound according to the speaker playback mode corresponding to the speaker; alternatively, the media server sends a playback command to the local device according to the speaker playback mode corresponding to the speaker, and the local device controls the playback device corresponding to the speaker to play the sound according to that command.
  • Optionally, when the speakers form a flat panel speaker array, the method for implementing S12 and S13 may specifically be: confirm the corresponding speaker in the flat panel speaker array according to the head position information of the remote user and, when that remote user speaks, activate the speaker corresponding to the speaking user to play the sound.
  • Optionally, when the speakers are arranged one above the other, the method for implementing S12 and S13 may specifically be: display the images of the remote users one above another, calculate the vertical distance from the center of the remote user's head position to the center of the displayed image, and calculate the ratio of that vertical distance to the total height of the displayed image.
  • When the volume difference between the upper and lower speakers is 0, the output effect is that the sound is heard from the middle of the upper and lower speakers.
  • According to binaural stereo theory, when the difference between the upper and lower speaker volumes is greater than or equal to 15 dB, the sound heard by the user appears to come from the upper speaker; when the difference is less than or equal to −15 dB, that is, the lower speaker is more than 15 dB louder than the upper one, the sound appears to come from the lower speaker; and when the difference is between −15 dB and +15 dB, the sound is heard from some height between the upper and lower speakers. The position corresponding to the combined output of the upper and lower speakers can be regarded as a virtual sound source.
  • Specifically, the relationship between the volume difference of the upper and lower speakers and the position of the virtual sound source can be roughly evaluated by the following formula:
  • difference between the volumes of the upper and lower speakers = 8X × (0.5 − ratio of the vertical distance to the total height of the displayed image) dB (Formula 1)
  • In this formula, the 8 means that the height corresponding to the entire display device is divided into 8 equal parts, and the virtual sound source falls into one of these 8 intervals; because of the hearing characteristics of the human ear, a finer division is difficult to perceive, and a person skilled in the art may adopt a different division according to the height of the display device and the characteristics of the sound source. The term 8 × (0.5 − ratio of the vertical distance to the total height of the displayed image) is the distance, in such parts, of the virtual sound source from the upper speaker; for example, if the total height of the display device is 100 cm and the user's head is at 75 cm, the vertical distance is 25 cm and 8 × (0.5 − 25/100) = 2, meaning the upper speaker should be louder than the lower speaker by 2 parts. The parameter X in Formula 1 indicates the number of dB needed to shift the virtual sound source by one part away from the midpoint between the two speakers; its value is related to the height of the display device and the distance between the user and the display device, and it is difficult to give a specific formula, so only a range is given for the user to adjust, with X in [0, 15 dB].
  • The volumes of the upper and lower speakers are then adjusted according to the difference and the sound is played. A specific example illustrates the adjustment: assume the difference between the upper and lower speakers is 3 dB; then the volume of the upper speaker is controlled to be 73 dB and the volume of the lower speaker to be 70 dB, where the volume of the upper speaker is the reference volume. The reference volume can be set by the user, for example the 73 dB above, or 63 dB, 53 dB, 60 dB, and so on. The difference can of course also be −3 dB, in which case the control method can be: control the volume of the upper speaker to be 70 dB and the volume of the lower speaker to be 73 dB; here the volume of the upper speaker is again the reference volume, and the specific volume value can also be set by the user. The X above is the sound coefficient set by the user.
  • The center and total height of the displayed image are defined differently depending on how the image is displayed: when the image is displayed by projection, they are the center of the projected image and the total height of the projected image; when the image is displayed on a display, they are the center of the display panel and the height of the display panel.
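To make Formula 1 concrete, the following sketch computes the upper and lower speaker volumes for a given head position. It is only an illustration under two assumptions not stated explicitly above: the head position is measured from the top edge of the displayed image (so that a head at the image center gives a ratio of 0.5 and a 0 dB difference), and the reference volume is applied to the louder of the two speakers, as in the 73 dB / 70 dB example. The function and parameter names are invented for the sketch.

```python
# Sketch of applying Formula 1 for the upper/lower speaker arrangement.
# Assumption: head_y_cm is measured downward from the top edge of the displayed
# image, so head_y_cm / image_height_cm is 0.5 at the image center, where the
# volume difference becomes 0 dB.

def vertical_speaker_volumes(head_y_cm: float, image_height_cm: float,
                             x_db: float = 3.0, reference_db: float = 73.0):
    """Return (upper_volume_db, lower_volume_db) for the speaking user."""
    ratio = head_y_cm / image_height_cm            # 0.0 at the top, 1.0 at the bottom
    diff_db = 8.0 * x_db * (0.5 - ratio)           # Formula 1
    diff_db = max(-15.0, min(15.0, diff_db))       # binaural range noted above
    if diff_db >= 0:                               # upper speaker is the louder one
        return reference_db, reference_db - diff_db
    return reference_db + diff_db, reference_db    # lower speaker is the louder one

# Worked example from the text: 100 cm high image, head center 25 cm from the
# top; with X = 1.5 dB the difference is 8 * 1.5 * (0.5 - 0.25) = 3 dB, i.e. the
# 73 dB / 70 dB case described above.
print(vertical_speaker_volumes(25.0, 100.0, x_db=1.5))   # (73.0, 70.0)
```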
  • Optionally, when the speakers are arranged left and right, the method for implementing S12 and S13 may specifically be: display the images of the remote users side by side, calculate the horizontal distance from the center of the remote user's head position to the center of the displayed image, and calculate the ratio of that horizontal distance to the total width of the displayed image. The difference between the volumes of the left and right speakers is then obtained from Formula 2, whose parameters are defined analogously to those of Formula 1 (with the horizontal distance and the total width in place of the vertical distance and the total height), and is not described again here. The volumes of the left and right speakers are adjusted according to the difference and then played; the following example illustrates the specific adjustment.
  • Assume the difference between the left and right speakers is 4 dB; then the volume of the left speaker is controlled to be 44 dB and the volume of the right speaker to be 40 dB, where the volume of the left speaker is the reference volume. The reference volume can be set by the user, for example the 44 dB above, or 54 dB, 60 dB, and so on. The difference can of course also be −4 dB, in which case the control method can be: control the volume of the left speaker to be 40 dB and the volume of the right speaker to be 44 dB; here the volume of the left speaker is again the reference volume, and the specific volume value can also be set by the user. The X above is the sound coefficient set by the user.
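Formula 2 is not written out above; it is only said to mirror Formula 1 with the horizontal distance and the total width of the displayed image in place of the vertical distance and the total height. Under that assumption, a minimal sketch of the left/right case would be (names invented for illustration):

```python
# Sketch of the left/right case, assuming Formula 2 mirrors Formula 1 with the
# horizontal distance and the total image width in place of the vertical
# distance and the total height.  head_x_cm is measured from the left edge.

def horizontal_speaker_volumes(head_x_cm: float, image_width_cm: float,
                               x_db: float = 4.0, reference_db: float = 44.0):
    """Return (left_volume_db, right_volume_db) for the speaking user."""
    diff_db = 8.0 * x_db * (0.5 - head_x_cm / image_width_cm)  # assumed Formula 2
    diff_db = max(-15.0, min(15.0, diff_db))
    if diff_db >= 0:                               # left speaker is the louder one
        return reference_db, reference_db - diff_db
    return reference_db + diff_db, reference_db    # right speaker is the louder one

# Example: head center 60 cm from the left edge of a 160 cm wide image with
# X = 4 dB gives a 4 dB difference, i.e. the 44 dB / 40 dB case described above.
print(horizontal_speaker_volumes(60.0, 160.0))   # (44.0, 40.0)
```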
  • The method provided by the present invention determines the corresponding speaker playback mode according to the head position information of the remote user; when the remote user speaks, the speaker corresponding to the speaking user plays the sound, so that the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, which enhances the user's sense of presence.
  • Embodiment 1 provides a method for implementing video communication. The technical scenario is a system consisting of the local device, the media server, and the remote device; the specific implementation scenario is shown in FIG. 9.
  • The video and audio collection devices A, B, C, D, and E are responsible for collecting the video and audio data of remote users A, B, C, and D and of local user E, respectively. The media server (corresponding to the MCU in FIG. 9) completes the exchange of video data and audio data between the remote device and the local device, and the remote device collects the remote users' video data and audio data and sends them to the media server (MCU); there may be one or more remote devices. The local user and the remote users perform video communication through audio and video input and output devices, where the audio input device is a microphone or a microphone array, the audio output device is a speaker or a speaker array, the video input device is a camera or a camera array, and the video output device is a display or a display array.
  • The display device in this embodiment takes a projector as an example, and a flat panel speaker array is set on the projection plane (as shown in FIG. 2 or FIG. 3).
  • After the connection between the local site and the remote site is established, the remote device starts face recognition technology to determine the head position information of each of users A, B, C, and D.
  • The above method of determining the head position information of A, B, C, and D is described by taking face recognition technology as an example; in practical applications, other methods can be used, such as manually confirming the head position information of A, B, C, and D or using other identification technologies (e.g., iris detection technology). The present invention does not limit the specific method of determining the head position information of A, B, C, and D.
  • The preferred method is to directly collect the participant image information of the remote site at the remote end and use face recognition technology to determine the position information of each participant.
  • The specific method for implementing S42 may be: as shown in FIG. 2, the flat panel speaker array is divided into 36 regions according to the number of speakers, and the head position information of A is determined by face recognition technology to be located in region 11 (as shown in FIG. 3); it is then confirmed that the speaker corresponding to A's head is speaker 11. Similarly, the speakers corresponding to the heads of B, C, and D are determined to be speakers 13, 15, and 17, respectively.
  • In an actual case, the head position information of A determined by face recognition may span several of the regions shown in FIG. 2, for example regions 10 and 11. In this case, the speakers corresponding to A's head are the speakers of all the regions covered by A's head position information: for example, when the head covers regions 10 and 11, the corresponding speakers are determined to be speakers 10 and 11, and when the head covers regions 21, 22, and 23, the corresponding speakers are determined to be speakers 21, 22, and 23.
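As an illustration of the region-to-speaker mapping just described, the sketch below assumes the 36 regions form a 6 × 6 grid numbered row by row from 1 to 36 and that the head position is available as a bounding box normalised to the displayed image; these assumptions and the function name are not taken from the patent.

```python
# Sketch: map a head bounding box (normalised to the displayed image) to the
# speakers of the flat panel array regions it overlaps.  A 6 x 6 row-major grid
# numbered 1..36 is assumed purely for illustration.

ROWS, COLS = 6, 6

def speakers_for_head(x0: float, y0: float, x1: float, y1: float) -> list:
    """Return the indices of the panel speakers whose regions overlap the head
    bounding box (x0, y0)-(x1, y1), with coordinates in [0, 1] and the origin
    at the top-left corner of the displayed image."""
    col0, col1 = int(x0 * COLS), min(int(x1 * COLS), COLS - 1)
    row0, row1 = int(y0 * ROWS), min(int(y1 * ROWS), ROWS - 1)
    return [row * COLS + col + 1
            for row in range(row0, row1 + 1)
            for col in range(col0, col1 + 1)]

# A head entirely inside one region activates a single speaker, while a head
# straddling two neighbouring regions activates both, as described above.
print(speakers_for_head(0.70, 0.20, 0.78, 0.30))   # [11]
print(speakers_for_head(0.62, 0.20, 0.78, 0.30))   # [10, 11]
```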
  • There are various methods for knowing that a remote user is speaking: for example, the change in the mouth shape of the face can be detected from the image, or the speaking remote user can be detected through audio collection.
  • The foregoing control of the local sound output devices may be performed by the local conference device, or may be performed by the media server (corresponding to the MCU in FIG. 9).
  • When the control is performed by the local conference device, the remote device sends the user image information of the remote site to the local conference device through the media server, and the local device establishes the correspondence between the participant information of the remote site and the local sound output devices. When a participant at the remote site speaks, the speaking participant is determined locally by face recognition, and the local sound output devices are then controlled by the local device; in this embodiment, the speaker of the local speaker array corresponding to the remote speaking participant is driven to emit sound, realizing control of the speaker array, so that the orientation from which the local user hears the remote user's voice is consistent with the orientation of the remote user's image seen by the local user, thereby achieving the effect of increasing the user's sense of presence.
  • When the control is performed by the media server, the media server determines the information of the sound output devices of the local conference terminal, which may include the type, number, and arrangement of the sound output devices; it obtains the image information of the users at the remote site, derives the head information of the remote users from that image information, and establishes, for the local site, the correspondence between the head information of the remote users and the sound output devices of the local end. The media server detects the location of the sound source sent by the remote site and then, according to the correspondence between the head information of the remote users and the sound output devices of the local end, determines the output of the corresponding speaker among the sound output devices of the local end. In this way the corresponding processing and control functions are implemented in the media server, the orientation from which the local user hears the remote user's voice is consistent with the orientation of the remote user's image seen by the local user, which increases the user's sense of presence, and the complexity required of the local device to implement this solution is also reduced.
  • The method provided in this embodiment determines the speaker corresponding to each piece of head position information of A, B, C, and D; when one of them speaks, the corresponding speaker is activated for playback, which achieves the purpose that the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, increasing the user's sense of presence.
  • Another embodiment of the present invention provides a video communication implementation method, which is implemented in the following manner:
  • The method provided in this embodiment is implemented in a system consisting of a local device, a media server, and a remote device, where the media server exchanges the video and audio data of the remote device and the local device, and the remote device collects the video and audio data of the remote users and sends them to the media server.
  • The local user and the remote users perform video communication through a display device, which can be a CRT display, an LCD, a plasma display, and so on.
  • One speaker is set at each of the upper and lower center positions of the display device (as shown in FIG. 10). The speakers may also deviate from the center line of the display device; for example, the upper speaker may be set at a position toward the left of the display device and the lower speaker at a position toward the right of the LCD television display device. When the speakers are arranged above and below, the present invention does not limit their left-right positions; it is only necessary to ensure that one speaker is arranged above and one below the display device.
  • Assume the remote users are 4 people, denoted A, B, C, and D, and the local user is denoted E; assume the avatars of A, B, C, and D are arranged from top to bottom in the order A, B, C, D. The avatar position in this embodiment refers to the center position of the avatar's mouth.
  • The above method can be as shown in FIG. 5 and includes the following steps:
  • Step 51: After the local site establishes a connection with the remote site, the remote device determines the head position information of remote users A, B, C, and D by face recognition.
  • Step 52: According to the positions of the A, B, C, and D avatars, calculate the vertical distance from the center of each head position to the center of the displayed image (the center of the display device), and calculate the ratio of that vertical distance to the total height of the displayed image (i.e., the total height of the image displayed by the display device).
  • Step 53: When a remote user speaks, adjust the volumes of the upper and lower speakers according to the ratio corresponding to the speaker, and play the sound at the adjusted volumes.
  • The volume of the upper speaker is set to the reference volume value, which may be 40 dB but can also be another volume value; that is, the volume of the upper speaker is controlled to be 40 dB, and the volume of the lower speaker is then obtained from the calculated difference.
  • The technical effects of this embodiment are explained below by the principle on which it is based. Experiments show that when the human ear hears two sound sources (for example, one above the other), what is actually perceived is sound coming from a single location, generally called a virtual sound source. For example, when the volumes of the two sound sources are the same, the synthesized virtual sound source is at the center position between the two sound sources; if the sound sources are arranged one above the other and the upper source is louder, the synthesized virtual sound source is close to the position of the upper source; similarly, if the lower source is louder, the position of the synthesized virtual sound source is close to the position of the lower source.
  • Therefore, the position of the synthesized virtual sound source can be adjusted by controlling the volumes of the upper and lower sound sources (the speakers in this embodiment). When the position of the virtual sound source is adjusted to the image position of the speaking user, the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, thereby increasing the user's sense of presence.
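To make the panning principle concrete, the sketch below inverts Formula 1: given the volumes of the upper and lower speakers and the user-set coefficient X, it estimates where on the display (as a fraction of its height, measured from the top) the virtual sound source will be perceived. The function name and the clipping behaviour are assumptions made for the illustration.

```python
# Sketch: estimate the perceived virtual sound source position by inverting
# Formula 1.  The result is a fraction of the display height measured from the
# top (0.0 = at the upper speaker, 0.5 = centre, 1.0 = at the lower speaker).

def virtual_source_position(upper_db: float, lower_db: float, x_db: float) -> float:
    if x_db <= 0:
        return 0.5                               # no panning range configured
    diff_db = upper_db - lower_db
    ratio = 0.5 - diff_db / (8.0 * x_db)         # Formula 1 solved for the ratio
    return min(1.0, max(0.0, ratio))             # keep the result on the display

# Equal volumes: the source is perceived at the centre of the display.
print(virtual_source_position(70.0, 70.0, x_db=3.0))   # 0.5
# Upper speaker 6 dB louder with X = 3 dB: the source sits a quarter of the way
# down from the top, matching a head displayed in the upper half of the image.
print(virtual_source_position(73.0, 67.0, x_db=3.0))   # 0.25
```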
  • The method provided by this embodiment calculates, from the head position information of A, B, C, and D, the ratio of the vertical distance from each avatar to the center of the displayed image to the total height of the displayed image, and controls the volumes of the upper and lower speakers according to this ratio before playback, achieving the purpose that the orientation from which the local user hears the remote user's voice is consistent with the orientation of the remote user's image seen by the local user, which increases the user's sense of presence.
  • When the speakers of the display device are arranged horizontally, the avatars may be displayed side by side; the ratio is then changed to the ratio of the horizontal distance from the avatar to the center of the displayed image to the total width of the displayed image, and the volume difference is calculated according to Formula 2.
  • The horizontal arrangement may place one speaker at each of the left and right center-line positions of the display device (as shown in FIG. 11). The speakers may also deviate from the center line of the display device, for example with the left speaker set at an upper position of the display device and the right speaker at a lower position. For the horizontal arrangement, the present invention does not limit the specific up-down positions of the speakers; it is only necessary to set one speaker on the left and one on the right of the display device.
  • When the volumes of the two sound sources are the same, the synthesized virtual sound source is at the center position between the two sound sources; if the sound sources are arranged left and right and the left source is louder, the synthesized virtual sound source is close to the left source; similarly, if the right source is louder, the position of the synthesized virtual sound source is close to the position of the right source.
  • Therefore, the position of the synthesized virtual sound source can be adjusted by controlling the volumes of the left and right sound sources (the speakers in this embodiment). When the position of the virtual sound source is adjusted to the image position of the speaking user, the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, thereby increasing the user's sense of presence.
  • The present invention provides a further embodiment, implemented in a system consisting of a local device, a media server, and a remote device. The media server completes the exchange of video and audio data between the remote device and the local device, and the remote device collects the video and audio data of the remote users and sends them to the media server. The local user and the remote users perform video communication through projection, and a flat panel speaker array is set on the projection plane (as shown in FIG. 2). Assume the remote users are 4 people, denoted A, B, C, and D, and the remote device assigns microphones 1, 2, 3, and 4 to A, B, C, and D respectively; the local user is denoted E. The method can then be as shown in FIG. 6 and includes:
  • S61: The remote device uses face recognition to determine the head position information of users A, B, C, and D.
  • The method for implementing S61 may specifically be as follows: the above determination of the head position information of A, B, C, and D is described by taking face recognition as an example; in practical applications other methods may be used, such as manually confirming the position information of A, B, C, and D, i.e., determining the position information of the participants at the site by assigning positions to them. The present invention does not limit the specific method of determining the head position information of users A, B, C, and D. The preferred method is to directly collect the participant image information of the remote site at the remote end and use face recognition technology to determine the position information of each participant.
  • S62: The local device determines, according to the head position information of A, B, C, and D, the positions of the speakers in the flat panel speaker array corresponding to the heads of A, B, C, and D respectively.
  • The specific method for implementing S62 may be: as shown in FIG. 2, the flat panel speaker array is divided into 36 regions according to the number of speakers, and the head position information of A is determined by face recognition to be located in region 11; it is then confirmed that the speaker corresponding to A's head is speaker 11. Similarly, the speakers corresponding to the heads of B, C, and D are determined to be speakers 13, 15, and 17, respectively. In an actual case, the head position information of A determined by face recognition may be located in several of the regions shown in FIG. 2, such as regions 10 and 11; in this case, the speakers corresponding to A's head are the speakers of all the regions covered by A's head position information: for example, when the head covers regions 10 and 11 the speakers are determined to be speakers 10 and 11, and when the head covers regions 21, 22, and 23 the speakers are determined to be speakers 21, 22, and 23.
  • S63: The media server determines, according to the audio code stream sent by microphone 1, that A is the speaker, and sends the audio code stream from microphone 1 and the information identifying A as the speaker to the local device.
  • The actual method for implementing S63 may be as follows: since remote users A, B, C, and D are assigned microphones 1, 2, 3, and 4 respectively, the media server establishes the correspondence between microphone 1 and user A and, similarly, the correspondences between microphone 2 and user B, microphone 3 and user C, and microphone 4 and user D. When the media server detects the audio stream sent by microphone 1, it determines, according to the correspondence between microphone 1 and user A, that user A is speaking, and sends the audio code stream sent by microphone 1 together with the information determining that A is the speaker to the local device.
  • The local device activates the speaker corresponding to A to play the audio code stream sent by microphone 1.
  • The steps performed by the foregoing local device may also be performed by the media server controlling the local device.
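The flow of steps S61-S63 and the local playback step above can be summarised schematically. The sketch below is only an outline under assumed names and interfaces (the mapping tables and the helpers send_to_local and play_on are placeholders, not defined in the patent); as noted above, the local step could equally be driven by the media server.

```python
# Schematic outline of the S61-S63 flow plus the local playback step.
# The mapping tables and the helpers send_to_local / play_on are placeholders.

MIC_TO_USER = {1: "A", 2: "B", 3: "C", 4: "D"}                    # held by the media server
USER_TO_SPEAKERS = {"A": [11], "B": [13], "C": [15], "D": [17]}   # built by the local device (S62)

def media_server_step(active_mic_id, audio_stream, send_to_local):
    """S63: identify the speaking user from the active microphone and forward
    the audio stream plus the speaker identity to the local device."""
    user = MIC_TO_USER[active_mic_id]
    send_to_local({"speaker": user, "audio": audio_stream})

def local_playback_step(message, play_on):
    """Local playback: activate the panel speakers mapped to the speaking
    user's head position and play the forwarded audio stream through them."""
    for speaker_id in USER_TO_SPEAKERS[message["speaker"]]:
        play_on(speaker_id, message["audio"])

# Tiny demonstration with stub transports:
inbox = []
media_server_step(1, b"<audio frames>", inbox.append)
local_playback_step(inbox[0], lambda sid, audio: print(f"play on panel speaker {sid}"))
```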
  • In the method provided by this embodiment, the local device determines the speakers corresponding to the head position information of A, B, C, and D, the media server determines the speaking user, and the local device activates the speaker corresponding to the speaking user to play the sound, which achieves the purpose that the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, increasing the user's sense of presence.
  • The present invention also provides an apparatus for implementing video communication, which is shown in FIG. 7, where the dotted-line modules represent optional modules. The apparatus specifically includes:
  • an obtaining unit 71, configured to acquire, after the local user establishes a connection with the remote user, the head position information of the remote user; and
  • a playback control unit 72, configured to determine, according to the head position information of the remote user, the speaker playback mode corresponding to the remote user and, when the remote user speaks, to play the sound according to the speaker playback mode corresponding to the speaker.
  • Optionally, the playback control unit 72 includes: an array module 721, configured to confirm the corresponding speaker in the flat panel speaker array according to the head position information of the remote user; and a playing module 722, configured to, when the remote user speaks, activate the speaker corresponding to the speaking user for playback.
  • Optionally, the playback control unit 72 includes: a height calculation module 723, configured to display the images of the remote users one above another, calculate the vertical distance from the center of the remote user's head position to the center of the displayed image, and calculate the ratio of that vertical distance to the total height of the displayed image; and a vertical playback module 724, configured to adjust the volumes of the upper and lower speakers according to the volume difference between the upper and lower speakers and then play; the method for calculating the difference between the upper and lower speakers is described in Formula 1.
  • Optionally, the playback control unit 72 includes: a width calculation module 725, configured to display the images of the remote users side by side, calculate the horizontal distance from the center of the remote user's head position to the center of the displayed image, and calculate the ratio of that horizontal distance to the total width of the displayed image; and a horizontal playback module 726, configured to adjust the volumes of the left and right speakers according to the volume difference between the left and right speakers and then play; the difference between the volumes of the left and right speakers is described in Formula 2.
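As a structural sketch only, the units and optional modules of FIG. 7 could be organised as below. The class and method names are invented for illustration, the face detector is left abstract, and the volume-difference helpers follow Formula 1 and its assumed horizontal analogue (Formula 2).

```python
# Structural sketch of the apparatus of FIG. 7: an obtaining unit 71 plus a
# playback control unit 72 whose optional modules mirror 721-726.  All names
# are illustrative.

class ObtainingUnit:
    """Unit 71: acquires the remote user's head position information."""
    def head_positions(self, remote_frame):
        raise NotImplementedError  # e.g. face recognition; outside this sketch

class PlaybackControlUnit:
    """Unit 72: derives the playback mode and drives the speakers."""
    def __init__(self, x_db: float = 3.0):
        self.x_db = x_db                           # user-set sound coefficient X

    # array module 721 / playing module 722
    def panel_speakers(self, head_box, grid=(6, 6)):
        rows, cols = grid
        x0, y0, x1, y1 = head_box
        return [r * cols + c + 1
                for r in range(int(y0 * rows), min(int(y1 * rows), rows - 1) + 1)
                for c in range(int(x0 * cols), min(int(x1 * cols), cols - 1) + 1)]

    # height calculation module 723 / vertical playback module 724
    def vertical_difference_db(self, head_y, image_height):
        return 8.0 * self.x_db * (0.5 - head_y / image_height)    # Formula 1

    # width calculation module 725 / horizontal playback module 726
    def horizontal_difference_db(self, head_x, image_width):
        return 8.0 * self.x_db * (0.5 - head_x / image_width)     # assumed Formula 2
```

Such a unit could run either in the local device or in the media server, matching the deployment options listed below.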
  • The device may be a separately existing device; it may also be installed in the local device or in the media server.
  • The device provided by the present invention determines the corresponding speaker playback mode according to the head position information of the remote user; when the remote user speaks, the speaker corresponding to the speaking user plays the sound, so that the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, which increases the user's sense of presence.
  • The present invention also provides a system for implementing video communication. As shown in FIG. 8, the system includes a remote device 81, a local device 82, and a media server 83.
  • The remote device 81 is configured to collect the video and audio data of the remote user and send them to the media server.
  • The media server 83 is configured to exchange the video and audio data of the remote device 81 and the local device 82.
  • The local device 82 is configured to, after the local user establishes a connection with the remote user, determine the speaker playback mode corresponding to the remote user according to the acquired head position information of the remote user and, when the remote user speaks, perform playback according to the speaker playback mode corresponding to the speaker.
  • The local device 82 in the system provided by the present invention can determine the corresponding speaker playback mode according to the head position information of the remote user; when the remote user speaks, the speaker corresponding to the speaking user plays the sound, so that the orientation from which the local user hears the remote user's voice is substantially consistent with the orientation of the remote user's image seen by the local user, which increases the user's sense of presence.
  • The present invention also provides another video communication system, the system comprising a remote device, a local device, and a media server.
  • The remote device is configured to collect the video and audio data of the remote user and send them to the media server.
  • The media server is configured to exchange the video and audio data of the remote device and the local device and, after the local user establishes a connection with the remote user, to determine the speaker playback mode corresponding to the remote user according to the acquired head position information of the remote user; when the remote user speaks, it sends a playback command to the local device according to the speaker playback mode corresponding to the speaking user.
  • The local device is configured to control the local playback apparatus to play according to the playback command.
  • The media server in the system provided by the present invention can determine the corresponding speaker playback mode according to the head position information of the remote user; when the remote user speaks, the speaker corresponding to the speaking user plays the sound, achieving the purpose that the orientation from which the local user hears the remote user's voice is basically the same as the orientation of the remote user's image seen by the local user, which increases the user's sense of presence.
  • The technical solutions provided by the specific embodiments of the present invention have the advantage that the orientation from which the local user hears the remote user's voice is the same as the orientation of the remote user's image seen by the local user, which increases the user's sense of presence.

Abstract

Embodiments of the present invention disclose a realization method and apparatus for video communication, which relate to the field of communication technologies. The method includes: after a local user has established a connection with a remote user, obtaining the head position information of the remote user; determining the loudspeaker playback manner corresponding to the remote user according to the head position information of the remote user; and, when the remote user is speaking, performing playback according to the loudspeaker playback manner corresponding to the speaker. The method and apparatus above enable the direction from which the local user hears the remote user's sound to be consistent with the direction of the remote user's image watched by the local user, thus improving the user's telepresence.

Description

Realization method and apparatus for video communication. This application claims priority to Chinese Patent Application No. 201010137021.X, filed with the Chinese Patent Office on March 30, 2010 and entitled "Realization method and apparatus for video communication", the entire contents of which are incorporated herein by reference.

Technical Field

The present invention relates to the field of communications, and in particular, to a method and an apparatus for implementing video communication.

Background

The video conferencing service uses multimedia communication technology to hold conferences over audio and video input and output devices and communication networks, and can simultaneously realize image, voice, and data interaction between two or more sites. The method for implementing video communication provided by the prior art is: receive the image and sound data sent by the video conference terminal of another site that communicates with the video conference terminal of the local site, and process the sound data with a two-channel stereo coding and decoding scheme; the local site obtains the left-channel sound data sent by the other site and plays it from the speaker on the left side of the local site, and obtains the right-channel sound data sent by the other site and plays it from the speaker on the right side of the local site.

In the process of implementing the present invention, the inventors have found that the prior art has the following problems:

The prior art solution uses a two-channel stereo codec to process the sound data: the sound picked up by the left channel is played from the left speaker and the sound picked up by the right channel is played from the right speaker, forming a two-channel listening area. The central sound image of the two channels is unstable, sometimes drifting left or right, and can deviate considerably from the image, so the user can only roughly distinguish the left, middle, and right directions; the sound localization is difficult to make accurate and fine.

Summary of the Invention
Embodiments of the present invention provide a method and an apparatus for implementing video communication, which enable the orientation from which the local user in a video communication hears the remote user's voice to be basically consistent with the orientation of the remote user's image seen by the local user, enhancing the user's sense of presence.

An embodiment of the present invention provides a method for implementing video communication, where the method includes: after the local device establishes a connection with the remote device, obtaining the head position information of the remote user; determining, according to the head position information of the remote user, the speaker playback mode corresponding to the remote user; and, when a remote user speaks, performing playback according to the speaker playback mode corresponding to the speaker.

An embodiment of the present invention further provides an apparatus for implementing video communication, where the apparatus includes: an acquiring unit, configured to acquire, after the local device establishes a connection with the remote device, the head position information of the remote user; and a playback control unit, configured to determine, according to the head position information of the remote user, the speaker playback mode corresponding to the remote user and, when the remote user speaks, to perform playback according to the speaker playback mode corresponding to the speaker.

The present invention further provides a system for implementing video communication, the system comprising a remote device, a local device, and a media server. The remote device is configured to collect the video and audio data of the remote user and send them to the media server; the media server is configured to exchange the video and audio data of the remote device and the local device; and the local device is configured to, after the local user establishes a connection with the remote user, determine the speaker playback mode corresponding to the remote user according to the acquired head position information of the remote user and, when the remote user speaks, perform playback according to the speaker playback mode corresponding to the speaker.

The present invention further provides a video communication system, the system comprising a remote device, a local device, and a multipoint control unit media server. The remote device is configured to collect the video and audio data of the remote user and send them to the media server. The media server is configured to exchange the video and audio data of the remote device and the local device and, after the local user establishes a connection with the remote user, to determine the speaker playback mode corresponding to the remote user according to the acquired head position information of the remote user; when the remote user speaks, it sends a playback command to the local device according to the speaker playback mode corresponding to the speaker. The local device is configured to control the local playback apparatus to play according to the playback command.

It can be seen from the technical solutions provided above that the technical solutions of the embodiments of the present invention obtain the head information of the remote user after the local device establishes a connection with the remote device, establish the corresponding speaker playback mode according to that head information, and control the playback of the speakers with that playback mode, so that the orientation from which the local user hears the remote user's voice is basically consistent with the orientation of the remote user's image seen by the local user, enhancing the user's sense of presence.

Brief Description of the Drawings
FIG. 1 is a flowchart of a method for implementing video communication according to the present invention;
FIG. 2 is a diagram of a flat panel speaker array according to an embodiment of the present invention;
FIG. 3 is a diagram of a flat panel speaker array according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for implementing video communication according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for implementing video communication according to another embodiment of the present invention;
FIG. 6 is a flowchart of a method for implementing video communication according to yet another embodiment of the present invention;
FIG. 7 is a structural diagram of an apparatus for implementing video communication according to the present invention;
FIG. 8 is a structural diagram of a system for implementing video communication according to the present invention;
FIG. 9 is a diagram of the technical scenario in which the method of Embodiment 1 of the present invention is implemented;
FIG. 10 is a schematic diagram of the above-and-below arrangement of the speakers provided by the present invention;
FIG. 11 is a schematic diagram of the left-and-right arrangement of the speakers provided by the present invention.

Detailed Description
本发明实施方式提供了一种视频通信的实现方法, 该方法如图 1 所示, 包括如下步骤:  An embodiment of the present invention provides a method for implementing video communication. The method is as shown in FIG. 1 and includes the following steps:
Sll、在本地设备与远端设备建立连接后,获取远端用户的头部位置信息; 其中, 本地会场的视频通信设备与远端会场的视频通信设备通过网络建 立连接。 本地会场简称为 "本端", 远端会场简称为 "远端"。 上述获取远端用户的头部位置信息的具体方法可以为, 通过图像处理的 方法, 譬如: 人脸识别技术, 来获取远端用户的头部位置信息; 或者通过人 工方法获取远端用户的头部位置信息, 即通过为远端与会者分配固定的位置, 进而其头部位置的区域信息本身就是确定的。 Sll. After the local device establishes a connection with the remote device, the location information of the remote user is obtained. The video communication device of the local site and the video communication device of the remote site establish a connection through the network. The local site is referred to as "the local end" and the remote site is referred to as the "remote". The method for obtaining the location information of the remote user's head may be obtained by using an image processing method, such as: face recognition technology, to obtain the location information of the remote user's head; or manually obtaining the head of the remote user. Part location information, that is, by assigning a fixed location to the far-end participant, and thus the area information of the head position itself is determined.
512、根据所述远端用户的头部位置信息确定所述远端用户对应的扬声器 播放方式;  512. Determine, according to the location information of the remote user, a speaker playing manner corresponding to the remote user.
513、 当远端用户发言时, 根据发言者对应的扬声器播放方式进行放音。 可选的, 上述确定远端用户发言的具体方法可以采用以下方法, 例如, 针对远端用户的图像采用人脸识别技术来确定远端用户中的发言者, 还可以 由媒体服务器(以多点控制单元 ( Multipoint Control Unit , MCU )为例)通 过远端麦克传输来的音频码流来判断远端用户的发言者。  513. When the remote user speaks, play the sound according to the speaker playing mode corresponding to the speaker. Optionally, the foregoing specific method for determining the speaking of the remote user may be performed by, for example, using a face recognition technology to determine a speaker in the remote user for the image of the remote user, or by using a media server. The Multipoint Control Unit (MCU) is used as an example to determine the speaker of the remote user through the audio stream transmitted by the remote microphone.
上述媒体服务器通过远端麦克传输来的音频码流来判断远端用户的发言 者的具体方法可以为: 这里以远端用户为 3人为例, 当然实际情况下用户的 人数也可以为其他的数目, 在用户为 3人时, 远端会场为 3个与会者分别设 置一个麦克, 例如对用户 A分配麦克 1, 对用户 B分配麦克 2, 对用户 C分 配麦克 3; 若媒体服务器接收到麦克 1传送的音频码流时, 则确认用户 A发 言, 同理, 当媒体服务器接收到麦克 2的码流时, 确认用户 B发言, 媒体服 务器接收到麦克 3的码流时, 确认用户 C发言, 通过这种麦克风与与会者的 对应关系, 确定讲话的发言者。  The specific method for the media server to determine the speaker of the remote user by using the audio stream transmitted by the remote microphone may be as follows: Here, the remote user is three people, for example, the number of users in the actual situation may also be other numbers. When the user is 3 people, the remote site sets up a microphone for each of the 3 participants, for example, assigning a microphone 1 to user A, assigning a microphone 2 to user B, and assigning a microphone 3 to user C; if the media server receives the microphone 1 When the audio stream is transmitted, it is confirmed that the user A speaks. Similarly, when the media server receives the code stream of the microphone 2, it confirms that the user B speaks, and when the media server receives the code stream of the microphone 3, it confirms that the user C speaks and passes. The correspondence between the microphone and the participant determines the speaker of the speech.
上述举例中的确认用户发言的方式仅为实现本发明而进行的举例, 在实 际应用中, 本发明并不限制确认用户发言的具体方法, 只要其能够确认用户 发言即可。  The manner of confirming the user's speech in the above example is only an example for implementing the present invention. In practical applications, the present invention does not limit the specific method of confirming the user's speech, as long as it can confirm the user's speech.
可选的, 上述根据发言者对应的扬声器播放方式进行放音实现的方法可 以为, 本地设备根据发言者对应的扬声器播放方式控制该发言者对应的放音 设备进行放音; 该方法还可以为, 媒体服务器根据发言者对应的扬声器播放 方式向本地设备发送放音命令, 本地设备根据该放音命令控制该发言者对应 的放音设备进行放音。 Optionally, the method for performing the sound reproduction according to the speaker playing mode corresponding to the speaker may be: the local device controls the sounding device corresponding to the speaker to play according to the speaker playing mode corresponding to the speaker; the method may also be The media server sends a playback command to the local device according to the speaker playing mode corresponding to the speaker, and the local device controls the speaker corresponding according to the playback command. The playback device plays the sound.
可选的, 当扬声器为平板扬声器阵列时, 实现 S12、 S13的方法具体可以 为:  Optionally, when the speaker is a flat panel speaker array, the method for implementing S12 and S13 may specifically be:
根据该远端用户的头部位置信息确认其对应的该平板扬声器阵列中的扬 声器, 当该远端用户发言时, 启动发言者对应的扬声器进行放音。  The speaker in the corresponding flat panel speaker array is confirmed according to the head position information of the remote user, and when the remote user speaks, the speaker corresponding to the speaker is activated to play the sound.
可选的, 当扬声器为上、 下设置时, 实现 S12、 S 13的方法具体可以为: 将远端用户的图像上下显示, 并计算远端用户头部位置中心到显示图像中心 的垂直距离, 计算出该垂直距离与所述显示图像总高度的比值;  Optionally, when the speaker is set up and down, the method for implementing the S12 and the S13 may be: displaying the image of the remote user up and down, and calculating the vertical distance from the center of the remote user's head position to the center of the display image. Calculating a ratio of the vertical distance to the total height of the displayed image;
当上下扬声器音量差值为 0 时, 使得扬声器的输出效果为声音从上下扬 声器的中间方位输出;  When the upper and lower speaker volume difference is 0, the output effect of the speaker is that the sound is output from the middle direction of the up and down speakers;
根据双耳立体声理论, 当上下扬声器音量的差值大于等于 15dB时, 用户 所听到的声音是从上面扬声器输出; 当上下的扬声器音量差值小于等于 -15dB 时, 即下面的扬声器音量大于上面的扬声器 15dB时, 用户所听到的声音是从 下面扬声器输出; 当上下扬声器音量的差值在 (-15〜+15 之间时, 听到的声音 从上下扬声器的中间的某一高度输出。 其中, 通过上下扬声器的共同的输出 所对应的位置可等效为一个虚拟声源。  According to the binaural stereo theory, when the difference between the upper and lower speaker volume is greater than or equal to 15 dB, the sound heard by the user is output from the upper speaker; when the upper and lower speaker volume difference is less than or equal to -15 dB, the lower speaker volume is greater than the above When the speaker is 15dB, the sound heard by the user is output from the lower speaker; when the difference between the upper and lower speaker volume is between (-15~+15), the heard sound is output from a certain height in the middle of the upper and lower speakers. The position corresponding to the common output of the upper and lower speakers can be equivalent to a virtual sound source.
Specifically, the relationship between the upper/lower volume difference and the position of the virtual sound source can be estimated approximately with the following formula:
difference between the upper-loudspeaker volume and the lower-loudspeaker volume (upper minus lower) = 8X × (0.5 − ratio of the vertical distance to the total height of the displayed image) dB (Formula 1);
Explanation of the formula: the factor 8 means that the height of the display device is divided into 8 equal parts, and the virtual sound source falls into one of these 8 intervals. Because of the auditory resolution of the human ear, a much finer division would not be perceptible, so the height of the display device is divided into 8 equal parts. It will be appreciated that a person skilled in the art may use a different division depending on the height of the display device and the characteristics of the sound source.
In the formula, 8 × (0.5 − ratio of the vertical distance to the total height of the displayed image) is the distance of the virtual sound source from the upper loudspeaker, expressed in these divisions. For example, if the total height of the display device is 100 cm and the user's head is at 75 cm, the vertical distance is 25 cm and 8 × (0.5 − 25/100) = 8 × 2/8 = 2, meaning that the upper loudspeaker should be louder than the lower one by 2 divisions. The parameter X in Formula 1 is the number of dB by which the volumes must be adjusted to shift the virtual sound source by one division between the two loudspeakers; it depends on the height of the display device and on the distance between the user and the display, so no exact formula can be given and only a range is provided for the user to adjust, X taking values in [0, 15 dB].
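A minimal sketch of Formula 1 as literally stated above (hypothetical function and parameter names; the difference is taken as upper minus lower, so a positive value drives the upper loudspeaker louder):

```python
# Hypothetical sketch of Formula 1. "x_db_per_division" is the user-set
# coefficient X in [0, 15] dB; the display height is treated as 8 divisions.

def upper_lower_volume_difference_db(vertical_distance: float,
                                     image_height: float,
                                     x_db_per_division: float) -> float:
    """Volume difference (upper loudspeaker minus lower loudspeaker) in dB."""
    if not 0.0 <= x_db_per_division <= 15.0:
        raise ValueError("X is expected to lie in [0, 15] dB")
    ratio = vertical_distance / image_height
    return 8.0 * x_db_per_division * (0.5 - ratio)

# Worked example from the text: display height 100 cm, head 25 cm from the
# image centre -> 8 * (0.5 - 0.25) = 2 divisions, so with X = 3 dB per
# division the upper loudspeaker is driven 6 dB louder than the lower one.
diff = upper_lower_volume_difference_db(vertical_distance=25, image_height=100,
                                        x_db_per_division=3)
assert abs(diff - 6.0) < 1e-9
```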
The volumes of the upper and lower loudspeakers are then adjusted according to this difference before playback. A concrete example of the adjustment: assuming a difference of 3 dB between the upper and lower loudspeakers, the upper loudspeaker is set to 73 dB and the lower loudspeaker to 70 dB, the upper-loudspeaker volume serving as the reference volume. This reference volume can be set by the user, for example to the 73 dB above, or of course to 53 dB, 60 dB, and so on. The difference may also be −3 dB, in which case the upper loudspeaker is set to 70 dB and the lower loudspeaker to 73 dB; here again the upper-loudspeaker volume is the reference volume, and the user may set its specific value.
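Purely to illustrate the adjustment just described (hypothetical helper name; the upper loudspeaker is taken as the reference, as in the example above):

```python
# Hypothetical sketch: apply a signed upper-minus-lower dB difference around a
# user-set reference volume assigned to the upper loudspeaker.

def split_volumes(reference_db: float, upper_minus_lower_db: float):
    """Return (upper_db, lower_db) given the reference volume of the upper loudspeaker."""
    upper_db = reference_db
    lower_db = reference_db - upper_minus_lower_db
    return upper_db, lower_db

# Difference +3 dB with a 73 dB reference -> upper 73 dB, lower 70 dB.
assert split_volumes(73, 3) == (73, 70)
# Difference -3 dB with a 70 dB reference -> upper 70 dB, lower 73 dB.
assert split_volumes(70, -3) == (70, 73)
```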
X above is a sound coefficient set by the user.
The centre and the total height of the displayed image depend on how the image is displayed. When projection is used, they are the centre and the total height of the projected image; when a monitor is used, they are the centre of the display panel and the height of the display panel.
Optionally, when the local loudspeakers are arranged to the left and right, S12 and S13 may be implemented as follows:
the images of the remote users are displayed side by side, the horizontal distance from the centre of a remote user's head position to the centre of the displayed image is calculated, and the ratio of this horizontal distance to the total width of the displayed image is computed;
difference between the left-loudspeaker volume and the right-loudspeaker volume = 8X × (0.5 − ratio of the horizontal distance to the total width of the displayed image) dB (Formula 2);
the parameters in Formula 2 are defined in the same way as in Formula 1 and are not described again here.
The volumes of the left and right loudspeakers are adjusted according to this difference before playback. As an example, assuming a difference of 4 dB between the left and right loudspeakers, the left loudspeaker is set to 44 dB and the right loudspeaker to 40 dB, the left-loudspeaker volume serving as the reference volume. This reference volume can be set by the user, for example to the 44 dB above, or of course to 54 dB, 60 dB, and so on. The difference may also be −4 dB, in which case the left loudspeaker is set to 40 dB and the right loudspeaker to 44 dB; here again the left-loudspeaker volume is the reference volume, and the user may set its specific value.
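For completeness, a minimal sketch of Formula 2 under the same assumptions as the Formula 1 sketch above (hypothetical names):

```python
# Hypothetical sketch of Formula 2: left-minus-right volume difference for a
# left/right loudspeaker arrangement; X is again the user-set coefficient in [0, 15] dB.

def left_right_volume_difference_db(horizontal_distance: float,
                                    image_width: float,
                                    x_db_per_division: float) -> float:
    """Formula 2: difference = 8 * X * (0.5 - horizontal_distance / image_width) dB."""
    if not 0.0 <= x_db_per_division <= 15.0:
        raise ValueError("X is expected to lie in [0, 15] dB")
    return 8.0 * x_db_per_division * (0.5 - horizontal_distance / image_width)

# For instance, a head at ratio 0.375 with X = 4 dB gives a +4 dB difference,
# i.e. the left loudspeaker 4 dB louder than the right one (as in the
# 44 dB / 40 dB example above).
assert abs(left_right_volume_difference_db(0.375, 1.0, 4) - 4.0) < 1e-9
```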
X above is a sound coefficient set by the user.
The method provided by the present invention determines the loudspeaker playback mode corresponding to a remote user from that user's head position information and, when the remote user speaks, activates the playback mode corresponding to the speaker. The direction from which the local user hears the remote user's voice thus essentially matches the position of the remote user's image as seen by the local user, which enhances the user's sense of presence.
To describe the implementation of the present invention more clearly, specific embodiments are given below. Embodiment 1: this embodiment provides a method for implementing video communication in a system consisting of a local device, a media server and remote devices (the scenario is shown in Figure 9, in which the audio/video capture devices A, B, C, D and E capture the audio and video data of remote users A, B, C and D and of local user E respectively). The media server (the MCU in Figure 9) exchanges the video and audio data between the remote devices and the local device; a remote device captures the remote users' video and audio data and sends it to the media server (MCU); there may be one remote device or several. The local user and the remote users communicate through audio/video input and output devices, where the audio input device is a microphone or a microphone array, the audio output device is a loudspeaker or a loudspeaker array, the video input device is a camera or a camera array, and the video output device is a display or a display array. In this embodiment the display device is a projector, and a flat-panel loudspeaker array is placed on the projection plane (as shown in Figure 2 or Figure 3; in Figure 2 the numbers 1 to 36 denote the regions of the array and the corresponding loudspeaker numbers, and in Figure 3 the numbers 1 to 9 denote the regions of the array and the corresponding loudspeaker numbers). In Figure 9 it is assumed that there are four remote users, denoted A, B, C and D, and that the local user is denoted E. The method is shown in Figure 4 and is described here using only the flat-panel loudspeaker array of Figure 2 as an example; it comprises the following steps:
S41: after the local site and the remote site establish a connection, the remote device uses face recognition to determine the head position information of each of the users A, B, C and D.
Face recognition is used here only as an example of determining the head position information of A, B, C and D. In practice other approaches may be used, such as manually entering the head position information of A, B, C and D or using other recognition techniques (for example iris detection), or determining the participants' positions in the room from ergonomic considerations; the present invention does not limit the specific method used to determine the head position information of A, B, C and D.
Optionally, a preferred way of carrying out this step is to capture the participant images of the remote site directly at the remote end and to determine each participant's position from them using face recognition.
S42: the position of the loudspeaker in the flat-panel loudspeaker array corresponding to each of the heads of A, B, C and D is determined from their respective head position information.
S42 may be implemented as follows. As shown in Figure 2, the flat-panel loudspeaker array is divided into 36 regions according to the number of loudspeakers. If face recognition determines that the head position of A lies in region 11 of Figure 2, the loudspeaker corresponding to A's head is loudspeaker 11; likewise the loudspeakers corresponding to the heads of B, C and D are determined to be loudspeakers 13, 15 and 17 respectively. In practice a user's head may span several regions, so face recognition may locate A's head in several regions of Figure 2, for example regions 10 and 11, or regions 21, 22 and 23. In that case the loudspeakers corresponding to A's head are all the loudspeakers of the regions covered by A's head position information: for regions 10 and 11 they are loudspeakers 10 and 11, and for regions 21, 22 and 23 they are loudspeakers 21, 22 and 23.
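A rough illustration of the region lookup described above, assuming a 6 × 6 array numbered 1 to 36 row by row from the top-left (this numbering is an assumption about Figure 2) and head positions normalised to fractions of the display size; all names are hypothetical:

```python
# Hypothetical sketch: map a normalised head position to the loudspeakers of a
# flat-panel array divided into rows x cols regions numbered 1..rows*cols,
# left to right and top to bottom (6 x 6 = 36 regions).

def speakers_for_head(x: float, y: float, head_width: float = 0.0,
                      rows: int = 6, cols: int = 6) -> list[int]:
    """Return the region/loudspeaker numbers covered by a head centred at (x, y).

    x, y and head_width are fractions of the display width/height; a non-zero
    head_width lets a head that straddles a region boundary map to several
    loudspeakers, as in the example with regions 10 and 11.
    """
    col_lo = min(cols - 1, max(0, int((x - head_width / 2) * cols)))
    col_hi = min(cols - 1, max(0, int((x + head_width / 2) * cols)))
    row = min(rows - 1, max(0, int(y * rows)))
    return [row * cols + c + 1 for c in range(col_lo, col_hi + 1)]

# A head centred in the second row, fifth column maps to loudspeaker 11.
print(speakers_for_head(x=0.75, y=0.25))                   # -> [11]
# A wider head straddling two columns maps to loudspeakers 10 and 11.
print(speakers_for_head(x=0.66, y=0.25, head_width=0.2))   # -> [10, 11]
```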
S43: when a remote user speaks, the loudspeaker corresponding to the speaker is activated for playback; for example, when A speaks, the loudspeaker corresponding to A is used for playback.
There are several ways of learning that a remote user is speaking: changes in the mouth shape of the face can be detected by recognition techniques, or audio capture can be used to detect whether the remote user is speaking.
Optionally, the control of the local sound output devices described above may be performed by the local conferencing device or by the media server (the MCU in Figure 9).
When it is performed by the local device, the remote device sends the user image information of the remote site to the local conferencing device through the media server, and the local device establishes the correspondence between the remote-site participants and the local sound output devices. When a participant at the remote site speaks, the speaker is identified by face recognition at the local end, and the local device then controls the local sound output devices; in this embodiment the loudspeaker array is controlled by driving the loudspeakers of the local array that correspond to the remote speaker, so that the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, increasing the user's sense of presence.
When it is performed by the media server, the media server determines the information of the sound output devices of the local conference terminal, which may include their type, number and arrangement. After obtaining the image information of the remote-site users, it derives the remote users' head information from that image information and establishes, for the local site, the correspondence between the remote users' head information and the local sound output devices. Then, when a user at the remote site speaks, the media server detects the sound source position reported from the remote site and, using the correspondence between the remote users' head information and the local sound output devices, determines the corresponding loudspeaker among the local sound output devices to produce the sound. In this way the processing and control functions are implemented in the media server, the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, the user's sense of presence is increased, and the complexity of implementing this solution on the local device is reduced.
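Purely as an illustrative sketch of the media-server-controlled variant (all class, method and parameter names are hypothetical, and the remote site is assumed to report which participant is speaking):

```python
# Hypothetical sketch: the media server keeps a mapping from remote
# participants to local loudspeakers and, when a remote participant speaks,
# sends the local device a playback command naming the loudspeakers to drive.

class LocalDeviceStub:
    """Stand-in for the local device: just records the playback commands."""
    def play(self, audio_frame: bytes, speaker_ids: list[int]) -> None:
        print(f"playing {len(audio_frame)} bytes on loudspeakers {speaker_ids}")

class MediaServerController:
    def __init__(self) -> None:
        self.user_to_speakers: dict[str, list[int]] = {}

    def register_remote_user(self, user_id: str, speaker_ids: list[int]) -> None:
        """Bind a remote user (via their head position) to local loudspeaker numbers."""
        self.user_to_speakers[user_id] = speaker_ids

    def on_remote_speech(self, user_id: str, audio_frame: bytes,
                         local_device: LocalDeviceStub) -> None:
        """Forward the audio together with the loudspeakers to drive."""
        local_device.play(audio_frame, self.user_to_speakers.get(user_id, []))

server = MediaServerController()
server.register_remote_user("A", [11])     # A's head lies in region 11
server.on_remote_speech("A", b"\x00" * 160, LocalDeviceStub())
```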
In the method provided by this embodiment, the loudspeakers corresponding to the head positions of A, B, C and D are determined from their respective head position information, and when a remote user speaks, the loudspeaker corresponding to the speaker is activated for playback. The direction from which the local user hears the remote user's voice thus essentially matches the position of the remote user's image as seen by the local user, which increases the user's sense of presence.
Another embodiment: this embodiment provides a method for implementing video communication in a system consisting of a local device, a media server and a remote device, where the media server exchanges the video and audio data between the remote device and the local device, and the remote device captures the remote users' video and audio data and sends it to the media server. The local user and the remote users communicate through a display device, which may be a CRT monitor, a liquid-crystal display, a plasma display and so on. Assume that one loudspeaker is placed at the midpoint of the upper edge of the display device and one at the midpoint of the lower edge (as shown in Figure 10). The loudspeakers may of course be offset from the centre line of the display device, for example the upper loudspeaker to the left of the centre line and the lower loudspeaker to the right of the centre line of the LCD television's display; with an upper/lower arrangement the present invention does not restrict the exact left/right positions of the loudspeakers, it only requires one loudspeaker above and one below the display device. Assume there are four remote users, denoted A, B, C and D, and that the local user is denoted E; assume also that the head images of A, B, C and D are arranged from top to bottom in the order A, B, C, D. The head-image positions in this embodiment all refer to the centre of the mouth of the head image. The method, shown in Figure 5, comprises the following steps:
Step 51: after the local site and the remote site establish a connection, the remote device determines the head position information of the remote users A, B, C and D by face recognition.
Step 52: from the positions of the head images of A, B, C and D, the vertical distance from the centre of each head position to the centre of the displayed image (the centre of the display device) is calculated, together with the ratio of that vertical distance to the total height of the displayed image (the total height of the image shown on the display device).
Step 53: when a remote user speaks, the volumes of the upper and lower loudspeakers are adjusted according to the ratio corresponding to that speaker, and playback is performed at the adjusted volumes.
A specific way of making the adjustment is as follows. Assume the ratios corresponding to A, B, C and D are 0.125, 0.375, 0.625 and 0.875 respectively; then the upper/lower volume differences calculated with Formula 1 (taking X = 3) are 9 dB, 3 dB, −3 dB and −9 dB respectively. Of course, with other values of X in Formula 1 the differences take other values; for example with X = 2 they are 6 dB, 2 dB, −2 dB and −6 dB, and in practice X may take still other values, the specific value of X being set by the user.
After the user sets a reference volume value, for example taking the upper-loudspeaker volume as the reference and setting it to 40 dB, the upper loudspeaker is driven at 40 dB and the lower loudspeaker at 43 dB (with X = 3 and a ratio of 0.625) or 38 dB (with X = 2 and a ratio of 0.375). Other volume values are of course possible in practice.
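The numbers above can be checked with a small sketch of Formula 1 (hypothetical function name, shown only to reproduce the worked example):

```python
# Hypothetical check of the worked example: ratios of users A-D and the
# resulting upper-minus-lower differences for X = 3 and X = 2 (Formula 1).

def formula1_diff_db(ratio: float, x_db: float) -> float:
    return 8.0 * x_db * (0.5 - ratio)

ratios = {"A": 0.125, "B": 0.375, "C": 0.625, "D": 0.875}

for x in (3, 2):
    diffs = {user: formula1_diff_db(r, x) for user, r in ratios.items()}
    print(x, diffs)
    # X = 3 -> {'A': 9.0, 'B': 3.0, 'C': -3.0, 'D': -9.0}
    # X = 2 -> {'A': 6.0, 'B': 2.0, 'C': -2.0, 'D': -6.0}

# With a 40 dB reference on the upper loudspeaker, user C (ratio 0.625, X = 3)
# gives upper 40 dB and lower 40 - (-3) = 43 dB, matching the text.
assert 40 - formula1_diff_db(0.625, 3) == 43
```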
The technical effect of this embodiment follows from the principle on which it is based. Experiments show that when the human ear hears sound from two sources (for example one above the other), the perceived sound appears to come from a single location, usually called a virtual sound source. For example, when the two sources have the same volume, the synthesised virtual source lies midway between them; with an upper/lower arrangement, if the upper source is louder the virtual source is closer to the upper source, and likewise if the lower source is louder the virtual source is closer to the lower source. In the situation of this embodiment, therefore, when a speaker talks, the position of the synthesised virtual sound source can be adjusted by controlling the volumes of the upper and lower sources (loudspeakers in this embodiment); when the virtual source is steered to the position of the speaker's image, the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, increasing the user's sense of presence.
In the method provided by this embodiment, the ratio of the vertical distance from each head image to the centre of the displayed image to the total height of the displayed image is calculated from the head position information of A, B, C and D, and the volumes of the upper and lower loudspeakers are controlled according to this ratio for playback. The direction from which the local user hears the remote user's voice thus essentially matches the position of the remote user's image as seen by the local user, which increases the user's sense of presence.
In the further embodiment above, when the loudspeakers of the display device are arranged horizontally, the head images can be displayed side by side and the ratio changed to the ratio of the horizontal distance from the head image to the centre of the displayed image to the total width of the displayed image, after which the volume difference is calculated with Formula 2. A horizontal arrangement may place one loudspeaker at the midpoint of the left edge of the display device and one at the midpoint of the right edge (as shown in Figure 11); the loudspeakers may of course be offset from the centre line, for example the left loudspeaker above the centre line of the display device and the right loudspeaker below it. With a horizontal arrangement the present invention does not restrict the exact vertical positions of the loudspeakers, it only requires one loudspeaker to the left and one to the right of the display device.
When the human ear hears sound from two sources (for example to the left and to the right), the perceived sound appears to come from a single location, generally called a virtual sound source. For example, when the two sources have the same volume, the synthesised virtual source lies midway between them; with a left/right arrangement, if the left source is louder the virtual source is closer to the left source, and likewise if the right source is louder the virtual source is closer to the right source. In the situation of this embodiment, therefore, when a speaker talks, the position of the synthesised virtual sound source can be adjusted by controlling the volumes of the left and right sources (loudspeakers in this embodiment); when the virtual source is steered to the position of the speaker's image, the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, increasing the user's sense of presence.
The present invention provides a further embodiment, carried out in a system consisting of a local device, a media server and a remote device, where the media server exchanges the video and audio data between the remote device and the local device, and the remote device captures the remote users' video and audio data and sends it to the media server. The local user and the remote users communicate through a projection, and a flat-panel loudspeaker array is placed on the projection plane (as in Figure 2). Assume there are four remote users, denoted A, B, C and D, to whom the remote device assigns microphones 1, 2, 3 and 4 respectively, and that the local user is denoted E. The method, shown in Figure 6, comprises:
S61: after the local site and the remote site establish a connection, the remote device determines the head position information of users A, B, C and D using face recognition.
S61 may be implemented as follows. Face recognition is used here only as an example of determining the head position information of A, B, C and D; in practice other approaches may be used, such as manually entering the head position information of A, B, C and D or using other recognition techniques, for example determining the participants' positions in the room from ergonomic considerations. The present invention does not limit the specific method used to determine the head position information of users A, B, C and D.
Optionally, a preferred way of carrying out this step is to capture the participant images of the remote site directly at the remote end and to determine each participant's position from them using face recognition.
S62: the local device determines, from the head position information of A, B, C and D, the positions of the loudspeakers in the flat-panel loudspeaker array corresponding to their heads.
S62 may be implemented as follows. As shown in Figure 2, the flat-panel loudspeaker array is divided into 36 regions according to the number of loudspeakers. If face recognition determines that the head position of A lies in region 11 of Figure 2, the loudspeaker corresponding to A's head is loudspeaker 11; likewise the loudspeakers corresponding to the heads of B, C and D are loudspeakers 13, 15 and 17 respectively. In practice face recognition may also locate A's head in several regions of Figure 2, for example regions 10 and 11, or regions 21, 22 and 23; in that case the loudspeakers corresponding to A's head are all the loudspeakers of the regions covered by A's head position information: for regions 10 and 11 they are loudspeakers 10 and 11, and for regions 21, 22 and 23 they are loudspeakers 21, 22 and 23.
S63: when the media server determines from the audio stream sent by microphone 1 that A is speaking, it sends the audio stream from microphone 1 and the information identifying A as the speaker to the local device.
S63 may be implemented as follows. Since the remote users A, B, C and D are assigned microphones 1, 2, 3 and 4 respectively, the media server establishes the correspondence between microphone 1 and user A and, likewise, between microphone 2 and user B, microphone 3 and user C, and microphone 4 and user D. When the media server detects the audio stream sent by microphone 1, it determines from the correspondence between microphone 1 and user A that user A is speaking, and sends the audio stream from microphone 1 and the information identifying A as the speaker to the local device.
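Under the same hypothetical naming as the earlier sketches, the routing step of S63 might look roughly like this; the assumption that each audio packet arrives tagged with its microphone number is illustrative:

```python
# Hypothetical sketch of S63: the media server maps an incoming microphone
# stream to the speaking user and forwards both the audio and the speaker
# identity to the local device.

MIC_TO_USER = {1: "A", 2: "B", 3: "C", 4: "D"}

def route_audio(mic_id: int, audio_frame: bytes, send_to_local_device) -> None:
    """Identify the speaker from the microphone id and forward audio + identity."""
    speaker = MIC_TO_USER[mic_id]
    send_to_local_device({"speaker": speaker, "audio": audio_frame})

# Example: a frame from microphone 1 is forwarded with speaker "A".
route_audio(1, b"\x00" * 160, print)
```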
S64: the local device activates the loudspeaker corresponding to A to play the audio stream sent by microphone 1.
Optionally, the steps performed by the local device above may instead be completed by the media server controlling the local device. In the method provided by this embodiment, the local device determines the loudspeakers corresponding to the head position information of A, B, C and D, and when the media server identifies the speaker, the local device activates the loudspeaker corresponding to that speaker for playback. The direction from which the local user hears the remote user's voice thus essentially matches the position of the remote user's image as seen by the local user, which increases the user's sense of presence.
The present invention further provides an apparatus for implementing video communication, shown in Figure 7, where the dashed-line modules are optional. The apparatus includes:
an obtaining unit 71, configured to obtain the head position information of a remote user after the local user establishes a connection with the remote user; and
a playback control unit 72, configured to determine the loudspeaker playback mode corresponding to the remote user from that user's head position information and, when the remote user speaks, to perform playback according to the loudspeaker playback mode corresponding to the speaker.
Optionally, when the loudspeakers form a flat-panel loudspeaker array, the playback control unit 72 includes:
an array module 721, configured to determine, from the head position information of the remote user, the corresponding loudspeaker in the flat-panel loudspeaker array; and
a playback module 722, configured to activate, when the remote user speaks, the loudspeaker corresponding to the speaker for playback.
Optionally, when the loudspeakers are arranged one above the other, the playback control unit 72 includes:
a height calculation module 723, configured to display the remote users' images one above the other, calculate the vertical distance from the centre of a remote user's head position to the centre of the displayed image, and compute the ratio of that vertical distance to the total height of the displayed image; and
a vertical playback module 724, configured to adjust the volumes of the upper and lower loudspeakers according to their volume difference and then perform playback; the difference between the upper and lower loudspeaker volumes is calculated as described for Formula 1.
Optionally, when the loudspeakers are arranged to the left and right, the playback control unit 72 includes:
a width calculation module 725, configured to display the remote users' images side by side, calculate the horizontal distance from the centre of a remote user's head position to the centre of the displayed image, and compute the ratio of that horizontal distance to the total width of the displayed image; and
a horizontal playback module 726, configured to adjust the volumes of the left and right loudspeakers according to their volume difference and then perform playback; the difference between the left and right loudspeaker volumes is calculated as described for Formula 2.
Optionally, the apparatus may be a stand-alone device; it may of course also be installed in the local device, and in practice it may also be installed in the media server.
The apparatus provided by the present invention determines the loudspeaker playback mode corresponding to a remote user from that user's head position information and, when the remote user speaks, activates the playback mode corresponding to the speaker. The direction from which the local user hears the remote user's voice thus essentially matches the position of the remote user's image as seen by the local user, which increases the user's sense of presence.
The present invention further provides a system for implementing video communication, shown in Figure 8, which includes a remote device 81, a local device 82 and a media server 83.
The remote device 81 is configured to capture the remote users' video and audio data and send it to the media server 83.
The media server 83 is configured to exchange the video and audio data between the remote device 81 and the local device 82.
The local device 82 is configured to determine, after the local user establishes a connection with the remote user, the loudspeaker playback mode corresponding to the remote user from the obtained head position information of the remote user and, when the remote user speaks, to perform playback according to the loudspeaker playback mode corresponding to the speaker.
The local device 82 in the system provided by the present invention can determine the loudspeaker playback mode corresponding to a remote user from that user's head position information and, when the remote user speaks, activate the playback mode corresponding to the speaker, so that the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, which increases the user's sense of presence.
The present invention further provides another video communication system, which includes a remote device, a local device and a media server.
The remote device is configured to capture the remote users' video and audio data and send it to the media server. The media server is configured to exchange the video and audio data between the remote device and the local device, and is further configured to determine, after the local user establishes a connection with the remote user, the loudspeaker playback mode corresponding to the remote user from the obtained head position information of the remote user and, when the remote user speaks, to send a playback command to the local device according to the loudspeaker playback mode corresponding to the speaker. The local device is configured to control the local playback apparatus to play back according to the playback command.
The media server in the system provided by the present invention can determine the loudspeaker playback mode corresponding to a remote user from that user's head position information and, when the remote user speaks, activate the playback mode corresponding to the speaker, so that the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, which increases the user's sense of presence.
Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of preferred embodiments, and that the modules or processes in the drawings are not necessarily required for implementing the present invention.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments above may be carried out by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In summary, with the technical solutions provided by the specific embodiments of the present invention, the direction from which the local user hears the remote user's voice essentially matches the position of the remote user's image as seen by the local user, which has the advantage of increasing the user's sense of presence.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments above may be carried out by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc or the like.
The realization method and apparatus for video communication provided by the present invention have been described in detail above. Those of ordinary skill in the art may, in accordance with the ideas of the embodiments of the present invention, make changes to the specific implementation and scope of application; accordingly, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for implementing video communication, characterised in that the method comprises:
after a local device establishes a connection with a remote device, obtaining head position information of a remote user;
determining, from the head position information of the remote user, a loudspeaker playback mode corresponding to the remote user; and
when the remote user speaks, performing playback according to the loudspeaker playback mode corresponding to the speaker.
2. The method according to claim 1, characterised in that determining, from the head position information of the remote user, the loudspeaker playback mode corresponding to the remote user and, when the remote user speaks, performing playback according to the loudspeaker playback mode corresponding to the speaker specifically comprises:
when the loudspeakers form a flat-panel loudspeaker array, determining, from the head position information of the remote user, the corresponding loudspeaker in the flat-panel loudspeaker array, and, when the remote user speaks, activating the loudspeaker corresponding to the speaker for playback.
3. The method according to claim 1, characterised in that determining, from the head position information of the remote user, the loudspeaker playback mode corresponding to the remote user and, when the remote user speaks, performing playback according to the loudspeaker playback mode corresponding to the speaker specifically comprises:
when the loudspeakers are arranged one above the other, displaying the remote users' images one above the other, calculating the vertical distance from the centre of the remote user's head position to the centre of the displayed image, and computing the ratio of the vertical distance to the total height of the displayed image;
difference between the upper-loudspeaker volume and the lower-loudspeaker volume = 8X × (0.5 − the ratio of the vertical distance to the total height of the displayed image) dB; and adjusting the volumes of the upper and lower loudspeakers according to the difference and then performing playback, where X is a sound coefficient set by the user.
4. The method according to claim 1, characterised in that determining, from the head position information of the remote user, the loudspeaker playback mode corresponding to the remote user and, when the remote user speaks, performing playback according to the loudspeaker playback mode corresponding to the speaker specifically comprises:
when the loudspeakers are arranged to the left and right, displaying the remote users' images side by side, calculating the horizontal distance from the centre of the remote user's head position to the centre of the displayed image, and computing the ratio of the horizontal distance to the total width of the displayed image;
difference between the left-loudspeaker volume and the right-loudspeaker volume = 8X × (0.5 − the ratio of the horizontal distance to the total width of the displayed image) dB; and adjusting the volumes of the left and right loudspeakers according to the difference and then performing playback, where X is a sound coefficient set by the user.
5. An apparatus for implementing video communication, characterised in that the apparatus comprises:
an obtaining unit, configured to obtain head position information of a remote user after a local user establishes a connection with the remote user; and
a playback control unit, configured to determine, from the head position information of the remote user, a loudspeaker playback mode corresponding to the remote user and, when the remote user speaks, to perform playback according to the loudspeaker playback mode corresponding to the speaker.
6. The apparatus according to claim 5, characterised in that, when the loudspeakers form a flat-panel loudspeaker array, the playback control unit comprises:
a position confirmation module, configured to determine, from the head position information of the remote user, the corresponding loudspeaker in the flat-panel loudspeaker array; and
a playback module, configured to activate, when the remote user speaks, the loudspeaker corresponding to the speaker for playback.
7. The apparatus according to claim 5, characterised in that, when the loudspeakers are arranged one above the other, the playback control unit comprises:
a height calculation module, configured to display the remote users' images one above the other, calculate the vertical distance from the centre of the remote user's head position to the centre of the displayed image, and compute the ratio of the vertical distance to the total height of the displayed image; and
a vertical playback module, configured to adjust the volumes of the upper and lower loudspeakers according to their volume difference and then perform playback, where the difference between the upper-loudspeaker volume and the lower-loudspeaker volume = 8X × (0.5 − the ratio of the vertical distance to the total height of the displayed image) dB, and X is a sound coefficient set by the user.
8. The apparatus according to claim 5, characterised in that, when the loudspeakers are arranged to the left and right, the playback control unit comprises:
a width calculation module, configured to display the remote users' images side by side, calculate the horizontal distance from the centre of the remote user's head position to the centre of the displayed image, and compute the ratio of the horizontal distance to the total width of the displayed image; and
a horizontal playback module, configured to adjust the volumes of the left and right loudspeakers according to their volume difference and then perform playback, where the difference between the left-loudspeaker volume and the right-loudspeaker volume = 8X × (0.5 − the ratio of the horizontal distance to the total width of the displayed image) dB, and X is a sound coefficient set by the user.
9. A system for implementing video communication, characterised in that the system comprises a remote device, a local device and a multipoint control unit media server;
the remote device is configured to capture video and audio data of a remote user and send it to the media server;
the media server is configured to exchange the video and audio data between the remote device and the local device; and
the local device is configured to determine, after a local user establishes a connection with the remote user, a loudspeaker playback mode corresponding to the remote user from the obtained head position information of the remote user and, when the remote user speaks, to perform playback according to the loudspeaker playback mode corresponding to the speaker.
10. A video communication system, characterised in that the system comprises a remote device, a local device and a multipoint control unit media server;
the remote device is configured to capture video and audio data of a remote user and send it to the media server;
the media server is configured to exchange the video and audio data between the remote device and the local device, and to determine, after a local user establishes a connection with the remote user, a loudspeaker playback mode corresponding to the remote user from the obtained head position information of the remote user and, when the remote user speaks, to send a playback command to the local device according to the loudspeaker playback mode corresponding to the speaker; and
the local device is configured to control a local playback apparatus to play back according to the playback command.
PCT/CN2011/072198 2010-03-30 2011-03-28 Realization method and apparatus for video communication WO2011120407A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010137021.X 2010-03-30
CN 201010137021 CN102209225B (en) 2010-03-30 2010-03-30 Method and device for realizing video communication

Publications (1)

Publication Number Publication Date
WO2011120407A1 true WO2011120407A1 (en) 2011-10-06

Family

ID=44697862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/072198 WO2011120407A1 (en) 2010-03-30 2011-03-28 Realization method and apparatus for video communication

Country Status (2)

Country Link
CN (1) CN102209225B (en)
WO (1) WO2011120407A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103888274A (en) * 2014-03-18 2014-06-25 华为技术有限公司 Communication method and device of sub meetings in virtual meeting
CN104270552A (en) * 2014-08-29 2015-01-07 华为技术有限公司 Sound image playing method and device
CN106774830B (en) * 2016-11-16 2020-04-14 网易(杭州)网络有限公司 Virtual reality system, voice interaction method and device
CN110049409B (en) * 2019-04-30 2021-02-19 中国联合网络通信集团有限公司 Dynamic stereo adjusting method and device for holographic image
CN112584299A (en) * 2020-12-09 2021-03-30 重庆邮电大学 Immersive conference system based on multi-excitation flat panel speaker


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6829018B2 (en) * 2001-09-17 2004-12-07 Koninklijke Philips Electronics N.V. Three-dimensional sound creation assisted by visual information
CN101330585A (en) * 2007-06-20 2008-12-24 深圳Tcl新技术有限公司 Method and system for positioning sound
CN101459797B (en) * 2007-12-14 2012-02-01 深圳Tcl新技术有限公司 Sound positioning method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1929593A (en) * 2005-09-07 2007-03-14 宝利通公司 Spatially correlated audio in multipoint videoconferencing
US20070097222A1 (en) * 2005-10-27 2007-05-03 Takeshi Makita Information processing apparatus and control method thereof
CN1984310A (en) * 2005-11-08 2007-06-20 Tcl通讯科技控股有限公司 Method and communication apparatus for reproducing a moving picture, and use in a videoconference system
CN101427574A (en) * 2006-04-20 2009-05-06 思科技术公司 System and method for providing location specific sound in a telepresence system
CN101132516A (en) * 2007-09-28 2008-02-27 深圳华为通信技术有限公司 Method, system for video communication and device used for the same

Also Published As

Publication number Publication date
CN102209225B (en) 2013-04-17
CN102209225A (en) 2011-10-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11761981

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11761981

Country of ref document: EP

Kind code of ref document: A1