WO2014026478A1 - Video conference signal processing method, video conference server and video conference system

Info

Publication number: WO2014026478A1
Application number: PCT/CN2013/072264
Authority: WO
Inventors: 郑瑞琴
Original assignee: 华为技术有限公司
Priority date: 2012-08-16
Filing date: 2013-03-07
Publication date: 2014-02-20
Also published as: CN102833520A

Abstract

Disclosed are a video conference signal processing method, a video conference server and a video conference system. The method comprises: sending a video stream of a selected conference hall to a first conference hall terminal according to a selection instruction sent by the first conference hall terminal; determining whether a first mixed audio stream currently played by the first conference hall terminal comprises the audio stream of the selected conference hall; and if not, generating a second mixed audio stream and sending it to the first conference hall terminal, the second mixed audio stream comprising the audio streams of some or all of the selected conference halls. The audio played by the first conference hall terminal thus follows the video it displays, which mitigates the mismatch between audio and video to a certain degree and improves user experience.

Description

Method for processing video conference signal, video conference server and system

This application claims priority to Chinese Patent Application No. 201210292177.4, filed with the Chinese Patent Office on August 16, 2012 and entitled "A video conference signal processing method and a video conference server", which is incorporated herein by reference in its entirety.

Technical field

The present invention relates to the field of communications technologies, and in particular, to a video conference signal processing method, a video conference server, and a system.

Background

Currently, a video conferencing system generally includes a conference server and site terminals. Figure 1 shows an example in which a multipoint control unit (MCU) serves as the conference server. Each site in the video conferencing system has at least one site terminal. Each site terminal captures the sound and image of its site, encodes them, and sends them to the MCU. The MCU processes the sound and image according to a certain processing method, for example mixing the sound and forwarding the images or composing them into a multi-picture, and sends the processed sound and image to the other site terminals in the video conference. The other site terminals decode and output the sound and image of the remote sites to implement remote video communication.

At present, the MCU's audio mixing in a video conference works by mixing the audio of the N sites with the highest volume in the conference, so that all participating terminals hear the same sound. However, the following situation sometimes arises in current video conferencing systems: a site terminal is watching the video of site A, but because the volume of site A is not among the N loudest, the sound of site A cannot be heard. This mismatch between audio and video degrades the video conferencing experience.

Summary of the invention

Embodiments of the present invention provide a video conference signal processing method, a video conference server, and a system, to mitigate the mismatch between audio and video in a video conference, so that the sound heard at a site follows the video seen at that site, and the video conferencing experience is improved.

To solve the above technical problem, the embodiments of the present invention provide the following technical solutions. An embodiment of the present invention provides a method for processing a video conference signal, including:

The conference server receives a site selection command sent by the first site terminal;

The conference server sends a video stream to the first site terminal according to the site selection command, where the video stream includes the video stream corresponding to the site that the first site terminal selects to view;

If the first mixed audio stream currently played by the first site terminal does not include the audio stream corresponding to the site that the first site terminal selects to view, the conference server generates a second mixed audio stream and sends the second mixed audio stream to the first site terminal, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal selects to view.

The embodiment of the invention further provides a video conference server, including:

a receiving module, configured to receive a site selection command sent by the first site terminal, and pass the command to the video stream sending module and the audio stream sending module, respectively;

a video stream sending module, configured to send a video stream to the first site terminal, where the video stream includes the video stream corresponding to the site that the first site terminal selects to view;

an audio stream sending module, configured to: if the first mixed audio stream currently played by the first site terminal does not include the audio stream corresponding to the site that the first site terminal selects to view, generate a second mixed audio stream and send it to the first site terminal, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal selects to view.

The embodiment of the invention further provides a video conference system, including:

a conference server, configured to: receive a site selection command sent by the first site terminal; send, to the first site terminal, the video stream corresponding to the site that the first site terminal selects to view; and, if the first mixed audio stream currently played by the first site terminal does not include the audio stream corresponding to the site that the first site terminal selects to view, generate a second mixed audio stream and send it to the first site terminal, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal selects to view;

a first site terminal, configured to: send a site selection command to the conference server; receive, from the conference server, the video stream corresponding to the site that the first site terminal selects to view and a second mixed audio stream, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal selects to view; and play the video stream and the second mixed audio stream.

As can be seen from the above, because the audio streams of some or all of the sites that the first site terminal selects to view are added to the second mixed audio stream sent to the first site terminal, the mismatch between audio and video is mitigated to some extent, which improves the user experience.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a video conference system of the prior art;

FIG. 2 is a schematic flowchart of a method for processing a video conference signal according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a first type of application scenario of a video conference signal processing method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a second type of application scenario of a video conference signal processing method according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a conference server according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of another conference server according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a video conference system according to an embodiment of the present invention.

The embodiments of the present invention provide a video conference signal processing method, a video conference server, and a system, so as to enhance the matching of audio and video in a video conference and improve the video conference experience.

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

An embodiment of the present invention provides a method for processing a video conference signal, including:

The conference server receives the site selection command sent by the first site terminal;

If the first mixed audio stream currently played by the first site terminal does not include the audio stream corresponding to the site that the first site terminal selects to view, the conference server generates a second mixed audio stream and sends the second mixed audio stream to the first site terminal, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal selects to view.

It can be seen that, in this embodiment, some or all of the audio streams corresponding to the sites selected for viewing are added to the audio played by the first site terminal, so that the audio played by the first site terminal follows the video stream, which mitigates the mismatch between audio and video to some extent and enhances the user experience.

An embodiment of the present invention further provides a method for video conference signal processing. Referring to FIG. 2, the method includes the following content:

S101. The conference server receives a command, sent by the first site terminal, to select N1 sites for viewing. The site selection command sent by the first site terminal may be an instruction to select a single site or an instruction to select multiple sites; the first site terminal may be a communication terminal such as a PC or a mobile phone; and N1 is any integer greater than or equal to 1.

S102. The conference server sends a video stream to the first site terminal according to the command to select N1 sites, where the video stream includes the video streams corresponding to the N1 sites selected by the first site terminal.

Before sending the video stream to the first site terminal, the conference server can splice the video streams according to a built-in video processing policy and then send the spliced video stream to the first site terminal.

S103. If the first mixed audio stream currently played by the first site terminal does not include the audio stream corresponding to the site that the first site terminal selects to view, the conference server generates a second mixed audio stream and sends the second mixed audio stream to the first site terminal, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal selects to view.

It should be noted that the first site terminal plays the first mixed audio stream before it issues the selection command. After the first site terminal issues the selection command, the conference server determines whether the currently played first mixed audio stream includes the audio stream corresponding to the selected site. If it does not, the conference server sends a second mixed audio stream that includes the audio stream corresponding to the selected site to the first site terminal; if it does, no second mixed audio stream needs to be sent.
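The check described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `needs_second_mix` and the representation of audio streams by site labels are assumptions. The text does not say how a partial overlap is handled when several sites are selected, so the sketch assumes that the first mix is insufficient as soon as any selected site is missing from it.

```python
def needs_second_mix(first_mix_sites, selected_sites):
    """Return True if at least one selected site is absent from the first mix.

    Both arguments are iterables of site labels (a hypothetical representation
    of "the audio stream corresponding to a site").
    """
    return not set(selected_sites).issubset(first_mix_sites)


# Site E is selected but the current mix only carries B, C and D.
print(needs_second_mix(["B", "C", "D"], ["E"]))
# Site E is already part of the current mix, so nothing has to be resent.
print(needs_second_mix(["B", "C", "E"], ["E"]))
```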

If the first site terminal sends an instruction to select a single site to the conference server, the conference server either transmits all audio streams included in the second mixed audio stream to the first site terminal in the same channel, or transmits the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream to the first site terminal in different channels.

If the first site terminal sends an instruction to select multiple sites to the conference server, the conference server sends to the first site terminal a second mixed audio stream that includes the audio streams of the N4 sites with the highest volume among the selected sites. The conference server may also transmit the audio streams of those N4 sites and the other audio streams in the second mixed audio stream to the first site terminal in different channels, where N4 is less than or equal to the number of sites selected for viewing.
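The two channel layouts can be illustrated with a small sketch. Everything here is an assumption made for illustration: audio streams are modeled as equal-length lists of float samples in the range -1.0 to 1.0, mixing is a plain sum with clipping, and the names `mix` and `build_channels` are hypothetical. The source does not specify a mixing algorithm or a channel format.

```python
def mix(streams):
    """Sum equal-length sample lists and clip each sample to [-1.0, 1.0]."""
    if not streams:
        return []
    return [max(-1.0, min(1.0, sum(samples))) for samples in zip(*streams)]


def build_channels(selected, others, separate_channels):
    """Return the list of output channels for the second mixed audio stream.

    selected: sample lists of the sites chosen for viewing
    others:   sample lists of the remaining streams in the second mix
    """
    if separate_channels:
        # Selected-site audio in one channel, the other audio in another.
        return [mix(selected), mix(others)]
    # Everything mixed into the same channel.
    return [mix(selected + others)]


selected = [[0.5, 0.5]]
others = [[0.25, -0.25]]
print(build_channels(selected, others, separate_channels=False))
print(build_channels(selected, others, separate_channels=True))
```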

The first mixed audio stream is the audio stream that the first site terminal is playing before the current site selection command is sent, and the first mixed audio stream is obtained by mixing N3 audio streams.

After the conference server sends the second mixed audio stream to the first site terminal, the audio stream played by the first site terminal becomes the second mixed audio stream; that is, the second mixed audio stream replaces the first mixed audio stream as the audio stream currently played by the site terminal.

It can be seen that the audio streams of some or all of the sites selected by the first site terminal are added to the second mixed audio stream sent to the first site terminal, which mitigates the mismatch between audio and video to some extent and enhances the user experience.

The conference server can use multiple strategies to generate the second mixed audio stream. Strategies 1 to 4 below are examples.

Strategy 1: The second mixed audio stream sent by the conference server to the first site terminal includes only some or all of the audio streams of the sites selected for viewing, as follows.

The conference server mixes the N4 audio streams corresponding to N4 of the N1 sites selected by the first site terminal to obtain the second mixed audio stream, where N1 and N4 are both integers greater than or equal to 1, and N4 is less than or equal to N1;

Strategy 2: The second mixed audio stream sent by the conference server to the first site terminal includes not only some or all of the audio streams of the sites selected for viewing, but also the audio streams corresponding to the N2 sites with the highest volume in the conference, and the number of audio streams included in the second mixed audio stream equals the number of audio streams included in the first mixed audio stream, as described below.

The conference server mixes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal with the N2 audio streams corresponding to the N2 sites with the highest volume in the conference to obtain the second mixed audio stream, where the first mixed audio stream played by the first site terminal before it sends the site selection command is obtained by mixing N3 audio streams, and N3 = N4 + N2;

Strategy 3: The second mixed audio stream sent by the conference server to the first site terminal includes not only some or all of the audio streams of the sites selected for viewing, but also the audio streams contained in the first mixed audio stream. In this case, the second mixed audio stream contains more audio streams than the first mixed audio stream, as described below.

The conference server mixes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal with the N3 audio streams from which the first mixed audio stream was mixed, to obtain the second mixed audio stream;

Strategy 4: The second mixed audio stream sent by the conference server to the first site terminal includes not only some or all of the audio streams of the sites selected for viewing, but also the audio streams corresponding to the N3 sites with the highest volume in the conference, where N3 equals the number of audio streams included in the first mixed audio stream. In this case, the second mixed audio stream includes more audio streams than the first mixed audio stream, as described below.

The conference server mixes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal with the N3 audio streams corresponding to the N3 sites with the highest volume in the conference to obtain the second mixed audio stream.
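The four strategies differ only in which sites contribute audio to the second mix, which the following sketch tries to make concrete. It is an illustration under stated assumptions, not the patented implementation: sites are represented by labels, `ranking` is a hypothetical list of all sites ordered from loudest to quietest, the helper names are invented, the viewing site itself is not excluded from the ranking, and a site that qualifies twice is mixed only once.

```python
def loudest(sites, ranking, count):
    """Return the `count` loudest members of `sites`; `ranking` is loudest first."""
    return [site for site in ranking if site in sites][:count]


def second_mix_sites(strategy, selected, ranking, first_mix, n4, n2=0):
    """Return the sites whose audio streams are mixed into the second stream."""
    chosen = loudest(selected, ranking, n4)      # N4 loudest selected sites
    if strategy == 1:
        extra = []                               # selected sites only
    elif strategy == 2:
        extra = ranking[:n2]                     # plus the N2 loudest sites overall
    elif strategy == 3:
        extra = list(first_mix)                  # plus the sites of the first mix
    elif strategy == 4:
        extra = ranking[:len(first_mix)]         # plus the N3 loudest sites overall
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # dict.fromkeys drops duplicates while keeping the first occurrence order.
    return list(dict.fromkeys(chosen + extra))


ranking = ["B", "C", "D", "A", "E"]              # loudest first
print(second_mix_sites(2, ["E"], ranking, ["B", "C", "D"], n4=1, n2=2))
print(second_mix_sites(3, ["E"], ranking, ["B", "C"], n4=1))
```

For strategy 2 the caller is expected to pick `n2` so that N4 + N2 equals the number of streams in the first mix; the sketch does not enforce this.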

In practical applications, N2, N3 and N4 can be configured in the background. The information sent by the site terminal to the conference server includes, but is not limited to, the number of sites, the audio data of each site, and the mixing mode. The mixing mode may be either to mix the conference sound and the sound of the selected sites directly, or to place the conference sound and the sound of the selected sites in different channels for output. The audio data of each site includes audio stream parameters such as the site number, the audio stream identifier, and the audio gain.
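One possible way to model the information listed above is sketched below. The class and field names (`MixingMode`, `SiteAudio`, `SelectionInfo`, `stream_id` and so on) are hypothetical; the source names the kinds of data carried but does not define a message format or data types.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class MixingMode(Enum):
    DIRECT = "direct"                  # conference sound and selected-site sound mixed together
    SEPARATE_CHANNELS = "separate"     # conference sound and selected-site sound in different channels


@dataclass
class SiteAudio:
    site_number: int                   # site number
    stream_id: str                     # audio stream identifier
    gain: float = 1.0                  # audio gain (assumed to be a linear multiplier)


@dataclass
class SelectionInfo:
    mixing_mode: MixingMode
    sites: List[SiteAudio] = field(default_factory=list)

    @property
    def site_count(self) -> int:
        """Number of sites carried in this message."""
        return len(self.sites)


info = SelectionInfo(MixingMode.SEPARATE_CHANNELS, [SiteAudio(5, "audio-E")])
print(info.site_count, info.mixing_mode.value, info.sites[0].gain)
```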

To aid understanding of the above solution, the embodiments of the present invention provide two types of application scenarios of the video conference signal processing method. In the first type, the conference server receives an instruction, sent by a site terminal, to select a single site; as shown in FIG. 3, this type includes application scenarios 1, 2, 3 and 4. In the second type, the conference server receives an instruction, sent by a site terminal, to select multiple sites; as shown in FIG. 4, this type includes application scenarios 5, 6, 7, 8 and 9.

Application scenario 1:

The conference server sends only the audio stream corresponding to the single site selected by the first site terminal to the first site terminal.

Assume that the sites in the current video conferencing system are site A, site B, site C, site D, and site E. The first mixed audio stream currently played is obtained by mixing the audio streams corresponding to site B and site D.

After the MCU serving as the conference server receives the command, sent by site A, to select site E for viewing, the MCU sends the video stream corresponding to site E to site A. In this case, the first mixed audio stream does not include the audio stream corresponding to site E, so the conference server sends the audio stream corresponding to site E to site A.

In this case, site A hears the sound of site E while seeing the video of site E, so the sound follows the image.

Application scenario 2:

The conference server mixes the audio stream corresponding to the single site selected by the first site terminal with the N2 audio streams corresponding to the N2 sites with the highest volume in the conference and sends the result to the first site terminal, where N2+1 equals the number of audio streams in the first mixed audio stream.

Assume that the sites in the current conference are site A, site B, site C, site D, and site E, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, and site E. The first mixed audio stream currently played is mixed from the audio streams corresponding to three sites: site B, site C, and site D. After the MCU serving as the conference server receives the command, sent by the site terminal of site A, to select site E for viewing, the MCU sends the video stream corresponding to site E to site A. Because the first mixed audio stream does not contain the audio stream corresponding to the selected site E, the MCU mixes the audio streams corresponding to the two loudest sites in the conference, site B and site C, with the audio stream corresponding to the selected site E to obtain a second mixed audio stream, and sends the second mixed audio stream to site A. The second mixed audio stream and the first mixed audio stream contain the same number of audio streams, namely three.

In this case, site A sees the images of site B, site C, and site E and hears the sounds of site B, site C, and site E, so the sound follows the image.

Application scenario 3:

The conference server adds the audio stream corresponding to the single site selected by the first site terminal to the audio streams of the first mixed audio stream and sends the result to the first site terminal.

Assume that the sites in the current conference are site A, site B, site C, site D, and site E, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, and site E. The first mixed audio stream currently played is mixed from the audio streams corresponding to two sites, site B and site C;

After the MCU serving as the conference server receives the command, sent by the site terminal of site A, to select site E for viewing, the MCU sends the video stream corresponding to site E to site A. Because the first mixed audio stream does not include the audio stream corresponding to the selected site E, the MCU mixes the audio stream corresponding to site E with the audio streams of the first mixed audio stream, that is, it mixes the audio streams corresponding to site E, site B, and site C, to obtain the second mixed audio stream, and sends it to site A.

Application scenario 4:

The conference server mixes the audio stream corresponding to the single site selected by the first site terminal with the N3 audio streams corresponding to the N3 sites with the highest volume in the conference and sends the result to the first site terminal, where N3 equals the number of audio streams contained in the first mixed audio stream.

Assume that the sites in the current conference are site A, site B, site C, site D, and site E, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, and site E. The first mixed audio stream currently played is mixed from the audio streams corresponding to two sites, site B and site D;

After the MCU serving as the conference server receives the command, sent by the site terminal of site A, to select site E for viewing, the MCU sends the video stream corresponding to site E to site A. Because the first mixed audio stream does not include the audio stream of the selected site E, the MCU mixes the audio streams corresponding to the 2 loudest sites in the conference (2 being the number of audio streams in the first mixed audio stream), namely site B and site C, with the audio stream corresponding to the selected site E to obtain the second mixed audio stream, and sends it to site A.

Application scenario 5:

The conference server mixes the audio streams corresponding to the multiple sites selected by the first site terminal and sends the result to the first site terminal.

Assume that the sites in the current video conferencing system are site A, site B, site C, site D, site E, and site F, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, site E, and site F. The first mixed audio stream currently played is mixed from the audio streams corresponding to two sites, site B and site D;

After the MCU serving as the conference server receives the command, sent by site A, to select site C and site E for viewing, the MCU sends the corresponding video streams to site A. Because the first mixed audio stream does not include the audio streams corresponding to site C and site E, the MCU mixes the audio streams corresponding to site C and site E to obtain a second mixed audio stream and sends the second mixed audio stream to site A.

In this case, site A sees the images of site C and site E and hears the sounds of site C and site E, so the sound follows the image.

Application scenario 6:

The conference server mixes the N1 audio streams corresponding to the N1 sites selected by the first site terminal with the N2 audio streams corresponding to the N2 sites with the highest volume in the conference and sends the result to the first site terminal, where N1+N2 equals the number of audio streams in the first mixed audio stream.

Assume that the sites in the current video conferencing system are site A, site B, site C, site D, site E, and site F, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, site E, and site F. The first mixed audio stream currently played is mixed from the audio streams corresponding to three sites: site B, site D, and site F;

After the MCU serving as the conference server receives the command, sent by site A, to select site C and site E for viewing, the MCU splices the video streams corresponding to site C and site E and sends the spliced video stream to site A. Because the first mixed audio stream does not include the audio streams corresponding to site C and site E, the MCU mixes the audio stream of the loudest site in the conference, site B, with the audio streams of the selected sites C and E to obtain a second mixed audio stream, and sends the second mixed audio stream to site A. The second mixed audio stream and the first mixed audio stream contain the same number of audio streams, namely three.

Application scenario 7:

The conference server mixes the audio streams corresponding to the multiple sites selected by the first site terminal with the audio streams of the first mixed audio stream and sends the result to the first site terminal.

Assume that the sites in the current video conferencing system are site A, site B, site C, site D, site E, and site F, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, site E, and site F. The first mixed audio stream currently played is mixed from the audio streams corresponding to three sites: site B, site D, and site F;

After the MCU serving as the conference server receives the command, sent by site A, to select site C and site E for viewing, the MCU splices the video streams corresponding to site C and site E and sends the spliced video stream to site A. Because the first mixed audio stream does not include the audio streams corresponding to the selected sites C and E, the MCU mixes the audio streams of the first mixed audio stream with the audio streams corresponding to the selected sites, that is, it mixes the audio streams of site B, site D, site F, site C, and site E, to obtain a second mixed audio stream, and sends the second mixed audio stream to site A.

In this case, site A sees the images of site B, site D, site F, site C, and site E and hears the sounds of site B, site D, site F, site C, and site E, so the sound follows the image.

Application scenario 8:

The conference server mixes the audio streams corresponding to the multiple sites selected by the first site terminal with the N3 audio streams corresponding to the N3 sites with the highest volume in the conference and sends the result to the first site terminal, where N3 equals the number of audio streams contained in the first mixed audio stream.

After the MCU serving as the conference server receives the command, sent by site A, to select site C and site E for viewing, the MCU splices the video streams corresponding to site C and site E and sends the spliced video stream to site A. Because the first mixed audio stream does not include the audio streams corresponding to site C and site E, the MCU takes the audio streams of the loudest sites in the conference, namely site B, site C, and site D, together with the audio streams corresponding to the selected sites C and E. Because site C appears in both groups, the audio streams actually mixed are those of site B, site C, site D, and site E. The resulting second mixed audio stream is sent to site A.

In this case, site A sees the images of site B, site C, site D, and site E and hears the sounds of site B, site C, site D, and site E, so the sound follows the image.

Application scenario 9:

The conference server takes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal, mixes them with the audio streams corresponding to the N2 sites with the highest volume in the conference, and sends the result to the first site terminal.

Assume that the sites in the current video conferencing system are site A, site B, site C, site D, site E, and site F, and that the sites ranked by volume from highest to lowest are site B, site C, site D, site A, site E, and site F. The first mixed audio stream currently played is mixed from the audio streams corresponding to site B and site C.

After the MCU serving as the conference server receives the command, sent by the site terminal of site A, to select site D, site E, and site F for viewing, the MCU splices the video streams corresponding to site D, site E, and site F and sends the spliced video stream to site A. Because the first mixed audio stream does not include the audio streams corresponding to the selected sites D, E, and F, the MCU takes the audio streams of the two loudest of the selected sites, namely site D and site E, and the audio streams of the two loudest sites in the conference, namely site B and site C, and mixes them. That is, the audio streams of site B, site C, site D, and site E are mixed to obtain a second mixed audio stream, which is then sent to site A.
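The selection in application scenario 9 can be traced with a few lines of code. This is only a worked illustration of the example above: the variable names are hypothetical, sites are represented by their letters, and N4 = N2 = 2 as in the scenario.

```python
# Sites ranked by volume, loudest first, as assumed in the scenario.
ranking = ["B", "C", "D", "A", "E", "F"]
selected = {"D", "E", "F"}      # sites that site A chose to view (N1 = 3)
n4, n2 = 2, 2

# The N4 loudest sites among the selected ones.
loudest_selected = [site for site in ranking if site in selected][:n4]
# The N2 loudest sites in the whole conference.
loudest_overall = ranking[:n2]

second_mix = sorted(set(loudest_selected + loudest_overall))
print(loudest_selected)   # ['D', 'E']
print(loudest_overall)    # ['B', 'C']
print(second_mix)         # ['B', 'C', 'D', 'E']
```

Site F is selected for viewing but is left out of the mix because it is the quietest of the three selected sites.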

In this case, site A sees the images of site C, site D, and site E and hears the sounds of site B, site C, site D, and site E, so the sound follows the image.

In addition, in current multipoint video conferences, the conference sound and the sound of the viewed sites are sometimes mixed together and may interfere with each other, making them hard to hear clearly. Therefore, an audio processing strategy may be added, in which different strategies are used to adjust the audio gain of specific sites, so that the sites interfere with each other less and the user can hear the sound of the sites the user cares about.

The audio gain processing uses a configurable strategy that sets the audio gain according to the resolution, bandwidth, frame rate, importance, volume, and so on of each site. Specifically, the audio gain can be adjusted using one of the following solutions.

Solution 1: The conference server obtains the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal, and the N2 audio streams corresponding to the N2 sites with the highest volume in the conference. It increases the gain of the audio stream corresponding to one or more of the obtained N4 sites, or reduces the gain of the audio stream corresponding to one or more of the obtained N2 sites, or both increases the former and reduces the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N2 sites with the highest volume in the conference, where N4 is less than or equal to N1;

Solution 2: The conference server obtains the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal, and the N3 audio streams corresponding to the first mixed audio stream. It increases the gain of the audio stream corresponding to one or more of the obtained N4 sites, or reduces the gain of the audio stream corresponding to one or more of the N3 sites corresponding to the first mixed audio stream, or both increases the former and reduces the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound of the N3 audio streams corresponding to the first mixed audio stream, where N4 is less than or equal to N1;

Solution 3: The conference server obtains the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal, and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference. It increases the gain of the audio stream corresponding to one or more of the obtained N4 sites, or reduces the gain of the audio stream corresponding to one or more of the obtained N3 sites, or both increases the former and reduces the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N3 sites with the highest volume in the conference, where N4 is less than or equal to N1.
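One possible reading of the "increase the gain" branch shared by these solutions is sketched below in Python. It is not the claimed implementation: the helper names `rms` and `boost_selected`, the use of the RMS level as the measure of how loud a stream is, and the sample values are all assumptions made for illustration. The default ratio of 1.2 is the lower end of the 1.2 to 1.5 range that this description mentions for the selected sites.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of PCM samples (floats)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def boost_selected(selected, others, ratio=1.2):
    """Raise the gain of selected-site streams that are too quiet.

    selected, others: dict mapping site name -> list of float samples
                      (hypothetical in-memory layout)
    ratio: each selected stream should reach at least `ratio` times the
           level of the loudest stream in `others`
    Returns a new dict of selected streams; the inputs are not modified.
    """
    reference = max(rms(samples) for samples in others.values())
    boosted = {}
    for site, samples in selected.items():
        level = rms(samples)
        # Only ever raise the gain; a silent stream is left untouched
        gain = max(1.0, ratio * reference / level) if level > 0 else 1.0
        boosted[site] = [x * gain for x in samples]
    return boosted


selected = {"D": [0.1, -0.1, 0.1, -0.1], "E": [0.5, -0.5, 0.5, -0.5]}
others = {"B": [0.2, -0.2, 0.2, -0.2], "C": [0.3, -0.3, 0.3, -0.3]}
boosted = boost_selected(selected, others)
```

In this sketch site D is amplified until its level is 1.2 times that of site C, while site E is already loud enough and is left unchanged. A practical mixer would also need to guard against clipping after amplification, which the sketch omits.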

Further, in the second mixed audio stream played by the first site terminal, the sound of the sites that the first site terminal selects to view may be 1.2 to 1.5 times the other sounds in the second mixed audio stream.

To facilitate implementation of the technical solutions of the embodiments of the present invention, the embodiments of the present invention further provide a video conference server and a video conference system for implementing the foregoing solutions.

An embodiment of the present invention provides a video conference server, including:

The audio stream sending module may transmit the audio streams included in the generated second mixed audio stream to the first site terminal in the same channel; or it may transmit the audio stream corresponding to the selected site in the generated second mixed audio stream and the other audio streams in the second mixed audio stream to the first site terminal in different channels.

It can be seen that, because the audio stream sending module in the video conference server adds the audio streams of some or all of the sites selected by the first site terminal to the second mixed audio stream sent to the first site terminal, the audio stream played by the first site terminal follows the video stream, which reduces, to a certain extent, the mismatch between audio and video and improves the user experience. In addition, transmitting the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream in different channels can reduce mutual interference between the sounds of the sites and improve the quality of the audio played by the first site terminal.
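The two transmission options can be illustrated with a small Python sketch. It assumes that a "channel" means one channel of a multi-channel audio frame, here at most two, which this description does not spell out. The function names and the clamping of mixed samples to the range -1.0 to 1.0 are likewise assumptions.

```python
def mix(streams):
    """Sum equally long sample lists and clamp each sample to [-1.0, 1.0]."""
    if not streams:
        return []
    return [max(-1.0, min(1.0, sum(frame))) for frame in zip(*streams)]


def same_channel(selected, others):
    """Option 1: every stream is mixed into a single channel."""
    return [mix(selected + others)]


def different_channels(selected, others):
    """Option 2: selected-site audio and the remaining audio use separate channels."""
    return [mix(selected), mix(others)]


site_d = [0.25, 0.5]   # hypothetical samples of a selected site
site_b = [0.5, 0.75]   # hypothetical samples of another mixed site
mono = same_channel([site_d], [site_b])
stereo = different_channels([site_d], [site_b])
print(mono)    # [[0.75, 1.0]]
print(stereo)  # [[0.25, 0.5], [0.5, 0.75]]
```

In the single-channel case the second sample is clamped because the two sites add up to more than full scale. In the two-channel case each group keeps its own headroom, which is one way to see why separate channels can reduce interference between sites.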

An embodiment of the present invention further provides a video conference server, as shown in FIG. 5, including:

The receiving module 501 is configured to receive a site selection command sent by the first site terminal, and send the instruction to the video stream sending module 502 and the audio stream sending module 503, respectively;

The video stream sending module 502 is configured to send a video stream to the first site terminal, where the video stream includes the video stream corresponding to the site that the first site terminal selects to view;

The audio stream sending module 503 is configured to generate a second mixed audio stream if the first mixed audio stream currently played by the first site terminal does not include the audio stream corresponding to the site that the first site terminal selects to view, and to send the second mixed audio stream to the first site terminal, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites selected for viewing.
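The condition handled by the audio stream sending module can be sketched as follows in Python. The function names are hypothetical, and the sketch assumes that a second mixed audio stream is needed only when none of the selected sites is already part of the first mixed audio stream; the description leaves open how a partial overlap is treated.

```python
def needs_second_mix(first_mix_sites, selected_sites):
    """True when no selected site is already audible in the first mix (assumption)."""
    return set(first_mix_sites).isdisjoint(selected_sites)


def handle_selection(first_mix_sites, selected_sites):
    """Return (video_sites, audio_sites) to send after a site selection command."""
    video_sites = list(selected_sites)
    if needs_second_mix(first_mix_sites, selected_sites):
        # Simplest variant: the second mix carries all selected sites
        audio_sites = list(selected_sites)
    else:
        # The current mix already follows the video; keep it
        audio_sites = list(first_mix_sites)
    return video_sites, audio_sites


video, audio = handle_selection(["B", "C"], ["D", "E", "F"])
print(video, audio)  # ['D', 'E', 'F'] ['D', 'E', 'F']
```

When the first mix already contains a selected site, for example `handle_selection(["B", "C"], ["C"])`, the sketch leaves the audio unchanged.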

It should be noted that the audio stream sending module 503 is configured to transmit the audio streams included in the generated second mixed audio stream to the first site terminal in the same channel, or to transmit the audio stream corresponding to the selected site in the generated second mixed audio stream and the other audio streams in the second mixed audio stream to the first site terminal in different channels.

It can be seen that, because the audio stream sending module 503 in the video conference server adds the audio streams of some or all of the sites selected by the first site terminal to the second mixed audio stream sent to the first site terminal, the audio stream played by the first site terminal follows the video stream, which reduces, to a certain extent, the mismatch between audio and video and improves the user experience. In addition, transmitting the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream in different channels can reduce mutual interference between the sounds of the sites and improve the quality of the audio played by the first site terminal.

An embodiment of the present invention further provides a video conference server. Referring to FIG. 6, the server includes:

The receiving module 601 is configured to receive the site selection command sent by the first site terminal, and send the instruction to the video stream sending module 602 and the audio stream sending module 603, respectively;

The video stream sending module 602 is configured to send a video stream to the first site terminal, where the video stream includes the video stream corresponding to the site that the first site terminal selects to view;

The audio stream sending module 603 includes an acquiring module 603a and an audio gain processing module 603b. The acquiring module 603a is configured to obtain the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N2 audio streams corresponding to the N2 sites with the highest volume in the conference; or to obtain the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the first mixed audio stream; or to obtain the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference, where N4 is less than or equal to N1;

The audio gain processing module 603b is configured to increase the gain of the audio stream corresponding to one or more of the obtained N4 sites with the highest volume among the N1 sites selected by the first site terminal, or to reduce the gain of the audio stream corresponding to one or more of the obtained N2 sites with the highest volume in the conference, or to both increase the former and reduce the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N2 sites with the highest volume in the conference;

Or,

to reduce the gain of the audio stream corresponding to one or more of the N3 sites of the obtained first mixed audio stream, or to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal while reducing the gain of the audio stream corresponding to one or more of the N3 sites of the first mixed audio stream, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N3 sites of the first mixed audio stream;

Or,

to reduce the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume in the conference, or to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal while reducing the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume in the conference, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N3 sites with the highest volume in the conference.

The audio gain processing module 603b can adjust the sound corresponding to the N4 venues having the highest volume in the N1 conference sites selected by the first terminal to be 1.2 to 1.5 times the sound corresponding to the N2 conference venues having the highest volume in the conference venue;

Or, the audio gain processing module 603b adjusts the sound corresponding to the N4 conference sites with the largest volume of the N1 conferences selected by the first terminal to be 1.2-1.5 times of the sound corresponding to the N3 conference sites of the first mixed audio stream;

Alternatively, the audio gain processing module 603b adjusts the sound corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal to 1.2 to 1.5 times the sound corresponding to the N3 sites with the highest volume in the conference.

An embodiment of the present invention provides a video conference system, including:

The conference server may transmit the audio streams included in the generated second mixed audio stream to the first site terminal in the same channel; or it may transmit the audio stream corresponding to the site selected for viewing in the generated second mixed audio stream and the other audio streams in the second mixed audio stream to the first site terminal in different channels.

It can be seen that the conference server in the video conference system provided by this embodiment of the present invention adds the audio streams of some or all of the sites selected by the first site terminal to the second mixed audio stream sent to the first site terminal, so that the audio stream played by the first site terminal follows the video stream, which reduces, to a certain extent, the mismatch between audio and video and improves the user experience. In addition, transmitting the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream in different channels can reduce mutual interference between the sounds of the sites and improve the quality of the audio played by the first site terminal.

An embodiment of the present invention further provides another video conference system, as shown in FIG. 7, including:

a conference server 701, configured to receive a site selection command sent by the first site terminal 702; send, to the first site terminal 702, the video stream corresponding to the site that the first site terminal 702 selects to view; and, if the first mixed audio stream currently played by the first site terminal 702 does not include the audio stream corresponding to the site that the first site terminal 702 selects to view, generate a second mixed audio stream and send it to the first site terminal 702, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal 702 selects to view;

The first site terminal 702 is configured to send a site selection command to the conference server 701; receive, from the conference server 701, the video stream corresponding to the site that the first site terminal 702 selects to view and the second mixed audio stream, where the second mixed audio stream includes some or all of the audio streams corresponding to the sites that the first site terminal 702 selects to view; and play the video stream and the second mixed audio stream.

The generating, by the conference server 701, the second mixed audio stream includes:

The conference server 701 performs mixing processing on the N4 audio streams corresponding to the N4 conference sites with the highest volume among the N1 conference sites selected by the first site terminal, to obtain a second mixed audio stream, where N4 is smaller than or equal to N1;

Or,

The conference server 701 mixes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal with the N2 audio streams corresponding to the N2 sites with the highest volume in the conference to obtain a second mixed audio stream, where N4 is less than or equal to N1, the first mixed audio stream is obtained by mixing N3 audio streams, and N4 plus N2 is equal to N3;

Or,

The conference server 701 mixes the audio streams in a first audio stream set to obtain the second mixed audio stream. The first audio stream set includes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams from which the first mixed audio stream is obtained by mixing, where N4 is less than or equal to N1; or the first audio stream set includes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference, where N4 is less than or equal to N1.
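The alternatives above differ only in which audio streams are fed to the mixer. The following Python sketch lists the sources for each alternative. The option names, the function names, and the volume values are hypothetical, and dropping a site that appears in both groups is an assumption that the description does not state.

```python
def top_n(sites, volumes, n):
    """Return the n loudest sites from `sites`, loudest first."""
    return sorted(sites, key=lambda s: volumes[s], reverse=True)[:n]


def second_mix_sources(option, volumes, selected, first_mix, n4, n2=0, n3=0):
    """List the sites to mix for one of the alternatives (option names are made up).

    volumes:   dict site -> volume for the participating sites
    selected:  the N1 sites chosen for viewing
    first_mix: the N3 sites currently in the first mixed audio stream
    """
    chosen = top_n(selected, volumes, n4)
    if option == "selected_only":
        extra = []
    elif option == "selected_plus_n2":
        # Keeps the stream count of the first mix: N4 + N2 must equal N3
        if n4 + n2 != len(first_mix):
            raise ValueError("this option requires N4 + N2 == N3")
        extra = top_n(volumes, volumes, n2)
    elif option == "selected_plus_first_mix":
        extra = list(first_mix)
    elif option == "selected_plus_n3":
        extra = top_n(volumes, volumes, n3)
    else:
        raise ValueError(f"unknown option: {option}")
    # A site present in both groups is mixed only once (assumption)
    return chosen + [s for s in extra if s not in chosen]


volumes = {"B": 90, "C": 80, "D": 70, "E": 50, "F": 40}
selected, first_mix = ["D", "E", "F"], ["B", "C"]
print(second_mix_sources("selected_only", volumes, selected, first_mix, n4=2))
print(second_mix_sources("selected_plus_n2", volumes, selected, first_mix, n4=1, n2=1))
print(second_mix_sources("selected_plus_first_mix", volumes, selected, first_mix, n4=2))
```

The second alternative keeps the number of mixed streams equal to that of the first mixed audio stream, whereas the third alternative lets the number of mixed streams grow up to N4 plus N3.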

As can be seen from the above, the conference server 701 in the video conference system provided by this embodiment of the present invention adds the audio streams of some or all of the sites selected by the first site terminal to the second mixed audio stream sent to the first site terminal, so that the audio stream played by the first site terminal follows the video stream, which reduces, to a certain extent, the mismatch between audio and video and improves the user experience. In addition, transmitting the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream in different channels can reduce mutual interference between the sounds of the sites and improve the quality of the audio played by the first site terminal.

Further, an embodiment of the present invention provides another video conference system, which also includes a conference server and a first site terminal. The difference from the previous embodiment is that, in addition to the above functions, the conference server in this system can be used to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal, or to reduce the gain of the audio stream corresponding to one or more of the N2 sites with the highest volume in the conference, or to both increase the former and reduce the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N2 sites with the highest volume in the conference, where N4 is less than or equal to N1; or

It is used to reduce the gain of the audio stream corresponding to one or more of the N3 sites of the first mixed audio stream, or to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal while reducing the gain of the audio stream corresponding to one or more of the N3 sites of the first mixed audio stream, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N3 sites of the first mixed audio stream, where N4 is less than or equal to N1;

Or,

It is used to reduce the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume in the conference, or to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal while reducing the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume in the conference, so that the sound of the sites that the first site terminal selects to view is greater than the sound of the N3 sites with the highest volume in the conference, where N4 is less than or equal to N1.

It can be understood that the conference server 701 in this embodiment may be the conference server in the foregoing method embodiments, and that the functions of its functional modules may be implemented according to the methods in the foregoing method embodiments. For the specific implementation process, refer to the related descriptions in the foregoing method embodiments; details are not repeated here.

It should be noted that, for the foregoing method embodiments, for the sake of simple description, they are all expressed as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described action sequence. Because in accordance with the present invention, certain steps may be performed in other sequences or concurrently. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

The video conference signal processing method and conference server provided by the embodiments of the present invention have been described above. The description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, those skilled in the art may, according to the idea of the present invention, make changes to the specific implementations and the scope of application. In view of the above, the contents of this specification are not to be construed as limiting the invention.

Claims

1. A method for processing a video conference signal, characterized in that it comprises:

The conference server sends a video stream to the first site terminal according to the site selection command, wherein the video stream includes the video stream corresponding to the site that the first site terminal selects to view;

2. The method according to claim 1, wherein the generating, by the conference server, of the second mixed audio stream comprises:

The conference server mixes the N4 audio streams corresponding to the N4 sites with the highest volume in the N1 sites selected by the first site terminal to obtain a second mixed audio stream, where N4 is less than or equal to N1;

Or,

The conference server mixes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal with the N2 audio streams corresponding to the N2 sites with the highest volume in the conference to obtain the second mixed audio stream, wherein N4 is less than or equal to N1, the first mixed audio stream is obtained by mixing N3 audio streams, and N4 plus N2 is equal to N3;

Or,

The conference server mixes the audio streams in a first audio stream set to obtain the second mixed audio stream, wherein the first audio stream set includes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams from which the first mixed audio stream is obtained by mixing, wherein N4 is less than or equal to N1; or the first audio stream set includes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference, wherein N4 is less than or equal to N1.

3. The method according to claim 2, wherein the method further comprises: the conference server transmitting the audio streams included in the second mixed audio stream to the first site terminal in the same channel;

Or,

The conference server transmits the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream to the first site terminal in different channels.

4. The method according to claim 2, wherein the generating, by the conference server, of the second mixed audio stream is specifically:

The conference server obtains the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N2 audio streams corresponding to the N2 sites with the highest volume in the conference; increases the gain of the audio stream corresponding to one or more of the obtained N4 sites, or reduces the gain of the audio stream corresponding to one or more of the obtained N2 sites, or both increases the former and reduces the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N2 sites with the highest volume in the conference, wherein N4 is less than or equal to N1; and mixes the obtained N4 audio streams with the obtained N2 audio streams to obtain a second mixed audio stream;

Or,

The conference server obtains the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the first mixed audio stream; increases the gain of the audio stream corresponding to one or more of the obtained N4 sites, or reduces the gain of the audio stream corresponding to one or more of the N3 sites corresponding to the obtained first mixed audio stream, or both increases the former and reduces the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound of the N3 audio streams corresponding to the first mixed audio stream, wherein N4 is less than or equal to N1; and mixes the obtained N4 audio streams with the N3 audio streams corresponding to the obtained first mixed audio stream to obtain a second mixed audio stream;

Or,

The conference server obtains the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference; increases the gain of the audio stream corresponding to one or more of the obtained N4 sites, or reduces the gain of the audio stream corresponding to one or more of the obtained N3 sites, or both increases the former and reduces the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound of the N3 sites with the highest volume in the conference, wherein N4 is less than or equal to N1; and mixes the obtained N4 audio streams with the N3 audio streams corresponding to the N3 sites with the highest volume in the conference to obtain a second mixed audio stream.

5. The method according to claim 4, wherein, in the second mixed audio stream that is played, the sound corresponding to the N4 sites with the highest volume among the N1 sites selected for viewing by the first site terminal is 1.2 to 1.5 times the other sounds in the second mixed audio stream.

6. A video conference server, comprising:

a receiving module, configured to receive a site selection command sent by the first site terminal, and transmit the instruction to the video stream sending module and the audio stream sending module respectively;

7. The video conference server according to claim 6, wherein the generating, by the audio stream sending module, of the second mixed audio stream is specifically:

The audio stream sending module performs mixing processing on the N4 audio streams corresponding to the N4 sites with the highest volume in the N1 sites selected by the first site terminal, to obtain a second mixed audio stream, where N4 is less than or equal to N1;

Or,

The audio stream sending module mixes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal with the N2 audio streams corresponding to the N2 sites with the highest volume in the conference to obtain a second mixed audio stream, wherein N4 is less than or equal to N1, the first mixed audio stream is obtained by mixing N3 audio streams, and N4 plus N2 is equal to N3;

Or,

The audio stream sending module mixes the audio streams in a first audio stream set to obtain the second mixed audio stream, wherein the first audio stream set includes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams from which the first mixed audio stream is obtained by mixing, wherein N4 is less than or equal to N1; or the first audio stream set includes the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference, wherein N4 is less than or equal to N1.

8. The video conference server according to claim 7, wherein the audio stream sending module transmits the audio streams included in the generated second mixed audio stream to the first site terminal in the same channel;

Or,

The audio stream sending module transmits the audio stream corresponding to the selected conference site and the other audio streams in the second mixed audio stream to the first conference terminal in different channels.

9. The video conference server according to claim 7, wherein the audio stream sending module further comprises an acquiring module and an audio gain processing module;

The acquiring module is configured to obtain the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N2 audio streams corresponding to the N2 sites with the highest volume in the conference; or to obtain the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the first mixed audio stream; or to obtain the N4 audio streams corresponding to the N4 sites with the highest volume among the N1 sites selected by the first site terminal and the N3 audio streams corresponding to the N3 sites with the highest volume in the conference, wherein N4 is less than or equal to N1;

The audio gain processing module is configured to increase the gain of the audio stream corresponding to one or more of the obtained N4 sites with the highest volume among the N1 sites selected by the first site terminal, or to reduce the gain of the audio stream corresponding to one or more of the obtained N2 sites with the highest volume in the conference, or to both increase the former and reduce the latter, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N2 sites with the highest volume in the conference;

Or,

to reduce the gain of the audio stream corresponding to one or more of the N3 sites of the obtained first mixed audio stream, or to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal while reducing the gain of the audio stream corresponding to one or more of the N3 sites of the first mixed audio stream, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N3 sites of the first mixed audio stream;

Or,

to reduce the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume in the conference, or to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites selected by the first site terminal while reducing the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume in the conference, so that the sound of the sites that the first site terminal selects to view is greater than the sound corresponding to the N3 sites with the highest volume in the conference.

The video conference server according to claim 9, wherein the audio gain processing module adjusts the sound corresponding to the N4 venues having the highest volume among the N1 conference sites selected by the first terminal to be the highest volume in the conference venue. 1.2 to 1.5 times the sound corresponding to the N2 venues;

or, the audio gain processing module adjusts the sound corresponding to the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view to 1.2 to 1.5 times the sound corresponding to the N3 sites of the first mixed audio stream;

or, the audio gain processing module adjusts the sound corresponding to the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view to 1.2 to 1.5 times the sound corresponding to the N3 sites with the highest volume among the participating sites.
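As an illustration only, the ratio adjustment recited above can be sketched in Python. The claim does not say how "sound" is measured or how the gain is computed; this sketch assumes the level is the root-mean-square (RMS) of a block of samples and shows only the option of raising the gain of the selected site. The names `rms`, `gain_for_ratio`, and `apply_gain` are hypothetical and do not come from the application.

```python
import math


def rms(samples):
    """RMS level of a block of audio samples (assumed measure of 'sound')."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gain_for_ratio(selected, reference, ratio=1.2):
    """Gain that makes the RMS of `selected` equal `ratio` times the RMS of `reference`.

    `ratio` must lie in the 1.2 to 1.5 range recited in the claim.
    """
    if not 1.2 <= ratio <= 1.5:
        raise ValueError("ratio must be between 1.2 and 1.5")
    selected_level = rms(selected)
    if selected_level == 0.0:
        return 1.0  # a silent block cannot be scaled to a target level
    return ratio * rms(reference) / selected_level


def apply_gain(samples, gain):
    """Scale every sample by `gain` (no clipping protection in this sketch)."""
    return [s * gain for s in samples]


# Hypothetical sample blocks: a quiet selected site and a loud reference site.
selected_site = [0.1, -0.1, 0.1, -0.1]
loudest_site = [0.5, -0.5, 0.5, -0.5]
boosted = apply_gain(selected_site, gain_for_ratio(selected_site, loudest_site, 1.3))
```

The same target ratio could instead be reached by lowering the gain of the reference sites, or by combining both adjustments, which are the other alternatives recited in claim 9.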

11. A video conferencing system, comprising:

a first site terminal, configured to send a site selection instruction to the conference server, and to receive from the conference server a video stream corresponding to the first site terminal and a second mixed audio stream, where the second mixed audio stream includes part or all of the audio stream corresponding to the site that the first site terminal selects to view; Video stream and second mixed audio stream.

12. The video conference system according to claim 11, wherein the generating, by the conference server, of the second mixed audio stream includes:

Or,

13. The video conference system according to claim 12, wherein the conference server sends the audio streams included in the generated second mixed audio stream to the first site terminal in the same channel;

Or,

the conference server sends the audio stream corresponding to the selected site and the other audio streams in the second mixed audio stream to the first site terminal in different channels.
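The two transmission alternatives can be pictured with a small Python sketch. It is an assumption-laden illustration, not the claimed implementation: audio streams are modeled as equal-length lists of samples, mixing is a plain sample-by-sample sum, and the channel labels `ch0` and `ch1` and the function names are hypothetical.

```python
def mix(streams):
    """Sum equal-length sample blocks sample by sample."""
    return [sum(frame) for frame in zip(*streams)]


def pack_same_channel(selected, others):
    """Alternative 1: selected-site audio and the other audio share one channel."""
    return {"ch0": mix([selected] + others)}


def pack_separate_channels(selected, others):
    """Alternative 2: selected-site audio in one channel, the other audio in another."""
    # With no other streams, the second channel carries silence of equal length.
    rest = mix(others) if others else [0] * len(selected)
    return {"ch0": selected, "ch1": rest}


# Hypothetical sample blocks for one selected site and two other sites.
selected_site = [1, 2]
other_sites = [[10, 20], [100, 200]]
same = pack_same_channel(selected_site, other_sites)
separate = pack_separate_channels(selected_site, other_sites)
```

Sending the selected site in its own channel leaves the receiving terminal free to render it separately from the rest of the mix, whereas the single-channel form fixes the balance at the server.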

14. The video conference system according to claim 12, wherein the conference server is further configured to increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view, or to reduce the gain of the audio stream corresponding to one or more of the N2 sites with the highest volume among the participating sites, or to both increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view and reduce the gain of the audio stream corresponding to one or more of the N2 sites with the highest volume among the participating sites, so that the sound of the site that the first site terminal selects to view is greater than the sound corresponding to the N2 sites with the highest volume among the participating sites, where N4 is less than or equal to N1;

Or,

the conference server is further configured to reduce the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume among the participating sites, or to both increase the gain of the audio stream corresponding to one or more of the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view and reduce the gain of the audio stream corresponding to one or more of the N3 sites with the highest volume among the participating sites, so that the sound of the site that the first site terminal selects to view is greater than the sound corresponding to the N3 sites with the highest volume among the participating sites, where N4 is less than or equal to N1.

15. The video conference system according to claim 14, wherein the conference server is configured to adjust the sound corresponding to the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view to 1.2 to 1.5 times the sound corresponding to the N2 sites with the highest volume among the participating sites;

or, the conference server is configured to adjust the sound corresponding to the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view to 1.2 to 1.5 times the sound corresponding to the N3 sites of the first mixed audio stream;

or, the conference server is configured to adjust the sound corresponding to the N4 sites with the highest volume among the N1 sites that the first site terminal selects to view to 1.2 to 1.5 times the sound corresponding to the N3 sites with the highest volume among the participating sites.