WO2012072008A1 - Method and apparatus for superimposing auxiliary information of a video signal - Google Patents

Method and apparatus for superimposing auxiliary information of a video signal

Info

Publication number
WO2012072008A1
WO2012072008A1, PCT/CN2011/083005
Authority
WO
WIPO (PCT)
Prior art keywords
video
signal
information
audio signal
site
Prior art date
Application number
PCT/CN2011/083005
Other languages
English (en)
French (fr)
Inventor
詹五洲
王东琦
Original Assignee
华为终端有限公司 (Huawei Device Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司 (Huawei Device Co., Ltd.)
Publication of WO2012072008A1 publication Critical patent/WO2012072008A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Definitions

  • The present invention relates to the field of multi-screen video communication technologies, and in particular to a method and an apparatus for superimposing auxiliary information onto a video signal. Background Art
  • Telepresence is a teleconference technology developed in recent years that combines video communication with an immersive communication experience. With its life-size images, ultra-high definition and low latency, it aims at the effect of true face-to-face communication, gives a strong sense of realism and presence, and has been widely used in various video conferencing scenarios.
  • A telepresence conference system pays particular attention to the consistency of sound orientation and image, and can therefore meet the participants' requirements for both image and sound.
  • However, existing telepresence technology still has certain deficiencies. In a telepresence conference there may well be participants speaking different languages, so language barriers can arise between them, and hearing-impaired participants face a particular obstacle. Moreover, even among participants who share a language, lapses in attention or other objective reasons may leave a participant unable to follow the other party's speech. Considering these situations, if each participant's speech content were displayed as subtitles at the bottom of the screen in a telepresence conference, communication between the participants would be greatly facilitated.
  • A telepresence site, however, usually includes multiple screens, each displaying some of the participants of the remote site. If the subtitle display method of ordinary video conferencing were followed directly, it would not be known on which screen the subtitle information should be displayed. If the subtitles were simply shown on the middle screen, then whenever the speaker appears on the left or right screen the image and the subtitles would be displayed in inconsistent orientations, forcing local participants to choose between watching the speaker's image and reading the subtitles, which is inconvenient. Summary of the Invention
  • Embodiments of the present invention provide a method and an apparatus for superimposing auxiliary information of a video signal, which overcome the defect of existing telepresence conference technology that the display orientations of the subtitles and the image are inconsistent.
  • An embodiment of the present invention provides a method for superimposing auxiliary information of a video signal, including:
  • acquiring an audio signal of a first site and at least one video signal of the first site, where the at least one video signal contains multiple video objects in the first site; acquiring indication information, where the indication information is used to indicate the video area where the video object corresponding to the audio signal is located among the multiple video objects of the at least one video signal; and, according to the indication information, superimposing text information corresponding to the audio signal of the site with the video signal, so that the text information is displayed in the video area indicated by the indication information.
  • an embodiment of the present invention further provides an auxiliary information superimposing apparatus for a video signal, including:
  • a signal acquisition module configured to acquire an audio signal of the first site and at least one video signal of the first site, where the at least one video signal includes multiple video objects in the first site;
  • An indication information acquiring module configured to acquire indication information, where the indication information is used to indicate a video area where a video object corresponding to the audio signal is located in a plurality of video objects of the at least one video signal;
  • a signal superimposing module, configured to superimpose the text information corresponding to the audio signal of the first site with the video signal according to the indication information, so that the text information is displayed in the video area indicated by the indication information.
  • The method and apparatus for superimposing auxiliary information of a video signal provided by the embodiments of the present invention are applied in multi-screen video communication scenarios. Before the text information corresponding to the audio signal is superimposed with the video signal, indication information indicating the video area where the current audio signal lies is obtained, and the superposition is performed in that video area. When the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is therefore guaranteed to be displayed on the corresponding video object, ensuring the consistency of the display orientation of the image and the subtitles.
  • FIG. 1 is a flowchart of Embodiment 1 of a method for superimposing auxiliary information of a video signal according to an embodiment of the present invention;
  • FIG. 2 is a flowchart of Embodiment 2 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of a conference site to which the method for superimposing auxiliary information of a video signal is applied according to an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of the display effect of superimposed video signals on multiple screens according to an embodiment of the present invention;
  • FIG. 5 is a flowchart of Embodiment 3 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of Embodiment 1 of an auxiliary information superimposing apparatus for a video signal according to an embodiment of the present invention;
  • FIG. 7 is a schematic structural diagram of Embodiment 2 of the auxiliary information superimposing apparatus for a video signal according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of Embodiment 1 of a method for superimposing auxiliary information of a video signal according to an embodiment of the present invention. As shown in FIG. 1 , the method in this embodiment includes the following steps:
  • Step 100: Acquire an audio signal of the first site and at least one video signal that contains multiple video objects in the first site.
  • The method for superimposing auxiliary information of a video signal in the embodiment of the present invention may be applied to a telepresence conference system, or to other types of video communication systems in which a conference site is equipped with multiple display screens that respectively correspond to the multiple participants of the opposite site.
  • the steps in the embodiment of the present invention may be performed by an information superimposing device disposed in a conference site or a server.
  • Each site terminal is provided with multiple display screens for displaying the multiple participants of the peer site. Thus, once a site terminal has established a connection with the peer site, it sends the peer site at least one video signal containing the multiple participants of the local site, that is, the video objects of the local site, together with the current audio signal of the local site. The audio signal is produced by the participant who is currently speaking among the participants of the local site; specifically, the audio signal corresponds to one of the multiple video objects contained in the video signal.
  • Regarding the connection mode between sites: in the point-to-point mode, the local site and the peer site exchange signal data directly, while in the point-to-multipoint mode, signal transmission between sites is forwarded by a multipoint control unit (MCU). Accordingly, the information superimposing device may be set in a site terminal or in the MCU.
  • The embodiments of the present invention cover both deployments. The site that collects the current audio signal and the video signals and sends them to the peer site is referred to as the first site. In this step, regardless of whether the information superimposing device that performs the auxiliary information superimposition on the video signal is located at the first site, at the peer site opposite the first site, or in the MCU, it receives the audio signal of the first site and at least one video signal containing multiple video objects.
  • Step 101: Obtain indication information, where the indication information is used to indicate the video area where the video object corresponding to the acquired audio signal is located among the multiple video objects of the at least one video signal.
  • To facilitate communication between the participants of the sites, the caption information can be displayed under the image of the participant who is speaking.
  • To this end, the information superimposing device further obtains indication information indicating the video area where the video object corresponding to the audio signal is located in the video signal.
  • Step 102 Perform superimposition processing on the text information corresponding to the acquired audio signal and the video signal according to the indication information, so that the text information is displayed in the video area indicated by the indication information.
  • According to the indication information, the information superimposing device superimposes the text information corresponding to the audio signal with the video signal in the video area specified by the indication information, that is, together with the corresponding video object. When the superimposed video signal is displayed at the peer site, the participants there can see, around the image of the participant currently speaking at the first site, the subtitle information corresponding to the speech content, which ensures the consistency of the display orientation of the image and the subtitles.
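The superimposition step above can be sketched in code. This is a minimal illustration rather than the patent's implementation; the names (`Indication`, `superimpose_auxiliary_info`) and the `(x, y, w, h)` region format are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Indication:
    """Identifies which video signal, and which region within it,
    corresponds to the currently speaking participant."""
    signal_index: int   # which of the site's video signals
    region: tuple       # (x, y, w, h) of the speaker's image area

def superimpose_auxiliary_info(video_signals, indication, text):
    """Attach the caption text to the video signal and region named
    by the indication information, so a renderer can later draw it
    next to the speaker's image."""
    target = video_signals[indication.signal_index]
    target.setdefault("overlays", []).append(
        {"region": indication.region, "text": text}
    )
    return video_signals

# Usage: two screens; the speaker appears on the second one.
signals = [{"id": "left"}, {"id": "middle"}]
ind = Indication(signal_index=1, region=(0, 0, 640, 480))
out = superimpose_auxiliary_info(signals, ind, "Hello from site 1")
```

Only the signal named by the indication receives an overlay, which is what keeps the subtitle orientation consistent with the image.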
  • The method for superimposing auxiliary information of a video signal in this embodiment is applied to multi-image video conferencing scenarios: before the text information corresponding to the audio signal is superimposed with the video signal, indication information indicating where the current audio signal lies in the video signal is obtained, so that the superposition is performed in the correct video area.
  • FIG. 2 is a flowchart of Embodiment 2 of a method for superimposing auxiliary information of a video signal according to an embodiment of the present invention.
  • This embodiment assumes a point-to-point connection between sites: for example, a communication connection is established between a first site and a second site, where the first site is the signal collecting and transmitting end and the second site is the signal receiving and displaying end.
  • The embodiment explains how the information superimposing device located at the second site acquires the audio signal, the video signal and the indication information, how it superimposes the auxiliary information onto the video signal, and how it displays the superimposed video signal.
  • the method in this embodiment includes the following steps:
  • Step 200: The second site receives an audio signal sent by the first site and at least one video signal that contains multiple video objects.
  • the audio signal and the video signal are directly sent to the second site through the network.
  • The video signal containing the multiple video objects of the first site may be a single video signal or multiple video signals; that is, the correspondence between video signals and video objects may be one-to-one or one-to-many.
  • The specific correspondence varies with the arrangement of the cameras and the participants in the first site. For example, three cameras may be used in the first site, yielding three video signals; or a wide-angle or panoramic camera may be used to capture one complete video signal of the first site.
  • Step 201: The second site acquires indication information indicating the video area where the video object corresponding to the audio signal is located in the video signal. After receiving the audio signal and video signal sent by the first site, in order to superimpose the text information corresponding to the audio signal with the corresponding video object in the corresponding video signal, the second site further obtains this indication information. The indication information may specifically be image location information indicating where in the video the participant currently speaking at the first site appears.
  • As noted in step 200, the video signal received by the second site may be one or several. When there is a single video signal, the indication information only indicates the position in that signal of the video object corresponding to the current audio signal, that is, the image position of the participant currently speaking at the first site. When there are multiple video signals, the indication information indicates not only the position of that video object within its video signal but also which of the video signals currently corresponds to the audio signal. Based on the indication information, the second site can therefore learn both which video signal contains the video object of the currently speaking participant of the first site and that participant's specific image position within the corresponding video signal.
  • the step of obtaining the indication information by the second site may be specifically implemented in the following manners:
  • In a first implementation, the second site may extract the indication information from the received audio signal and video signals.
  • The extraction may be done in two ways. In the first manner, the second site extracts from the received audio signal sound source orientation information indicating the orientation of the current audio signal within the first site, and then, using the correspondence between sound source orientation and video orientation, converts it into the image location information of the video area where the video object corresponding to the audio signal is located; this image location information is the indication information.
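The conversion from sound source orientation to video area can be illustrated as a small lookup. The azimuth boundaries and screen names below are assumptions for illustration; the patent only requires some stored correspondence between sound source orientation and video orientation.

```python
import bisect

# Assumed correspondence: horizontal azimuth in degrees
# (-90 = far left, +90 = far right) split across three screens.
AZIMUTH_BOUNDS = [-30, 30]
SCREENS = ["left", "middle", "right"]

def azimuth_to_screen(azimuth_deg):
    """Convert the extracted sound-source orientation into the
    video area (screen) holding the speaking participant's image."""
    return SCREENS[bisect.bisect_right(AZIMUTH_BOUNDS, azimuth_deg)]
```

A real system would store a calibrated table for its own camera and screen geometry; the principle is the same table lookup.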
  • Because the audio signal transmitted between sites is usually a multi-channel signal, the second site can compare the energy of each channel in the multi-channel signal. From the comparison it identifies the channel with the largest energy, and the sound source direction associated with that channel is taken as the sound source orientation of the current audio signal. Using the stored correspondence between each channel and a horizontal orientation, the second site determines the orientation of the largest-energy channel as the sound source orientation of the current audio signal, thereby extracting the sound source orientation information.
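The channel-energy comparison can be sketched as follows; the per-channel energy measure (sum of squared samples) is an assumption, as the patent does not specify one.

```python
def loudest_channel(frames):
    """frames: one audio block as a list of per-channel sample
    sequences. Returns the index of the channel with the largest
    energy, which is treated as the current sound-source direction."""
    energies = [sum(s * s for s in ch) for ch in frames]
    return max(range(len(energies)), key=energies.__getitem__)

# Toy block in which channel 2 carries the active talker.
block = [[0.01, -0.02], [0.0, 0.01], [0.5, -0.4], [0.02, 0.0]]
```

The winning channel index is then mapped to a horizontal orientation via the stored channel-to-orientation table.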
  • In the second manner, the second site may extract the image location information, that is, the indication information, directly from the one or more video signals sent by the first site. Specifically, after receiving the video signals of the first site, the second site can capture and detect the motion state of the lips of each video object (each participant) in the images of the video signals, that is, detect whether each participant's lips show an opening-and-closing movement. It thereby determines the video signal corresponding to the participant currently speaking at the first site, and the image position of that participant within that video signal. Using the stored correspondence between each video signal and its image locations, the second site can then determine the image position information of the currently speaking participant of the first site, and so obtain the indication information.
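A minimal sketch of the lip open/close detection, using inter-frame differencing over a given mouth region. In practice each participant's mouth region would first be located with a face or landmark detector; here the region is assumed given, and the threshold is an arbitrary illustrative value.

```python
def lips_moving(mouth_rois, threshold=10.0):
    """mouth_rois: grayscale pixel grids (rows of pixel values) of
    one participant's mouth region over consecutive frames. Flags
    speech when the mean absolute inter-frame change exceeds the
    threshold."""
    diffs = []
    for prev, cur in zip(mouth_rois, mouth_rois[1:]):
        total = sum(abs(a - b)
                    for row_p, row_c in zip(prev, cur)
                    for a, b in zip(row_p, row_c))
        npix = len(prev) * len(prev[0])
        diffs.append(total / npix)
    return sum(diffs) / len(diffs) > threshold

# A mouth opening and closing between frames vs. one that is still.
still  = [[[100, 100], [100, 100]]] * 3
moving = [[[100, 100], [100, 100]],
          [[160, 160], [160, 160]],
          [[100, 100], [100, 100]]]
```

The participant whose mouth region triggers this test identifies both the video signal and the image position for the indication information.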
  • In a second implementation, the second site may directly receive the indication information sent by the first site; that is, the step of extracting the indication information from the audio signal and the video signal is performed at the first site, which sends the indication information directly to the second site. For example, the first site may send the indication information together with the collected audio signal and video signals, so that the second site can directly learn from it which video object is currently speaking.
  • The first site may determine the indication information locally in several ways. For example, if a dedicated microphone is provided for each participant, then when any participant at the first site speaks, the microphone of the speaking participant records the sound source orientation information corresponding to the current audio signal, which can be converted into the corresponding image location information and sent together with the audio signal and video signals to the second site. Second, if a microphone array is set up for the participants, the array can acquire the sound source orientation information corresponding to the audio signal while capturing the audio signal. Again, even without a microphone array or per-participant microphones, the indication information may be entered manually by the site's management personnel, and the first site can then send it to the second site. There may thus be multiple methods for determining, at the first site, the video signal corresponding to the local audio signal, and the embodiment of the present invention does not limit this.
  • It should be noted that the audio signal received by the second site in step 200 may be more than one. In that case, the number of pieces of indication information acquired by the second site should equal the number of audio signals, with each piece of indication information indicating the video signal corresponding to its audio signal, and the video area in that signal where the corresponding video object lies.
  • Step 202: The second site acquires the sign-language gesture information corresponding to the audio signal and/or the basic identity information of each participant of the first site.
  • Before the text information corresponding to the audio signal is superimposed with the corresponding video signal at the second site, the second site may also obtain sign-language gesture information corresponding to the current audio signal. To further facilitate communication between the participants, the second site may also obtain the basic identity information corresponding to each participant, that is, each video object, of the first site, so that during the video signal superposition processing the sign-language gesture information and the basic identity information are superimposed into the corresponding video signal together with the text information. The basic identity information is in text format, and its content may include each participant's name, title, and similar details.
  • The sign-language gesture information may be produced by the second site converting the audio signal locally after receiving it, or it may be converted from the audio signal by the first site after collection and carried with the audio signal sent to the second site. In either case, if the sign-language gesture information is converted directly from the audio signal, then when it is superimposed into the corresponding video signal and shown on the corresponding display screen, what is displayed is a virtual figure making sign-language gestures. In practical applications, instead of converting the audio signal directly, the sign-language gesture information may be obtained by arranging at the first site a translator who renders the participants' speech into sign language, and setting up a corresponding camera for the translator; the captured video signal of the translator is then sent over the network to the second site. In that case, the video signal corresponding to the translator is the sign-language gesture information for the current audio signal, and when it is superimposed into the corresponding video signal and shown on the display screen, what is displayed is a real person making the gestures.
  • As for the basic identity information of each participant of the first site acquired by the second site: if the second site acquires this information before the video signals are superimposed, then during the superposition the second site superimposes, in addition to the text information corresponding to the audio signal and the sign-language gesture information in the specified video area of the specified video signal, the basic identity information of the video object specified by the indication information with the specified video signal.
  • Step 203: The second site acquires the text information corresponding to the audio signal.
  • To superimpose the text information corresponding to the audio signal into the corresponding video signal, the audio signal needs to be converted into corresponding text information. Specifically, in this step, after receiving the audio signal the second site may perform speech recognition locally to generate the text information corresponding to the audio signal. It should be noted that the conversion can also be performed at the first site: after collecting the audio signal, the first site performs speech recognition locally to generate the corresponding text information, and sends it to the second site together with the audio signal. Alternatively, the text information can be entered manually by the conference administrator of the first site. When the audio signal is converted into text information, the conference administrator may also choose to convert it into multiple pieces of text information in different languages, so that subtitle information in various languages can be displayed on the screen.
  • Step 204: The second site superimposes the text information, the sign-language gesture information and/or the basic identity information of the video object specified by the indication information with the video signal specified by the indication information.
  • Under the guidance of the indication information, the second site superimposes the text information, the sign-language gesture information and/or the basic identity information corresponding to the specified video object with the video signal specified by the indication information, in the specified video area, thereby completing the superposition of the various auxiliary information.
  • It should be noted that when the second site obtains multiple audio signals from the first site, it also obtains multiple pieces of indication information corresponding to those audio signals. When superimposing the video signals, the second site superimposes the text information of each audio signal with the video signal specified by the corresponding indication information, at the video position that the indication specifies; similarly, the corresponding sign-language gesture information and/or the basic identity information of the corresponding participant may also be superimposed into the video signal.
  • Step 205: The second site superimposes the video signals other than the specified video signal with the corresponding basic identity information.
  • After the second site has superimposed the auxiliary information, such as the text information corresponding to the audio signal and the sign-language gesture information, with the corresponding video signal in the specified video area, it may do the following if it has also obtained the basic identity information of the participants of the first site corresponding to each of the first site's video signals: in this step, the second site further superimposes, in the corresponding video regions, the basic identity information of the video objects other than the one specified by the indication information with the video signals of those video objects. In this way, when the superimposed video signals are displayed on the corresponding display screens at the second site, its participants can also see the basic identity information near the images of all the participants of the first site shown on the screens.
  • Step 206: The second site displays the superimposed video signals on the corresponding display screens. After the auxiliary information has been superimposed with the corresponding video signals, the second site displays the processed video signals on the corresponding screens. Because the superposition at the second site is performed under the guidance of the indication information, the subtitle text corresponding to the audio signal, the sign-language gesture information and the participants' basic identity information can be accurately superimposed at the video position corresponding to the participant currently speaking at the first site. This ensures that, when the superimposed video signals are displayed on the screens of the second site, the image of the currently speaking participant and the display orientation of each piece of auxiliary information are fully consistent.
  • Each participant at the second site can thus view on the display screen the image of the participant speaking at the opposite site, and also see, around that image, the subtitle information corresponding to the speech and that participant's basic identity information. Further, when there is a deaf participant at the second site, that participant can directly see on the display screen the sign-language gestures corresponding to the speech content, which greatly facilitates communication between the second site and the participants of the first site.
  • Although the foregoing steps of this embodiment describe the acquisition of the various auxiliary information and indication information and the superposition processing of the video signals as taking place at the second site, these steps may also be performed at the first site. That is, the first site acquires the indication information, superimposes the video signals and the auxiliary information under its guidance, and sends the superimposed video signals directly to the second site. With that implementation, the second site need not perform any superposition on the received video signals; the effects described above are obtained simply by displaying the received video signals on its display screens.
  • FIG. 4 is a schematic diagram showing the effect of displaying a superimposed video signal on a multi-screen according to an embodiment of the present invention.
  • In the example of FIG. 4, the second site superimposes the text information generated from the audio signal, the sign-language gesture information converted from that text information, and the basic identity information of the participant corresponding to the video signal of sequence number 2, with the video signal of sequence number 2; each processed video signal is finally displayed on one of the multiple screens of the second site. As shown in FIG. 4, the text information corresponding to the audio signal may be displayed under the corresponding image, the basic identity information above it, and the sign-language gesture information on either side of it, which keeps the displayed auxiliary information consistent with the image.
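The layout just described for FIG. 4 (caption below the speaker's image, identity above, gesture beside) can be expressed as anchor-point arithmetic. The pixel offsets here are assumptions, not values from the patent.

```python
def overlay_positions(region):
    """Given the (x, y, w, h) image region of the speaking
    participant, return illustrative anchor points for each piece
    of auxiliary information, following the FIG. 4 arrangement."""
    x, y, w, h = region
    return {
        "text":     (x, y + h),    # caption below the image
        "identity": (x, y - 24),   # identity line above the image
        "gesture":  (x + w, y),    # gesture video beside the image
    }
```

Because every anchor is computed from the speaker's own region, the auxiliary information automatically follows the speaker to whichever screen the indication information names.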
  • Since the second site also superimposes the video signals other than the specified video signal with their corresponding basic identity information, in the effect diagram shown in FIG. 4 the basic identity information of the other three participants is likewise displayed near the images of those three participants.
  • the specific display position of the various auxiliary information on the display screen can be determined according to specific needs, and the embodiment of the present invention does not limit this.
• Step 207: The second site plays the audio signal according to the sound source orientation information corresponding to the indication information.
• In addition to displaying the video signals, the second site also processes the audio signal according to the sound source orientation information corresponding to the indication information, so that the audio signal transmitted by the peer site is played according to the sound source orientation information. If, in Step 201, the sound source orientation information is extracted from the audio signal and the indication information is obtained by conversion using the correspondence between sound source orientation and video orientation, the second site directly plays the received audio signal according to the extracted sound source orientation information. If, in Step 201, the indication information is obtained from lip motion detection of the video objects in the video signal, the second site likewise uses the correspondence between sound source orientation and video orientation to convert the obtained indication information into the corresponding sound source orientation information, and then plays the received audio signal according to that sound source orientation information, ensuring the consistency between sound and image at the second site end.
• If the audio signal sent by the first site is a multi-channel signal, the multi-channel signal itself contains the sound source orientation information, and the second site directly plays the multi-channel signal through a corresponding number of loudspeakers at the site, so that the played sound has a sense of direction. Therefore, in this case there is no need to perform additional processing on the audio signal according to the sound source orientation information in this step; it suffices to play the multi-channel signal directly through the corresponding number of loudspeakers.
• The method for superimposing auxiliary information of a video signal in this embodiment is applied in a multi-image video conference system. Before the text information corresponding to the audio signal is superimposed with the video signal in the second site, the indication information indicating the video region where the video object corresponding to the current audio signal is located in the video signal is acquired, and during the superimposition, the text information corresponding to the current audio signal is superimposed with the video signal, according to the indication information, in the video region where the video object corresponding to the audio signal is located. Thus, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring the consistency of the display orientations of the image and the subtitles. Moreover, by acquiring the sign language gesture information corresponding to the audio signal and/or the basic identity information of each participant of the peer site, and superimposing the sign language gesture information and/or the basic identity information with the corresponding video objects at the same time as superimposing the text information corresponding to the participant who is currently speaking, not only is the basic information of each participant of the peer site displayed at the corresponding position on the display screens of the local site, but the sign language gestures of the speaker of the peer site are also displayed at the corresponding position, further facilitating communication between participants.
• FIG. 5 is a flowchart of Embodiment 3 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention. The method in this embodiment takes the point-to-multipoint connection mode through an MCU as an example, and explains the specific process of how the audio signal, the video signal, and the indication information are acquired, how the auxiliary information is superimposed onto the video signal, and how the superimposed video signal is sent to the desired sites for display. Taking the site shown in FIG. 3 as an example, the three parts of the display screen shown in FIG. 3 can display the image information of participants from three different sites; in this video conference, four sites participate in the conference. The method in this embodiment mainly includes the following steps:
• Step 300: The first site sends the collected audio signal and at least one video signal including multiple video objects to the MCU.
• In this embodiment, the first site is connected to the other sites through the MCU. After the MCU receives the connection requests among the multiple sites and establishes the connections among them, the MCU can, upon receiving the audio and video signals sent by any site, forward the received audio signal and video signal to the other sites according to the established connection relationships; and in this embodiment, the superimposition of the video signal and the auxiliary information may also be performed by the MCU. The video signal sent by the first site and received by the MCU includes the multiple video objects of the first site.
• Step 301: The MCU acquires indication information used to indicate the video region where the video object corresponding to the audio signal is located in the video signal.
• Like the second site in the previous embodiment, the MCU can extract the indication information from the received audio signal and video signal, or the first site can extract the indication information and send it directly to the MCU; for the implementation of extracting the indication information from the audio signal or the video signal by the MCU or the first site, reference may be made to the description of the previous embodiment.
• The video signals received by the MCU may likewise be one or multiple. When there is one video signal, the indication information is used only to indicate the video position of the video object corresponding to the current audio signal in that video signal, that is, the image position, within the video signal, of the participant of the first site who is currently speaking. When there are multiple video signals, in addition to indicating the video position, within the corresponding video signal, of the video object corresponding to the audio signal, that is, the participant of the first site who is currently speaking, the indication information is also used to indicate which of the multiple video signals currently corresponds to the audio signal. Based on the indication information, the MCU can therefore learn not only which video signal contains the video object corresponding to the participant who is currently speaking in the first site, but also the specific image position of that participant within the corresponding video signal.
• It should be noted that, if multiple participants are speaking at the same moment in the first site, the audio signals sent by the first site in the foregoing Step 300 may also be multiple. In this case, the pieces of indication information acquired by the MCU should correspondingly also be equal in number to the audio signals, and each piece of indication information is used to indicate the video signal corresponding to the corresponding audio signal, and the video region, within the corresponding video signal, of the video object corresponding to that audio signal.
• Step 302: The MCU acquires the sign language gesture information corresponding to the audio signal and/or the basic identity information corresponding to each participant of the first site.
• Before the text information corresponding to the audio signal is superimposed with the corresponding video signal in the MCU, the MCU may also acquire the sign language gesture information corresponding to the current audio signal, and the basic identity information corresponding to each participant of the first site, that is, each video object included in the video signal. The basic identity information is specifically in text format, and the content it includes may specifically be information related to each participant such as name and position. For the methods by which the MCU obtains the sign language gesture information and the participants' basic identity information, reference may likewise be made to the description of the corresponding steps performed by the second site in the previous embodiment.
• Step 303: The MCU acquires text information corresponding to the audio signal.
• The MCU may locally perform speech recognition processing on the audio signal to generate the text information corresponding to the audio signal; alternatively, the text conversion operation may be performed in the first site: the first site may perform speech recognition at the local end to generate the text information corresponding to the audio signal, and send the corresponding text information to the MCU together with the audio signal; or the text information may be manually input by the conference administrator of the first site. When the audio signal is converted into text information, or when the conference administrator inputs the text information manually, the audio signal may be converted into multiple pieces of text information corresponding to different languages, so as to display subtitle information in various languages on the display screens.
• Step 304: The MCU superimposes the text information, the sign language gesture information, and/or the basic identity information of the video object specified by the indication information with the video signal specified in the indication information.
• After acquiring the text information corresponding to the audio signal, the indication information used to indicate the video object corresponding to the audio signal, and the sign language gesture information and/or the basic identity information, in this step, under the instruction of the indication information, the MCU superimposes the text information, the sign language gesture information, and/or the basic identity information corresponding to the video object specified in the indication information with the video signal specified in the indication information, in the specified video region, so as to superimpose the various kinds of auxiliary information onto the video signal corresponding to the audio signal.
• If the MCU obtains multiple audio signals from the first site, and obtains from the first site multiple pieces of indication information respectively corresponding to the multiple audio signals, then when superimposing the video signals, the MCU should also, according to each piece of indication information corresponding to each audio signal, superimpose the text information corresponding to each audio signal with the video signal specified in its respective indication information, at the specified video position. While superimposing the text information, the sign language gesture information and/or the corresponding basic identity information of the participant may also be superimposed onto the video signal.
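The placement rule illustrated in FIG. 4 (subtitles below the speaker's image, identity information above it, the sign language gesture window to one side) can be sketched as a small layout computation. The margin and box sizes below are illustrative assumptions, not values from the patent.

```python
# Layout sketch for Step 304 / FIG. 4: given the video region of the
# speaking participant (from the indication information), place subtitles
# below the image, identity info above it, and the sign-language gesture
# window to one side. Bar height and gesture width are assumptions.

def layout_overlays(region, frame_w, frame_h, bar_h=40, gesture_w=120):
    """`region` = (x, y, w, h) of the speaker's image inside the frame.
    Returns a dict of overlay rectangles, each (x, y, w, h)."""
    x, y, w, h = region
    overlays = {
        # subtitle strip directly under the speaker's image,
        # clamped so it stays inside the frame
        'text': (x, min(y + h, frame_h - bar_h), w, bar_h),
        # identity strip directly above the image, clamped at the top
        'identity': (x, max(y - bar_h, 0), w, bar_h),
    }
    # gesture window on whichever side of the image has room
    if x + w + gesture_w <= frame_w:
        overlays['gesture'] = (x + w, y, gesture_w, h)
    else:
        overlays['gesture'] = (max(x - gesture_w, 0), y, gesture_w, h)
    return overlays
```

Because every rectangle is derived from the speaker's own region, each piece of auxiliary information stays adjacent to the corresponding image, which is the orientation-consistency property the embodiment aims for.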
• Step 305: The MCU superimposes the video signals other than the specified video signal with the corresponding basic identity information.
• Step 306: The MCU sends the audio signal and the processed video signals to multiple second sites connected to the first site.
• Step 307: The second sites display the superimposed video signals on the corresponding display screens respectively.
• After superimposing the acquired auxiliary information with the corresponding video signals, the MCU sends the audio signal and the processed video signals to the multiple second sites that have established communication connections with the first site. For each second site, because the various kinds of auxiliary information have already been superimposed onto each video signal it receives, the second site does not need to perform additional processing on the received video signals and can directly display the multiple received video signals on the respective display screens; among all the displayed auxiliary information, each piece of auxiliary information is consistent in orientation with the corresponding image.
• The acquisition of the various kinds of auxiliary information and the indication information and the superimposition processing of the video signal described in the above steps of this embodiment are performed in the MCU. These steps may also be performed in the first site, or in the multiple second sites connected to the first site through the MCU: that is, the first site acquires the indication information and, under the instruction of the indication information, superimposes the video signal and sends the superimposed video signal to the multiple second sites through the MCU; or each second site receives the unprocessed audio signal and video signal and then acquires the indication information and the auxiliary information to superimpose the video signal with the auxiliary information. In either case, the second site can also obtain the effects described in this embodiment after displaying the superimposed video signal.
• Step 308: The second site plays the audio signal according to the sound source orientation information corresponding to the indication information.
• The second site also processes the audio signal according to the sound source orientation information corresponding to the indication information, so as to play the audio signal transmitted by the peer site according to the sound source orientation information.
• The method for superimposing auxiliary information of a video signal in this embodiment is applied in a multi-image video conference system. Before the MCU superimposes the text information corresponding to the audio signal with the video signal, the indication information indicating the video region where the video object corresponding to the current audio signal is located in the video signal is acquired, and during the superimposition, the text information corresponding to the current audio signal is superimposed with the video signal, according to the indication information, in the video region where the video object corresponding to the audio signal is located. Thus, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring the consistency of the display orientations of the image and the subtitles. Moreover, by acquiring the sign language gesture information corresponding to the audio signal and/or the basic identity information of the participants of the peer site, and superimposing the sign language gesture information and/or each piece of basic identity information with the corresponding video objects while superimposing the text information with the video signal corresponding to the participant who is currently speaking, not only is the basic information of each participant of the peer site displayed at the corresponding position on the display screens of the local site, but the sign language gestures of the speaker of the peer site are also displayed at the corresponding position, further facilitating communication between participants.
• FIG. 6 is a schematic structural diagram of Embodiment 1 of an apparatus for superimposing auxiliary information of a video signal according to an embodiment of the present invention. The apparatus of this embodiment includes at least: a signal acquiring module 11, an indication information acquiring module 12, and a signal superimposing module 13.
• The signal acquiring module 11 is configured to acquire an audio signal of the first site and at least one video signal of the first site, where the at least one video signal includes multiple video objects in the first site. The indication information acquiring module 12 is configured to acquire indication information, where the indication information is used to indicate, among the multiple video objects of the at least one video signal acquired by the signal acquiring module 11, the video region where the video object corresponding to the acquired audio signal is located. The signal superimposing module 13 is configured to superimpose, according to the indication information acquired by the indication information acquiring module 12, the text information corresponding to the audio signal of the first site with the video signal acquired by the signal acquiring module 11, so that the text information is displayed in the video region indicated by the indication information.
• The apparatus of this embodiment may be disposed in a site terminal or in the MCU. If disposed in a terminal, the apparatus may be disposed in the first site, where the video signal undergoes the corresponding information superimposition before the first site sends the audio signal and the video signal to the second site; or the apparatus may be disposed in the second site, where the second site performs the corresponding information superimposition on the video signal after receiving the audio signal and the video signal sent by the first site. If disposed in the MCU, the apparatus may perform the corresponding information superimposition on the video signal after the MCU receives the audio signal and the video signal transmitted by any site.
• The apparatus for superimposing auxiliary information of a video signal in this embodiment is applied in a multi-image video conference application scenario. Before the text information corresponding to the audio signal is superimposed with the video signal, indication information indicating the video region where the video object corresponding to the current audio signal is located in the video signal is acquired, so that the text information can be superimposed in that video region and the consistency of the display orientations of the image and the subtitles is ensured.
• FIG. 7 is a schematic structural diagram of Embodiment 2 of the apparatus for superimposing auxiliary information of a video signal according to an embodiment of the present invention. In this embodiment, the video signals acquired by the signal acquiring module 11 may be one or multiple. If the signal acquiring module 11 acquires multiple video signals, the video region indicated by the indication information acquired by the indication information acquiring module 12 is the video position, within the first video signal, of the video object corresponding to the audio signal, where the first video signal is the video signal corresponding to the audio signal among the multiple video signals. If the signal acquiring module 11 acquires one video signal, the video region indicated by the indication information acquired by the indication information acquiring module 12 is the video position, within that video signal of the first site, of the video object corresponding to the audio signal.
• In this embodiment, the indication information acquiring module 12 may include at least one of the following sub-modules: a first information acquiring sub-module 121 or a second information acquiring sub-module 122. The first information acquiring sub-module 121 is configured to, if the audio signal of the first site is a multi-channel signal, determine that the direction corresponding to the channel signal with the largest energy in the multi-channel signal is the sound source direction of the video object corresponding to the audio signal, and obtain the indication information accordingly. The second information acquiring sub-module 122 is configured to detect the lip motion of each participant in the video signal of the first site, determine that the participant whose lips exhibit opening and closing movements is the video object corresponding to the audio signal, and obtain the indication information indicating the video region where that video object is located.
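The second sub-module's lip-motion detection can be sketched with simple frame differencing over each participant's mouth region. The patent does not prescribe a detection algorithm; the per-pixel grayscale representation, the mean-absolute-difference measure, and the threshold below are all illustrative assumptions.

```python
# Minimal sketch of the second information acquiring sub-module 122:
# decide which participant's lips are opening and closing by measuring
# frame-to-frame change inside each participant's mouth region.
# Pixel format, difference measure, and threshold are assumptions.

def detect_speaker(prev_frame, cur_frame, mouth_regions, threshold=10.0):
    """`prev_frame` / `cur_frame`: 2-D lists of grayscale pixel values.
    `mouth_regions`: {participant_id: (x, y, w, h)} mouth bounding boxes.
    Returns the id whose mouth region changed most, if above threshold,
    else None (nobody appears to be speaking)."""
    best_id, best_motion = None, threshold
    for pid, (x, y, w, h) in mouth_regions.items():
        diff = 0.0
        for row in range(y, y + h):
            for col in range(x, x + w):
                diff += abs(cur_frame[row][col] - prev_frame[row][col])
        motion = diff / (w * h)          # mean absolute per-pixel change
        if motion > best_motion:
            best_id, best_motion = pid, motion
    return best_id
```

The returned participant id, combined with the stored correspondence between participants and image positions, yields the indication information described above.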
• Further, the apparatus for superimposing auxiliary information of a video signal in this embodiment may also include an auxiliary information acquiring module 14. The auxiliary information acquiring module 14 is configured to acquire, before the signal superimposing module 13 superimposes the text information corresponding to the audio signal of the first site with the video signal according to the indication information, the sign language gesture information corresponding to the audio signal and/or the basic identity information of each participant of the first site. Accordingly, the signal superimposing module 13 in this embodiment may be further configured to superimpose the sign language gesture information acquired by the auxiliary information acquiring module 14 and/or the basic identity information of each participant of the first site with the video signal, so that the sign language gesture information and/or the basic identity information of the video object are displayed in the video region indicated by the indication information.
• Further, the apparatus of this embodiment may also include a signal display module 15, configured to display the superimposed video signal on the corresponding display screen after the signal superimposing module 13 superimposes the text information corresponding to the audio signal of the first site with the video signal according to the indication information.
• Further, the apparatus of this embodiment may also include either a first signal playing module 161 or a second signal playing module 162. The first signal playing module 161 is configured to, when the indication information is obtained by conversion from the sound source orientation information of the audio signal according to the correspondence between sound source orientation and video orientation, play the audio signal according to the sound source orientation information of the audio signal after the signal superimposing module 13 superimposes the text information corresponding to the audio signal of the first site with the video signal according to the indication information. The second signal playing module 162 is configured to, when the indication information is obtained from lip motion detection, obtain the sound source orientation information of the audio signal by using the correspondence between sound source orientation and video orientation after the signal superimposing module 13 superimposes the text information corresponding to the audio signal of the first site with the video signal according to the indication information, and play the audio signal according to the sound source orientation information of the audio signal.
• The apparatus for superimposing auxiliary information of a video signal in this embodiment is applied in a multi-image video conference application scenario. Before the text information corresponding to the audio signal is superimposed with the video signal, indication information indicating the video region where the video object corresponding to the current audio signal is located in the video signal is acquired, so that the display orientations of the image and the subtitles remain consistent. Moreover, by also acquiring the sign language gesture information corresponding to the audio signal and the basic identity information of the participants of the peer site corresponding to each video signal, the basic information of each participant of the peer site is displayed at the corresponding position on the display screens of the local site, and the sign language gestures of the speaker of the peer site are displayed at the corresponding position, further facilitating smooth communication between participants.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Description

Method and Apparatus for Superimposing Auxiliary Information of a Video Signal

TECHNICAL FIELD

The present invention relates to the field of multi-screen video communication technologies, and in particular, to a method and an apparatus for superimposing auxiliary information of a video signal.

BACKGROUND
Telepresence is a teleconferencing technology that has emerged in recent years and integrates video communication with the communication experience. Because it offers life-size images, ultra-high definition, and low latency, and emphasizes the effect of true face-to-face communication, it delivers a strong sense of realism and presence and has been widely used in a variety of video conferencing scenarios.
A telepresence conference system pays attention to the consistency between sound orientation and image, and can therefore satisfy the participants' requirements on image and sound very well. However, existing telepresence technology still has certain deficiencies. In a telepresence conference, participants speaking different languages are very likely to be present, so language barriers, and in particular hearing barriers, may exist between participants of different languages. Moreover, even among participants who share the same language, lapses of attention or other objective causes may leave a participant unable to hear the other party's speech clearly. In view of these situations, displaying the participants' speech as subtitles at the bottom of the screen in a telepresence conference scenario would greatly facilitate communication between the participants.
In existing conventional video conference systems, various techniques already exist for converting a participant's speech signal into subtitle information and displaying it on the screen together with the image. However, none of these subtitle display techniques takes the characteristic scenario of a telepresence conference into account, so applying them directly to a telepresence conference scenario brings some defects. For example, a telepresence site usually includes multiple screens that respectively display multiple participants of the remote site; following the conventional video conference subtitle display method, it is impossible to know on which screen the subtitle information should be displayed. If the subtitle information is simply displayed on the middle screen while the speaker appears on the left or right screen, the display orientations of the image and the subtitles become inconsistent, so that a local participant can watch either the speaker's image or the subtitles, but not both, which is inconvenient for the participants.

SUMMARY
Embodiments of the present invention provide a method and an apparatus for superimposing auxiliary information of a video signal, so as to overcome the defect in the existing telepresence conference technology that the display orientations of subtitles and images are inconsistent.
To achieve the above objective, an embodiment of the present invention provides a method for superimposing auxiliary information of a video signal, including:

acquiring an audio signal of a first site and at least one video signal of the first site, where the at least one video signal includes multiple video objects in the first site;

acquiring indication information, where the indication information is used to indicate, among the multiple video objects of the at least one video signal, the video region where the video object corresponding to the audio signal is located; and

superimposing, according to the indication information, text information corresponding to the audio signal of the first site with the video signal, so that the text information is displayed in the video region indicated by the indication information.
To achieve the above objective, an embodiment of the present invention further provides an apparatus for superimposing auxiliary information of a video signal, including:

a signal acquiring module, configured to acquire an audio signal of a first site and at least one video signal of the first site, where the at least one video signal includes multiple video objects in the first site;

an indication information acquiring module, configured to acquire indication information, where the indication information is used to indicate, among the multiple video objects of the at least one video signal, the video region where the video object corresponding to the audio signal is located; and

a signal superimposing module, configured to superimpose, according to the indication information, text information corresponding to the audio signal of the first site with the video signal, so that the text information is displayed in the video region indicated by the indication information.
The method and apparatus for superimposing auxiliary information of a video signal provided by the embodiments of the present invention are applied in a multi-screen video communication scenario. Before the text information corresponding to an audio signal is superimposed with a video signal, indication information is acquired that indicates the video region where the video object corresponding to the current audio signal is located in the video signal, and during the superimposition, the text information corresponding to the current audio signal is superimposed with the video signal in that video region according to the indication information. Therefore, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring the consistency of the display orientations of the image and the subtitles.

BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of Embodiment 1 of a method for superimposing auxiliary information of a video signal according to an embodiment of the present invention;

FIG. 2 is a flowchart of Embodiment 2 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a site to which the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention is applied;

FIG. 4 is a schematic diagram of the display effect of superimposed video signals on multiple screens in an embodiment of the present invention;

FIG. 5 is a flowchart of Embodiment 3 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of Embodiment 1 of an apparatus for superimposing auxiliary information of a video signal according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of Embodiment 2 of the apparatus for superimposing auxiliary information of a video signal according to an embodiment of the present invention.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of Embodiment 1 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention. As shown in FIG. 1, the method in this embodiment includes the following steps:

Step 100: Acquire an audio signal of a first site and at least one video signal that includes multiple video objects in the first site.
The method for superimposing auxiliary information of a video signal in the embodiments of the present invention may be applied in a telepresence conference system, or in another type of video communication system in which a site is provided with multiple display screens respectively corresponding to multiple participants of the remote site, and the steps in the embodiments of the present invention may be performed by an information superimposing apparatus disposed in a site or a server. In these video conference systems, each site terminal is provided with multiple display screens for respectively displaying the multiple participants of the remote site. Therefore, after a normal communication connection is established with the remote site, each site terminal sends to the remote site at least one video signal that includes the multiple participants of the local site, that is, the multiple video objects of the local site, as well as the current audio signal of the local site. The audio signal is produced by the participant who is currently speaking among the multiple participants of the local site; specifically, the audio signal corresponds to one of the multiple video objects included in the video signal.
A telepresence conference system usually has two connection modes: one is a point-to-point connection between two sites over a network, and the other is a point-to-multipoint connection among multiple sites through a Multipoint Control Unit (MCU) disposed among them. In the point-to-point connection mode, the local site and the remote site transmit signal data directly, whereas in the point-to-multipoint connection mode, signals between sites are forwarded by the MCU. Accordingly, the information superimposing apparatus may be disposed either in a site terminal or in the MCU. The embodiments of the present invention describe both implementations, and in the embodiments of the present invention, the local site that collects the current audio signal and video signal and sends the collected signals to the remote site is referred to as the first site. In this step, regardless of whether the information superimposing apparatus used for superimposing auxiliary information on the video signal is disposed in the first site, in the remote site opposite the first site, or in the MCU, the apparatus receives the audio signal of the first site and at least one video signal that includes multiple video objects.
Step 101: Acquire indication information, where the indication information is used to indicate, among the multiple video objects of the at least one video signal, the video region where the video object corresponding to the acquired audio signal is located.

After receiving the video signal and the audio signal of the first site, in order to accurately superimpose the text information corresponding to the audio signal, in the form of subtitles, onto the corresponding video signal and onto the corresponding video region of that video signal, so that when the superimposed video signal is displayed on the display screens of the remote site, the subtitle information appears below the image of the participant who is speaking, thereby facilitating communication between the participants of the sites, the information superimposing apparatus in the embodiment of the present invention further acquires indication information that indicates the video region where the video object corresponding to the audio signal is located in the video signal.
Step 102: Superimpose, according to the indication information, the text information corresponding to the acquired audio signal with the video signal, so that the text information is displayed in the video region indicated by the indication information.

After obtaining the indication information that indicates the video region where the video object corresponding to the current audio signal is located in the video signal, the information superimposing apparatus may, according to the indication information, superimpose the text information corresponding to the audio signal with the video signal in the video region specified by the indication information. By superimposing the audio signal, in the form of text information, onto the video region corresponding to the audio signal in the video signal, that is, onto the corresponding video object, when the superimposed video signal is displayed in the remote site corresponding to the first site, the participants of the remote site can see, on the display screen, the subtitle information corresponding to the speech of the participant of the first site who is currently speaking, around the image corresponding to that participant, which ensures the consistency of the display orientations of the image and the subtitles.
The method for superimposing auxiliary information of a video signal in this embodiment is applied in a multi-image video conference application scenario. Before the text information corresponding to an audio signal is superimposed with a video signal, indication information is acquired that indicates the video region where the video object corresponding to the current audio signal is located in the video signal; during the superimposition, the text information corresponding to the current audio signal is superimposed with the video signal in that video region according to the indication information. Thus, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring the consistency of the display orientations of the image and the subtitles.

FIG. 2 is a flowchart of Embodiment 2 of the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention. FIG. 3 is a schematic diagram of a site to which the method for superimposing auxiliary information of a video signal according to an embodiment of the present invention is applied. Specifically, this method embodiment takes the point-to-point connection mode between sites as an example, for instance a communication connection established between a first site and a second site, where the first site is the signal collecting and sending end and the second site is the signal receiving and displaying end, and explains the specific process of how the information superimposing apparatus disposed in the second site acquires the audio signal, the video signal, and the indication information, how it superimposes the auxiliary information onto the video signal, and how the superimposed video signal is displayed. As shown in FIG. 2, the method in this embodiment includes the following steps:
Step 200: The second site receives an audio signal sent by the first site and at least one video signal that includes multiple video objects.

In this step, after locally collecting the current audio signal and at least one video signal corresponding to the multiple video objects, the first site sends the audio signal and the video signal directly to the second site over the network. In practice, the video signal that includes the multiple video objects of the first site may be one video signal or multiple video signals; that is, the correspondence between video signals and video objects may be one-to-one or one-to-many, and the specific correspondence varies with the arrangement of the cameras and participants in the first site. For example, three cameras may be used in the first site, producing three video signals; alternatively, a wide-angle camera or panoramic camera may be used, and a single video signal covering the complete image of the first site can be captured by that one camera.
Step 201: The second site acquires indication information used to indicate the video region where the video object corresponding to the audio signal is located in the video signal.

After the second site receives the audio signal and the video signal sent by the first site, in order to superimpose the text information corresponding to the audio signal onto the corresponding video object in the corresponding video signal, the second site further acquires indication information that indicates the video region where the video object corresponding to the audio signal is located in the video signal. Specifically, the indication information may be image position information indicating the position, within the first site, of the participant of the first site who is currently speaking.

Specifically, the video signal received by the second site in the above Step 200 may be one or multiple. When there is one video signal, the indication information is used only to indicate the video position of the video object corresponding to the current audio signal in that video signal, that is, the image position, within the video signal, of the participant of the first site who is currently speaking. When there are multiple video signals, in addition to indicating the video position, within the corresponding video signal, of the video object corresponding to the audio signal, that is, the participant of the first site who is currently speaking, the indication information is also used to indicate which of the multiple video signals currently corresponds to the audio signal, namely the first video signal. Based on the indication information, the second site can therefore learn not only which video signal contains the video object corresponding to the participant who is currently speaking in the first site, but also the specific image position of that participant within the corresponding video signal.
In this embodiment, the step in which the second site acquires the indication information may specifically be implemented in the following ways.

The second site may extract the indication information from the received audio signal and video signal. Specifically, the second site can extract the indication information in two ways.

In the first way, based on the received audio signal, the second site may extract from the audio signal sound source orientation information indicating the direction of the current audio signal within the first site, and, according to the correspondence between sound source orientation and video orientation, convert the extracted sound source orientation information into image position information indicating the video region where the video object corresponding to the audio signal is located; this image position information is the specific indication information. In practice, the audio signal transmitted between sites is usually a multi-channel signal. When the audio signal is a multi-channel signal, the second site may compare the energy of each channel signal in the multi-channel signal and, from the comparison result, identify the channel signal with the largest energy; the sound source direction corresponding to that largest-energy channel signal is the sound source direction corresponding to the current audio signal. The second site can thus determine, according to its stored correspondence between the channel signals and the horizontal directions, that the direction corresponding to the largest-energy channel signal is the sound source direction of the current audio signal, and extract the sound source orientation information accordingly.
In the second way, based on the one or more video signals sent by the first site, the second site may extract the image position information, that is, the indication information, directly from the video signals. Specifically, after receiving the video signals of the first site, the second site may capture and detect the lip motion state of each video object, that is, each participant, contained in the images of the video signals, that is, detect whether the lips of each participant in the images corresponding to the video signals exhibit opening and closing movements, so as to determine the video signal corresponding to the participant of the first site who is currently speaking and the image position of that participant within the corresponding video signal. If, for a certain video signal, the lips of a participant in its corresponding image exhibit opening and closing movements, it can be determined that this participant contained in that video signal is the one currently speaking. The second site can then, according to its stored correspondence between the video signals and the image positions, determine the image position information, within the first site, of the participant who is currently speaking, and likewise obtain the above indication information.
Further, in the embodiment of the present invention, the second site may also directly receive the indication information sent by the first site; that is, the above step of extracting the indication information from the audio signal and the video signal is performed in the first site, and after the first site extracts the indication information from the collected audio signal or video signal according to the method described above, it sends the indication information directly to the second site. Specifically, after extracting the indication information, the first site may send the indication information to the second site together with the collected audio signal and video signal, so that the second site can learn directly from the indication information which video signal corresponds to the currently received audio signal, and the video position, within that video signal, of the video object corresponding to the audio signal.
It should be noted that, besides extraction from the audio signal or video signal as described above, the first site may determine the indication information locally in other ways. For example, if a corresponding microphone is provided for each participant in the first site, then when any participant of the first site speaks, the microphone device corresponding to the speaking participant can record the sound source orientation information corresponding to the current audio signal, so that the first site can convert the sound source orientation information into corresponding image position information and send the image position information together with the audio signal and the video signal to the second site. Second, if a microphone array is provided for the participants in the first site, then based on the functions of the microphone array itself, the array can collect the audio orientation information corresponding to the audio signal while collecting the audio signal. Third, even if the peer site is provided with neither a microphone array nor a microphone for each participant, the first site may obtain the indication information through manual input by the on-site administrators, so that the first site can likewise send the indication information to the second site. In the embodiment of the present invention, the first site may determine the video signal corresponding to the local audio signal in multiple ways, and the embodiment of the present invention does not limit this.
It should be noted that, if multiple participants are speaking at the same moment in the first site, then in this embodiment, the audio signals received by the second site in the above Step 200 may also be multiple. In this case, in this Step 201, the pieces of indication information acquired by the second site should correspondingly also be multiple, equal in number to the audio signals, and each piece of indication information is used to indicate the video signal corresponding to the corresponding audio signal, and the video region, within the corresponding video signal, of the video object corresponding to that audio signal.
Step 202: The second site acquires sign language gesture information corresponding to the audio signal and/or basic identity information corresponding to each participant of the first site.

Preferably, in this embodiment, to further facilitate communication between participants, considering scenarios in which deaf-mute participants attend, before the second site superimposes the text information corresponding to the audio signal with the corresponding video signal, the second site may also acquire sign language gesture information corresponding to the current audio signal; and, to facilitate communication between participants, the second site may also acquire basic identity information corresponding to each participant of the first site, that is, each video object, so that during the video signal superimposition, the sign language gesture information and the basic identity information are superimposed, together with the text information, onto the corresponding video signal. The basic identity information is specifically in text format, and the content it includes may specifically be basic information related to each participant such as name and position.
具体地, 该哑语手势信息可以由第二会场在接收到音频信号后, 在本地端对该音频信号进行转换得到, 或者, 该哑语手势信息还可以由第一会场在采集到音频信号后, 将该音频信号转换成哑语手势信息以将其携带在音频信号中发送给第二会场。无论对应哪种情况, 若哑语手势信息是由音频信号直接进行转换得到的, 当该哑语手势信息被叠加在对应的视频信号中, 以及被显示在对应的显示屏幕上时, 所显示的是打哑语手势的虚拟人。而在实际应用中, 若不采用直接将音频信号转换成哑语手势信息的方案, 该哑语手势信息还可以通过在对端的第一会场中添设负责将参会者的说话内容翻译成哑语手势的翻译员, 以及通过为该翻译员设置对应的摄像头, 以将拍摄到的翻译员的视频信号通过网络传送至第二会场而得到。对应这种情况, 该翻译员所对应的视频信号便为与当前音频信号对应的哑语手势信息, 而当该哑语手势信息被叠加在对应的视频信号中, 以及被显示在对应的显示屏幕上时, 所显示的是打哑语手势的真人。
而对于第二会场获取到的第一会场中各名参会者的基本身份信息而言, 若第二会场在视频信号进行叠加处理之前, 获取到该基本身份信息, 在进行视频信号的叠加处理过程中, 第二会场除了将上述与音频信号对应的文本信息及哑语手势信息和指定的视频信号在指定的视频区域进行叠加之外, 还将指示信息中指定的视频对象的基本身份信息与指定的视频信号一起进行叠加。
步骤 203, 第二会场获取与音频信号对应的文本信息;
而对于音频信号而言, 第二会场为了将与音频信号对应的文本信息叠加在对应的视频信号中, 还需要将音频信号转换为对应的文本信息。具体地, 在本步骤中, 第二会场在接收到音频信号后, 可以在本地对音频信号进行语音识别处理, 以生成与音频信号对应的文本信息。而需要说明的是, 该文本信息的转换操作还可以在第一会场中进行, 第一会场可以在采集到音频信号后, 便在本地端将音频信号进行语音识别, 以生成与音频信号对应的文本信息, 从而在将音频信号发送给第二会场的同时, 将该对应的文本信息一起发送给第二会场; 或者该文本信息还可以由第一会场的会议管理员通过手工输入得到。而实际应用中, 考虑到参会者所持的各种不同语种, 无论该文本信息的获取是在第一会场还是第二会场进行, 在将音频信号进行语音识别转换为文本信息, 或者会议管理员手工输入文本信息时, 均可以选择将音频信号转换为对应不同语种的多种文本信息, 以在显示屏幕中显示各种不同语种的字幕信息。
步骤 204, 第二会场将上述文本信息、哑语手势信息和 /或指示信 息指定的视频对象的基本身份信息与指示信息中指定的视频信号进 行叠加处理;
在获取到了与音频信号对应的文本信息、用于指示音频信号对应 的视频对象的指示信息以及哑语手势信息和 /或基本身份信息后, 在 本步骤中, 第二会场在指示信息的指示下, 可以将上述文本信息、 哑 语手势信息和 /或指示信息中指定的视频对象对应的基本身份信息与 指示信息中指定的视频信号, 在指定的视频区域进行叠加处理, 以将 各种辅助信息叠加在视频信号中与音频信号对应的视频对象的周围。
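上述"在指示信息指定的视频信号的指定区域叠加辅助信息"的步骤, 可以用如下示意代码表达(非专利原文实现, 指示信息的数据结构为说明性假设):

```python
def apply_indication(video_signals, indication, aux_content):
    """video_signals: 第一会场发来的多路视频信号(此处仅用其数量);
    indication: {'signal': 与当前音频信号对应的视频信号序号,
                 'region': 发言者在该视频信号图像中的区域 (x, y, w, h)};
    aux_content: 待叠加的辅助信息(如字幕文本)。
    返回每路视频信号对应的叠加项列表, 仅指定的信号在指定区域叠加辅助信息。"""
    overlays = {i: [] for i in range(len(video_signals))}
    overlays[indication['signal']].append(
        {'region': indication['region'], 'content': aux_content})
    return overlays
```

当存在多个音频信号及对应的多个指示信息时, 对每个指示信息重复上述过程即可。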
需要说明的是, 若在上述步骤 200中, 第二会场从第一会场获取 到的是多个音频信号, 且在上述步骤 201中, 第二会场从第一会场获 取到的是分别与多个音频信号对应的多个指示信息,在本步骤 204中, 第二会场对视频信号进行叠加处理时,还应当分别根据与各音频信号 对应的各指示信息,将各音频信号对应的文本信息分别与各自的指示 信息中指定的视频信号, 在指定的视频位置进行叠加处理, 优选地, 在叠加文本信息的同时, 还可以在视频信号中叠加哑语手势信息和 / 或对应的参会者的基本身份信息。
步骤 205, 第二会场将除指定的视频信号之外的其他视频信号分 别与对应的基本身份信息进行叠加处理;
进一步优选地,对于除指示信息中指定的视频对象外的其他视频 对象而言, 本发明实施例中, 第二会场除了将音频信号对应的文本信 息、 *语手势信息等辅助信息与对应的视频信号在指定的视频区域进 行叠加处理的同时, 若在上述步骤 202中, 第二会场还从第一会场获 取了与各视频信号对应的、 第一会场的各名参会者的基本身份信息, 在本步骤中,第二会场还可以将除指示信息指定的视频对象外的其他 的视频对象分别对应的基本身份信息分别与这些视频对象所在的视 频信号, 在相应的视频区域进行叠加处理, 从而当这些叠加后的视频 信号在第二会场中对应的显示屏幕上显示时,第二会场的参会者还能 够在显示屏幕上显示的第一会场的所有参会者的图像附近看到这些 参会者各自的基本信息。
步骤 206, 第二会场将叠加处理后的视频信号分别在对应的显示屏幕上进行显示;
当将各辅助信息与对应的视频信号进行了叠加处理之后, 第二会场将处理后的各视频信号分别在对应的显示屏幕上进行显示, 由于第二会场对视频信号的叠加操作在指示信息的指示下进行, 因而与音频信号对应的字幕文本信息、哑语手势信息及各参会者的基本身份信息能够被准确地叠加在与第一会场中当前处于发言状态的参会者对应的视频位置中, 从而保证了当叠加后的视频信号被显示在第二会场的显示屏幕中时, 当前处于发言状态的参会者所对应的图像与各辅助信息的显示方位是完全一致的。
从而对于第二会场中的各名参会者而言, 第二会场中的各名参会者既可以在显示屏幕上看到对端会场中正在发言的参会者的图像, 还可以在该图像周围看到对应该参会者的说话内容的字幕信息以及该名参会者的基本身份信息, 进一步地, 当第二会场的参会者中有聋哑人时, 该聋哑人参会者还能够在显示屏幕上直接看到与说话内容对应的哑语手势, 极大地方便了第二会场与第一会场的参会者之间的交流沟通。
同时需要说明的是,对于视频会议系统的会场间点对点连接方式, 虽然本实施例的上述步骤描述的对各种辅助信息以及指示信息的获 取, 以及对视频信号的叠加处理均在第二会场中进行, 但是在实际应 用中, 这些步骤也可以在第一会场进行, 即第一会场获取指示信息, 在指示信息的指示下对视频信号与各辅助信息进行叠加处理之后,再 将叠加处理后的视频信号直接发送给第二会场,而对应这种实现方式 而言, 第二会场无需再对接收到的视频信号进行任何叠加处理操作, 直接将接收到的视频信号在显示屏幕上进行显示,也能够得到本实施 例所描述的上述效果。
图 4为本发明实施例中叠加后的视频信号在多屏幕上的显示效果示意图。以第一会场中有 4名参会者为例, 当指示信息中指示与音频信号所对应的视频信号为序号为 2的视频信号时, 在本实施例的上述步骤 204中, 第二会场将音频信号所生成的文本信息、该文本信息转换而成的哑语手势信息、序号为 2的视频信号所对应的参会者的基本身份信息与序号为 2的视频信号进行叠加, 最终经处理后的各视频信号将分别被显示在第二会场的多个屏幕上。如图 4所示, 在第二会场的多个显示屏幕上, 优选地, 与音频信号对应的文本信息可以显示在相应的图像下方, 基本身份信息可以显示在相应图像的上方, 而哑语手势信息则可以显示在相应图像的任意一侧, 从而保证了显示的辅助信息与图像的一致性。
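图 4所描述的"字幕在下方、身份信息在上方、哑语手势在一侧"的布局约定, 可以用如下示意代码计算各辅助信息的叠加位置(非专利原文实现, 各偏移量均为假设值):

```python
def layout_auxiliary_info(region, aux):
    """region: 发言者图像在画面中的区域 (x, y, w, h);
    aux: 含 'text'(字幕)、'identity'(身份信息)、'gesture'(哑语手势)
    中任意子集的字典。返回 (信息类型, 内容, (px, py)) 列表。"""
    x, y, w, h = region
    placements = []
    if 'identity' in aux:
        placements.append(('identity', aux['identity'], (x, y - 20)))    # 图像上方
    if 'text' in aux:
        placements.append(('text', aux['text'], (x, y + h + 20)))        # 图像下方
    if 'gesture' in aux:
        placements.append(('gesture', aux['gesture'], (x + w + 20, y)))  # 图像一侧
    return placements
```

具体显示位置可按需调整, 专利中也明确指出其不受限制。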
而需要说明的是, 若在上述步骤 205中, 第二会场还将除指定的 视频信号之外的其他视频信号分别与对应的基本身份信息进行叠加 处理, 在图 4所示的效果示意图中, 其他 3名参会者的显示图像附近 还将显示这 3名参会者各自的基本身份信息。各种辅助信息在显示屏 幕上的具体显示位置可以根据具体需求而定,而本发明实施例并不对 此进行限制。
步骤 207, 第二会场根据与指示信息对应的音源方位信息播放音 频信号。
与此同时,为了进一步保证第二会场播放的对端会场的发言的参 会者的声音与显示的该参会者的图像具有同方位性, 即具有一致性, 在本发明实施例中,第二会场还将根据与指示信息对应的音源方位信 息对音频信号进行处理,以根据音源方位信息播放对端会场传送的音 频信号。
具体地, 若上述步骤 201中, 第二会场获取的指示信息是从音频 信号中提取得到,即根据音频信号提取出与指示信息对应的音源方位 信息,再利用音源方位与视频方位之间的对应关系转换得到指示信息, 在本步骤中,第二会场将直接根据提取出的音源方位信息播放接收到 的音频信号。而若上述步骤 201对指示信息的获取过程中, 第二会场 是根据视频信号中视频对象的唇部运动检测得到上述指示信息,则在 本步骤中, 第二会场还将利用音源方位与视频方位之间的对应关系, 将获取到的指示信息转换为对应的音源方位信息,再根据该音源方位 信息播放接收到的音频信号,以保证第二会场端声音与图像的一致性。
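上述"根据音源方位信息播放音频信号"的处理, 以双扬声器为例, 可以用恒功率声像(equal-power panning)的示意代码说明(非专利原文实现, 最大偏角等参数均为假设):

```python
import math

def pan_gains_from_azimuth(azimuth_deg, max_angle=30.0):
    """根据音源方位角(左负右正, max_angle 为假设的最大偏角)
    计算左右扬声器的恒功率声像增益 (left_gain, right_gain),
    使播放出的声音方位与画面中发言者的方位保持一致。"""
    a = max(-max_angle, min(max_angle, azimuth_deg))   # 限幅
    theta = (a / max_angle + 1.0) * math.pi / 4.0      # 映射到 [0, pi/2]
    return math.cos(theta), math.sin(theta)
```

音源居中时左右增益相等, 音源偏向一侧时对应扬声器的增益增大。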
此外还需要说明的是,若第一会场发送的音频信号为多声道信号, 由于多声道信号本身便包含有音源方位信息,因而在对多声道信号进 行播放时,第二会场直接在会场端采用对应数目的多个扬声器对多声 道信号进行播放, 便能够使播放出的声音具有方位感, 因而在本步骤 中, 若对应此种情况时, 则无需根据音源方位信息对音频信号进行额 外的处理,只需直接采用对应数目的多个扬声器将多声道信号进行播 放即可。
本实施例的视频信号的辅助信息叠加方法,应用在多视频图像会 议系统中,通过在第二会场将音频信号对应的文本信息与视频信号进 行叠加处理之前,获取用于指示当前音频信号在视频信号中所对应的 视频对象所处的视频区域的指示信息,并在对视频信号进行叠加处理 时,根据该指示信息将当前音频信号对应的文本信息在该音频信号对 应的视频对象所处的视频区域中与视频信号进行叠加处理,从而使得 当将经叠加处理后的视频信号被显示在对应会场终端的显示屏幕上 时,能够保证与音频信号对应的文本信息显示在相对应的视频对象的 图像周围, 保证了图像与字幕的显示方位的一致性。
进一步地, 本实施例中, 还通过在第二会场对对端会场发送的视 频信号与音频信号对应的文本信息进行叠加处理之前,对音频信号对 应的哑语手势信息和 /或对端会场的各参会者的基本身份信息进行获 取,在对文本信息与当前处于发言状态的参会者对应的视频信号进行 叠加处理的同时, 将该哑语手势信息和 /或各基本身份信息与对应的 视频对象进行叠加,从而不仅实现了在显示端会场的显示屏幕的相应 位置显示对端会场的各名参会者的基本信息,还实现了在相应位置显 示对端会场的发言者的说话内容一致的哑语手势,进一步地方便了参 会者之间的沟通。
图 5为本发明实施例提供的视频信号的辅助信息叠加方法实施 例三的流程图。 本实施例的方法以 MCU的点对多连接方式为例, 对 设置在 MCU中的信息叠加装置如何对音频信号、 视频信号以及指示 信息进行获取, 以及如何对视频信号进行辅助信息的叠加处理, 并将 叠加处理后的视频信号发送给所需会场进行显示的具体流程进行了 说明。 以图 3所示的会场示意图为例, 图 3所示的显示屏幕的三个部 分可以分别显示来自 3个不同会场的参会者的图像信息,即在此次视 频会议中, 同时有 4个会场参加了此次会议。
如图 5所示, 本实施例的方法主要包括如下步骤:
步骤 300, 第一会场将采集到的音频信号和包含多个视频对象的 至少一个视频信号发送给 MCU;
在本实施例中, 第一会场可以通过 MCU与其他的多个会场之间 进行通信连接, 因而在 MCU接收到多个会场间的连接请求, 并建立 了多个会场之间的连接后,对于建立了连接关系的多个会场中的任一 会场而言, MCU在接收到该会场发送的音频信号和视频信号后, 均 可以将接收到的音频信号和视频信号发送给与其建立了连接关系的 其他会场, 且在本实施例中, 对视频信号与辅助信息的叠加处理也可 以由 MCU执行。 具体地, 与上一实施例相同, 在本实施例中, MCU 接收到的第一会场发送的视频信号中包含了第一会场的多个视频对 象。
步骤 301, MCU获取用于指示视频信号中与音频信号对应的视 频对象所处的视频区域的指示信息;
MCU同样可以从接收到的音频信号和视频信号中提取出该指示 信息, 或者由第一会场提取出该指示信息, 以将该指示信息直接发送 给 MCU, 而 MCU或第一会场从音频信号或视频信号中提取出该指 示信息的实现方法具体可以参见上一实施例的描述。
MCU接收到的视频信号同样可以为一个或者多个, 当视频信号 为一个时,该指示信息仅用于指示与该视频信号中与当前音频信号对 应的视频对象所处的视频位置,即第一会场中当前正处于发言状态的 参会者在视频信号中所处的图像位置; 而当视频信号为多个时, 该指 示信息除了用于指示与音频信号对应的视频对象、即第一会场中当前 正处于发言状态的参会者在对应的视频信号中所处的视频位置之外, 还用于指示多个视频信号中当前与该音频信号所对应的视频信号,从 而基于该指示信息, MCU不仅可以得知在第一会场中, 当前处于发 言状态的参会者所对应的视频对象包含在哪个视频信号中,还能够得 知该当前处于发言状态的参会者在该对应的视频信号中所处的具体 图像位置。
需要说明的是, 若在第一会场中某一时刻同时存在多名处于当前发言状态的参会者时, 在本实施例中, 上述步骤 300中第一会场发送的音频信号还可以为多个。此时, 在本步骤 301中, MCU获取到的指示信息同样也应该相应的为与音频信号数量相等的多个, 且每个指示信息分别用于指示对应的音频信号所对应的视频信号, 以及对应的音频信号所对应的视频对象各自在对应的视频信号中的视频区域。
步骤 302, MCU获取与音频信号对应的哑语手势信息和 /或与第 一会场的各参会者对应的基本身份信息;
优选地, 为了进一步方便各会场的参会者之间的沟通, 以及考虑 到对端会场中有聋哑人参会的场景时, 在 MCU对音频信号对应的文 本信息与对应的视频信号进行叠加处理之前, 该 MCU还可以获取与 当前音频信号对应的哑语手势信息,以及与第一会场的各名参会者、 即视频信号中包含的各视频对象对应的基本身份信息,该基本身份信 息具体为文本格式,而其中包括的内容具体可以为各名参会者的姓名、 职务等相关的基本信息。 而具体地, MCU获取该哑语手势信息以及 参会者的基本身份信息的方法同样可以参见上一实施例中对第二会 场执行相应步骤时的描述。
步骤 303, MCU获取与音频信号对应的文本信息;
在本步骤中, MCU在接收到音频信号后, 可以在本地对音频信号进行语音识别处理, 以生成与音频信号对应的文本信息; 或者, 该文本信息的转换操作还可以在第一会场中进行, 第一会场可以在采集到音频信号后, 便在本地端将音频信号进行语音识别, 以生成与音频信号对应的文本信息, 从而在将音频信号发送给 MCU的同时, 将该对应的文本信息一起发送给 MCU; 或者该文本信息还可以由第一会场的会议管理员通过手工输入得到。而实际应用中, 考虑到参会者所持的各种不同语种, 无论该文本信息的获取是在第一会场还是 MCU进行, 在将音频信号进行语音识别转换为文本信息, 或者会议管理员手工输入文本信息时, 均可以选择将音频信号转换为对应不同语种的多种文本信息, 以在显示屏幕中显示各种不同语种的字幕信息。
步骤 304, MCU将上述文本信息、 哑语手势信息和 /或指示信息 指定的视频对象的基本身份信息与指示信息中指定的视频信号进行 叠加处理;
在获取到了与音频信号对应的文本信息、用于指示音频信号对应 的视频对象的指示信息以及哑语手势信息和 /或基本身份信息后, 在 本步骤中, MCU在指示信息的指示下, 将上述文本信息、 哑语手势 信息和 /或指示信息中指定的视频对象对应的基本身份信息与指示信 息中指定的视频信号, 在指定的视频区域进行叠加处理, 以将各种辅 助信息叠加在视频信号中与音频信号对应的视频对象的周围。
需要说明的是, 若在上述步骤 300中, MCU从第一会场获取到 的是多个音频信号, 且在上述步骤 301中, MCU从第一会场获取到 的是分别与多个音频信号对应的多个指示信息, 在本步骤 304中, MCU对视频信号进行叠加处理时, 还应当分别根据与各音频信号对 应的各指示信息,将各音频信号对应的文本信息分别与各自的指示信 息中指定的视频信号, 在指定的视频位置进行叠加处理, 优选地, 在 叠加文本信息的同时, 还可以在视频信号中叠加哑语手势信息和 /或 对应的参会者的基本身份信息。
步骤 305, MCU将除指定的视频信号之外的其他视频信号分别 与对应的基本身份信息进行叠加处理;
步骤 306, MCU将音频信号及经处理后的视频信号发送给与第 一会场连接的多个第二会场;
步骤 307, 第二会场将叠加处理后的视频信号分别在对应的显示 屏幕上进行显示;
MCU在将获取到的各辅助信息与对应的视频信号进行了叠加处 理之后,将音频信号以及经处理后的各视频信号发送给与第一会场建 立了通信连接的多个第二会场。从而对于各第二会场而言, 由于第二 会场接收到的各视频信号中已经叠加了各类辅助信息,因而第二会场 无需对接收到的视频信号进行额外的处理,而是可以直接将接收到的 多个视频信号在各自对应的显示屏幕上进行显示,而在显示的所有辅 助信息中, 各辅助信息均与对应的图像的方位保持一致。
同时需要说明的是,对于视频会议系统的 MCU点对多连接方式, 虽然本实施例的上述步骤描述的对各种辅助信息以及指示信息的获 取, 以及对视频信号的叠加处理均在 MCU中进行, 但是在实际应用 中, 这些步骤也可以在第一会场或者通过 MCU与该第一会场建立了 连接的多个第二会场中进行, 即第一会场获取指示信息, 在指示信息 的指示下对视频信号与各辅助信息进行叠加处理之后, 再通过 MCU 将叠加处理后的视频信号发送给多个第二会场;或者第二会场接收到 未经处理的音频信号和视频信号后, 获取指示信息和辅助信息, 以对 视频信号和辅助信息进行叠加处理。而无论对应于哪种实现方式, 第 二会场在将经叠加处理后的视频信号进行显示后,也能够得到本实施 例所描述的上述效果。
步骤 308, 第二会场根据与指示信息对应的音源方位信息播放音 频信号。
进一步地,为了保证第二会场播放的对端会场的发言的参会者的 声音与显示的该参会者的图像具有同方位性, 即具有一致性, 在本发 明实施例中,第二会场还将根据与指示信息对应的音源方位信息对音 频信号进行处理,以根据音源方位信息播放对端会场传送的音频信号。
本实施例的视频信号的辅助信息叠加方法,应用在多视频图像会 议系统中, 通过在 MCU将音频信号对应的文本信息与视频信号进行 叠加处理之前,获取用于指示当前音频信号在视频信号中所对应的视 频对象所处的视频区域的指示信息,并在对视频信号进行叠加处理时, 根据该指示信息将当前音频信号对应的文本信息在该音频信号对应 的视频对象所处的视频区域中与视频信号进行叠加处理,从而使得当 将经叠加处理后的视频信号被显示在对应会场终端的显示屏幕上时, 能够保证与音频信号对应的文本信息显示在相对应的视频对象的图 像周围, 保证了图像与字幕的显示方位的一致性。
进一步地, 本实施例中, 还通过在 MCU对对端会场发送的视频 信号与音频信号对应的文本信息进行叠加处理之前,对音频信号对应 的哑语手势信息和 /或对端会场的各参会者的基本身份信息进行获取, 在对文本信息与当前处于发言状态的参会者对应的视频信号进行叠 加处理的同时, 将该哑语手势信息和 /或各基本身份信息与对应的视 频对象进行叠加,从而不仅实现了在显示端会场的显示屏幕的相应位 置显示对端会场的各名参会者的基本信息,还实现了在相应位置显示 对端会场的发言者的说话内容一致的哑语手势,进一步地方便了参会 者之间的沟通。
本领域普通技术人员可以理解: 实现上述方法实施例的全部或部 分步骤可以通过程序指令相关的硬件来完成, 前述的程序可以存储于 一计算机可读取存储介质中, 该程序在执行时, 执行包括上述方法实 施例的步骤; 而前述的存储介质包括: ROM, RAM,磁碟或者光盘等 各种可以存储程序代码的介质。
图 6 为本发明实施例提供的视频信号的辅助信息叠加装置实施 例一的结构示意图。如图 6所示, 本实施例的视频信号的辅助信息叠 加装置至少包括: 信号获取模块 11、 指示信息获取模块 12和信号 叠加模块 13。
其中, 信号获取模块 11用于获取第一会场的音频信号及第一会场的至少一个视频信号, 该至少一个视频信号包含第一会场中的多个视频对象; 指示信息获取模块 12用于获取指示信息, 该指示信息用于指示在信号获取模块 11获取到的至少一个视频信号的多个视频对象中、与获取到的音频信号对应的视频对象所处的视频区域; 信号叠加模块 13则用于根据指示信息获取模块 12获取到的指示信息将与第一会场的音频信号对应的文本信息与信号获取模块 11获取到的视频信号进行叠加处理, 以使文本信息在指示信息所指示的视频区域中显示。
具体地, 本实施例的视频信号的辅助信息叠加装置可以设置在会场终端或者设置在 MCU中。若设置在会场终端, 本实施例的装置可以设置在第一会场中, 在第一会场向第二会场发送音频信号及视频信号之前, 对视频信号进行相应的信息叠加处理, 或者本实施例的装置还可以设置在第二会场中, 在第二会场接收到第一会场发送的音频信号及视频信号之后, 对视频信号进行相应的信息叠加处理; 而若本实施例的装置设置在 MCU中, 本实施例的装置则可以在接收到任一会场发送的音频信号及视频信号之后, 对其中的视频信号进行相应的信息叠加处理。
具体地, 本实施例中的上述所有模块所涉及的具体工作过程, 可 以参考上述视频信号的辅助信息叠加方法所涉及的相关实施例揭露 的相关内容, 在此不再赘述。
本实施例的视频信号的辅助信息叠加装置,应用在多图像的视频 会议应用场景中,通过在将音频信号对应的文本信息与视频信号进行 叠加处理之前,获取用于指示当前音频信号在视频信号中所对应的视 频对象所处的视频区域的指示信息,并在对视频信号进行叠加处理时, 根据该指示信息将当前音频信号对应的文本信息在该音频信号对应 的视频对象所处的视频区域中与视频信号进行叠加处理,从而使得当 将经叠加处理后的视频信号被显示在对应会场终端的显示屏幕上时, 能够保证与音频信号对应的文本信息显示在相对应的视频对象的图 像周围, 保证了图像与字幕的显示方位的一致性。
图 7 为本发明实施例提供的视频信号的辅助信息叠加装置实施 例二的结构示意图。 如图 7所示, 在上一实施例的基础上, 本实施例 的视频信号的辅助信息叠加装置中, 上述信号获取模块 11获取到的 视频信号可以为一个或者多个。 当信号获取模块 11获取到的视频信 号为多个时, 指示信息获取模块 12获取的指示信息所指示的视频区 域为与音频信号对应的视频对象在第一视频信号所对应的视频中所 处的视频位置, 该第一视频信号为多个视频信号中, 与音频信号所对 应的视频信号。 而若信号获取模块 11获取到的视频信号为一个时, 指示信息获取模块 12获取的指示信息所指示的视频区域则为第一会 场的视频信号中、 与所述音频信号对应的视频对象的视频位置。
上述指示信息获取模块 12至少可以包括以下任一的子模块: 第一信息获取子模块 121或者第二信息获取子模块 122。
其中, 第一信息获取子模块 121用于若第一会场的音频信号为多声道信号, 确定该多声道信号中能量最大的声道信号所对应的方位为音频信号对应的视频对象的音源方位, 以生成音频信号的音源方位信息, 并利用音源方位与视频方位之间的对应关系, 将该音源方位信息转换为上述用于指示音频信号的对应的视频对象所处的视频区域的指示信息; 而第二信息获取子模块 122则用于分别对第一会场的视频信号中的参会者的唇部运动进行检测, 确定唇部有开合运动的参会者为与音频信号对应的视频对象, 并确定该视频对象所处的视频区域的指示信息。
进一步地, 在本实施例中, 视频信号的辅助信息叠加装置还可以 包括辅助信息获取模块 14。 该辅助信息获取模块 14用于在信号叠加 模块 13根据指示信息将与第一会场的音频信号对应的文本信息与视 频信号进行叠加处理之前, 获取与音频信号对应的哑语手势信息和 / 或第一会场中的各参会者的基本身份信息。相对应地, 本实施例中的 信号叠加模块 13还可以用于:将辅助信息获取模块 14获取到的哑语 手势信息和 /或与第一会场中的各参会者的基本身份信息与所述视频 信号进行叠加处理, 以使该哑语手势信息和 /或视频对象的基本身份 信息在指示信息所指示的视频区域中显示。
更进一步地, 本实施例的视频信号的辅助信息叠加装置中, 还可 以包括信号显示模块 15。 具体地, 该信号显示模块 15用于在信号叠 加模块 13根据指示信息将与第一会场的音频信号对应的文本信息与 视频信号进行叠加处理之后,将经叠加处理后的视频信号在对应的显 示屏幕上进行显示。
更进一步地, 本实施例的视频信号的辅助信息叠加装置中, 还可 以包括第一信号播放模块 161或第二信号播放模块 162中的任一模块。 其中,第一信号播放模块 161用于当指示信息是根据音频信号的声源 方位信息、利用音源方位与视频方位之间的对应关系转换得到时, 在 信号叠加模块 13根据指示信息将与第一会场的音频信号对应的文本 信息与视频信号进行叠加处理之后,根据音频信号的声源方位信息播 放音频信号;而第二信号播放模块 162则用于当指示信息是根据唇部 运动检测得到时, 在信号叠加模块 13根据指示信息将与第一会场的 音频信号对应的文本信息与视频信号进行叠加处理之后,利用音源方 位与视频方位之间的对应关系, 获取音频信号的音源方位信息, 并根 据音频信号的声源方位信息播放所述音频信号。
具体地, 本实施例中的上述所有模块所涉及的具体工作过程, 同 样可以参考上述视频信号的辅助信息叠加方法所涉及的相关实施例 揭露的相关内容, 在此不再赘述。
本实施例的视频信号的辅助信息叠加装置,应用在多图像的视频 会议应用场景中,通过在将音频信号对应的文本信息与视频信号进行 叠加处理之前,获取用于指示当前音频信号在视频信号中所对应的视 频对象所处的视频区域的指示信息,并在对视频信号进行叠加处理时, 根据该指示信息将当前音频信号对应的文本信息在该音频信号对应 的视频对象所处的视频区域中与视频信号进行叠加处理,从而使得当 将经叠加处理后的视频信号被显示在对应会场终端的显示屏幕上时, 能够保证与音频信号对应的文本信息显示在相对应的视频对象的图 像周围, 保证了图像与字幕的显示方位的一致性。
进一步地, 本实施例中, 还通过在本端会场对对端会场发送的多 个视频信号进行显示之前,对音频信号对应的哑语手势信息以及与各 视频信号对应的对端会场的各参会者的基本身份信息进行获取,在对 文本信息与当前处于发言状态的参会者对应的视频信号进行叠加处 理的同时,将该哑语手势信息以及各基本身份信息叠加对应的视频信 号中,从而不仅实现了在本端会场的显示屏幕的相应位置显示对端会 场的各名参会者的基本信息,还实现了在相应位置显示对端会场的发 言者的说话内容的哑语手势,进一步地方便了参会者之间的顺利沟通。
最后应说明的是: 以上实施例仅用以说明本发明的技术方案, 而 非对其限制; 尽管参照前述实施例对本发明进行了详细的说明, 本领 域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技 术方案进行修改, 或者对其中部分技术特征进行等同替换; 而这些修 改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方 案的精神和范围。

Claims

权 利 要 求 书
1、 一种视频信号的辅助信息叠加方法, 其特征在于, 包括: 获取第一会场的音频信号及第一会场的至少一个视频信号,所述 至少一个视频信号包含所述第一会场中的多个视频对象;
获取指示信息,所述指示信息用于指示在所述至少一个视频信号 的多个视频对象中、与所述音频信号对应的视频对象所处的视频区域; 根据所述指示信息将与所述第一会场的音频信号对应的文本信 息与所述视频信号进行叠加处理,以使所述文本信息在所述指示信息 所指示的视频区域中显示。
2、 根据权利要求 1所述的方法, 其特征在于:
当所述视频信号为多个视频信号时,所述指示信息所指示的视频 区域为与所述音频信号对应的视频对象在第一视频信号所对应的视 频中所处的视频位置, 所述第一视频信号为所述多个视频信号中、与 所述音频信号对应的视频信号; 或者,
当所述视频信号为一个时,所述指示信息所指示的视频区域为所 述第一会场的视频信号中、与所述音频信号对应的视频对象的视频位 置。
3、 根据权利要求 1或 2所述的方法, 其特征在于, 所述指示信 息通过如下方式获得:
若所述第一会场的音频信号为多声道信号,确定所述多声道信号 中能量最大的声道信号所对应的方位为所述音频信号对应的视频对 象的音源方位, 以生成所述音频信号的音源方位信息, 并利用音源方 位与视频方位之间的对应关系,将所述音源方位信息转换为用于指示 所述音频信号的对应的视频对象所处的视频区域的指示信息;
或者,分别对所述第一会场的视频信号中的参会者的唇部运动进 行检测, 确定唇部有开合运动的参会者为所述视频对象, 并确定所述 音频信号对应的视频对象所处的视频区域的指示信息。
4、 根据权利要求 1或 2所述的方法, 其特征在于:
所述根据所述指示信息将与所述第一会场的音频信号对应的文 本信息与所述视频信号进行叠加处理之前, 所述方法还包括: 获取与 所述音频信号对应的哑语手势信息和 /或所述第一会场中的各所述视 频对象的基本身份信息;
所述根据所述指示信息将与所述第一会场的音频信号对应的文 本信息与所述视频信号进行叠加处理还包括:将所述哑语手势信息和 /或所述指示信息所指示的视频对象的基本身份信息与所述视频信号 进行叠加处理, 以使所述哑语手势信息和 /或所述视频对象的基本身 份信息在所述指示信息所指示的视频区域中显示。
5、 根据权利要求 1或 2所述的方法, 其特征在于, 所述根据所 述指示信息将与所述第一会场的音频信号对应的文本信息与所述视 频信号进行叠加处理之后, 所述方法还包括:
将所述经叠加处理后的视频信号在对应的显示屏幕上进行显示。
6、 根据权利要求 3所述的方法, 其特征在于, 所述根据所述指 示信息将与所述第一会场的音频信号对应的文本信息与所述视频信 号进行叠加处理之后, 所述方法还包括: 当所述指示信息是根据所述音频信号的声源方位信息、利用音源 方位与视频方位之间的对应关系转换得到时,根据所述音频信号的声 源方位信息播放所述音频信号;
当所述指示信息是根据所述唇部运动检测得到时, 利用所述音源方位与视频方位之间的对应关系, 获取所述音频信号的音源方位信息, 并根据所述音频信号的声源方位信息播放所述音频信号。
7、 一种视频信号的辅助信息叠加装置, 其特征在于, 包括: 信号获取模块,用于获取第一会场的音频信号及第一会场的至少 一个视频信号,所述至少一个视频信号包含所述第一会场中的多个视 频对象;
指示信息获取模块, 用于获取指示信息, 所述指示信息用于指示 在所述至少一个视频信号的多个视频对象中、与所述音频信号对应的 视频对象所处的视频区域;
信号叠加模块,用于根据所述指示信息将与所述第一会场的音频 信号对应的文本信息与所述视频信号进行叠加处理,以使所述文本信 息在所述指示信息所指示的视频区域中显示。
8、 根据权利要求 7所述的装置, 其特征在于:
当所述视频信号为多个时,所述指示信息获取模块获取的指示信 息所指示的视频区域为所述音频信号对应的视频对象在第一视频信 号所对应的视频中所处的视频位置,所述第一视频信号为所述多个视 频信号中、 与所述音频信号所对应的视频信号; 或者,
当所述视频信号为一个时,所述指示信息获取模块获取的指示信 息所指示的视频区域为所述第一会场的视频信号中、与所述音频信号 对应的视频对象的视频位置。
9、 根据权利要求 7或 8所述的装置, 其特征在于, 所述指示信 息获取模块包括:
第一信息获取子模块,用于若所述第一会场的音频信号为多声道 信号,确定所述多声道信号中能量最大的声道信号所对应的方位为所 述音频信号对应的视频对象的音源方位,以生成所述音频信号的音源 方位信息, 并利用音源方位与视频方位之间的对应关系, 将所述音源 方位信息转换为用于指示所述音频信号的对应的视频对象所处的视 频区域的指示信息;
第二信息获取子模块,用于分别对所述第一会场的视频信号中的 参会者的唇部运动进行检测,确定唇部有开合运动的参会者为所述音 频信号对应的视频对象,并确定所述视频对象所处的视频区域的指示 信息。
10、 根据权利要求 7或 8所述的装置, 其特征在于, 所述装置还包括:
辅助信息获取模块,用于在所述信号叠加模块根据所述指示信息 将与所述第一会场的音频信号对应的文本信息与所述视频信号进行 叠加处理之前, 获取与所述音频信号对应的哑语手势信息和 /或所述 第一会场中的各参会者的基本身份信息;
所述信号叠加模块还用于: 将所述哑语手势信息和 /或与所述第 一会场中的各参会者的基本身份信息与所述视频信号进行叠加处理, 以使所述哑语手势信息和 /或所述视频对象的基本身份信息在所述指 示信息所指示的视频区域中显示。
11、 根据权利要求 7或 8所述的装置, 其特征在于, 所述装置还 包括:
信号显示模块, 用于在所述信号叠加模块根据所述指示信息将与所述第一会场的音频信号对应的文本信息与所述视频信号进行叠加处理之后, 将所述经叠加处理后的视频信号在对应的显示屏幕上进行显示。
12、 根据权利要求 7或 8所述的装置, 其特征在于, 所述装置还 包括:
第一信号播放模块,用于当所述指示信息是根据所述音频信号的 声源方位信息、利用音源方位与视频方位之间的对应关系转换得到时, 在所述信号叠加模块根据所述指示信息将与所述第一会场的音频信 号对应的文本信息与所述视频信号进行叠加处理之后,根据所述音频 信号的声源方位信息播放所述音频信号;
第二信号播放模块,用于当所述指示信息是根据所述唇部运动检 测得到时,在所述信号叠加模块根据所述指示信息将与所述第一会场 的音频信号对应的文本信息与所述视频信号进行叠加处理之后,利用 所述音源方位与视频方位之间的对应关系,获取所述音频信号的音源 方位信息, 并根据所述音频信号的声源方位信息播放所述音频信号。
PCT/CN2011/083005 2010-11-30 2011-11-26 视频信号的辅助信息叠加方法及装置 WO2012072008A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201010574447.1 2010-11-30
CN 201010574447 CN102006453B (zh) 2010-11-30 2010-11-30 视频信号的辅助信息叠加方法及装置

Publications (1)

Publication Number Publication Date
WO2012072008A1 true WO2012072008A1 (zh) 2012-06-07

Family

ID=43813473

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/083005 WO2012072008A1 (zh) 2010-11-30 2011-11-26 视频信号的辅助信息叠加方法及装置

Country Status (2)

Country Link
CN (1) CN102006453B (zh)
WO (1) WO2012072008A1 (zh)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102006453B (zh) * 2010-11-30 2013-08-07 华为终端有限公司 视频信号的辅助信息叠加方法及装置
CN102497555B (zh) * 2011-12-19 2014-04-16 青岛海信网络科技股份有限公司 一种高清编码器的中文支持方法及系统
KR101984823B1 (ko) 2012-04-26 2019-05-31 삼성전자주식회사 웹 페이지에 주석을 부가하는 방법 및 그 디바이스
CN103795938A (zh) * 2012-10-30 2014-05-14 中兴通讯股份有限公司 远程呈现系统中多显示器滚动显示方法及装置、处理终端
CN104104703B (zh) * 2013-04-09 2018-02-13 广州华多网络科技有限公司 多人音视频互动方法、客户端、服务器及系统
CN104301659A (zh) * 2014-10-24 2015-01-21 四川省科本哈根能源科技有限公司 一种多点视频汇聚识别系统
CN105635635A (zh) * 2014-11-19 2016-06-01 杜比实验室特许公司 调节视频会议系统中的空间一致性
CN105677287B (zh) * 2015-12-30 2019-04-26 苏州佳世达电通有限公司 显示装置的控制方法以及主控电子装置
CN109299680A (zh) * 2016-01-20 2019-02-01 杭州虹晟信息科技有限公司 视频网络会议的人物识别方法
CN107124647A (zh) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 一种全景视频录制时自动生成字幕文件的方法及装置
CN108574688B (zh) * 2017-09-18 2021-01-01 视联动力信息技术股份有限公司 一种参会方信息的显示方法和装置
CN108965783B (zh) * 2017-12-27 2020-05-26 视联动力信息技术股份有限公司 一种视频数据处理方法及视联网录播终端
CN108259801A (zh) * 2018-01-19 2018-07-06 广州视源电子科技股份有限公司 音视频数据显示方法、装置、设备及存储介质
CN108366216A (zh) * 2018-02-28 2018-08-03 深圳市爱影互联文化传播有限公司 会议视频录制、记录及传播方法、装置及服务器
CN110324723B (zh) * 2018-03-29 2022-03-08 华为技术有限公司 字幕生成方法及终端
CN109302576B (zh) * 2018-09-05 2020-08-25 视联动力信息技术股份有限公司 会议处理方法和装置
CN109873973B (zh) 2019-04-02 2021-08-27 京东方科技集团股份有限公司 会议终端和会议系统
CN110290341A (zh) * 2019-07-24 2019-09-27 长沙世邦通信技术有限公司 跟随人脸同步显示字幕的可视对讲方法、系统及存储介质
CN111818294A (zh) * 2020-08-03 2020-10-23 上海依图信息技术有限公司 结合音视频的多人会议实时展示的方法、介质和电子设备
WO2022104800A1 (zh) 2020-11-23 2022-05-27 京东方科技集团股份有限公司 一种虚拟名片的发送方法、装置、系统及可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009105303A1 (en) * 2008-02-20 2009-08-27 Microsoft Corporation Techniques to automatically identify participants for a multimedia conference event
CN101710961A (zh) * 2009-12-09 2010-05-19 中兴通讯股份有限公司 电视会议中生成字幕的控制方法及装置
US20100250252A1 (en) * 2009-03-27 2010-09-30 Brother Kogyo Kabushiki Kaisha Conference support device, conference support method, and computer-readable medium storing conference support program
CN102006453A (zh) * 2010-11-30 2011-04-06 华为终端有限公司 视频信号的辅助信息叠加方法及装置


Also Published As

Publication number Publication date
CN102006453B (zh) 2013-08-07
CN102006453A (zh) 2011-04-06


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11844683

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11844683

Country of ref document: EP

Kind code of ref document: A1