Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a first embodiment of a method for superimposing auxiliary information on a video signal according to an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment includes the following steps:
Step 100, acquiring an audio signal of a first conference site and at least one video signal containing a plurality of video objects in the first conference site;
the method for superimposing auxiliary information on a video signal according to the embodiment of the present invention may be applied to a remote presentation conference system or other types of video communication systems in which a plurality of display screens corresponding to a plurality of participants at an opposite terminal conference site are respectively set in the conference site, and each step in the embodiment of the present invention may be executed by an information superimposing apparatus set in the conference site or a server. In these video conference systems, each conference terminal is provided with a plurality of display screens for respectively displaying a plurality of participants at the opposite terminal conference, so that for each conference terminal, after establishing a normal communication connection with the opposite terminal conference, the conference terminal transmits to the opposite terminal conference, at least one video signal including a plurality of participants at the local terminal conference, that is, a plurality of video objects at the local terminal conference, and a current audio signal of the local terminal conference, where the audio signal is generated by a participant currently speaking among the plurality of participants at the local terminal conference, and specifically, the audio signal corresponds to one of the plurality of video objects included in the video signal.
A telepresence conference system may be connected in two ways: a point-to-point connection between two conference sites over a network, or a point-to-multipoint connection between a plurality of conference sites through a Multipoint Control Unit (MCU) arranged between them. In the point-to-point mode, the local conference site and the opposite-end conference site exchange signal data directly; in the point-to-multipoint mode, signals between conference sites pass through the MCU, so the information superimposing apparatus may be arranged either at a conference site terminal or in the MCU. In the embodiment of the present invention, the local conference site that collects the current audio signal and video signal and sends them to the opposite end is referred to as the first conference site. In this step, the information superimposing apparatus that performs the auxiliary information superimposing process receives the audio signal of the first conference site and at least one video signal containing a plurality of video objects, regardless of whether the apparatus is arranged in the first conference site, in the opposite-end conference site, or in the MCU.
Step 101, acquiring indication information, where the indication information indicates, among the plurality of video objects of the at least one video signal, the video area in which the video object corresponding to the acquired audio signal is located;
After receiving the video signal and the audio signal of the first conference site, the information superimposing apparatus acquires this indication information so that the text information corresponding to the audio signal can be accurately superimposed, in subtitle form, on the corresponding video area of the corresponding video signal. In this way, when the superimposed video signal is displayed on a display screen of the opposite-end conference site, the subtitle information appears below the image of the speaking participant, facilitating communication between the participants.
Step 102, superimposing text information corresponding to the acquired audio signal on the video signal according to the indication information, so that the text information is displayed in the video area indicated by the indication information.
After acquiring the indication information indicating the video area in which the video object corresponding to the current audio signal is located, the information superimposing apparatus superimposes the text information corresponding to the audio signal on the video signal in the video area specified by the indication information. The audio signal is thus rendered as text in the video area of the video signal that contains its corresponding video object.
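The placement rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper name `caption_region` and the fixed caption band height are assumptions, and the function only computes where a subtitle band would be drawn so that it sits directly below the indicated video area, clamped to the frame.

```python
def caption_region(video_area, frame_height, caption_height=40):
    """video_area: (x, y, w, h) of the indicated video area (the speaking
    participant's image) within the frame. Returns the (x, y, w, h)
    rectangle where the subtitle is drawn: directly below the indicated
    area, clamped so it never falls off the bottom of the frame."""
    x, y, w, h = video_area
    # Place the caption just under the participant's image ...
    caption_y = y + h
    # ... unless that would leave the frame; then pin it to the bottom edge.
    caption_y = min(caption_y, frame_height - caption_height)
    return (x, caption_y, w, caption_height)

# Speaker occupies a 300x400 area starting at (100, 50) in a 720-line frame.
print(caption_region((100, 50, 300, 400), frame_height=720))  # (100, 450, 300, 40)
```

A real system would then draw the recognized text into this rectangle with its video compositing pipeline; only the geometry is sketched here.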
The auxiliary information superimposing method of this embodiment is applied to multi-image video conference scenarios. Before the text information corresponding to an audio signal is superimposed on the video signal, indication information is acquired that indicates the video area in which the video object corresponding to the current audio signal is located; during superimposition, the text information is placed, according to the indication information, in the video area containing that video object. Consequently, when the superimposed video signal is displayed on the display screen of the corresponding conference site terminal, the text information corresponding to the audio signal is guaranteed to appear around the image of the corresponding video object, ensuring consistency between the display positions of the image and the subtitle.
Fig. 2 is a flowchart of a second embodiment of the method for superimposing auxiliary information on a video signal according to an embodiment of the present invention. Fig. 3 is a schematic diagram of a conference site to which the method is applied. This embodiment takes a point-to-point connection between conference sites as an example: a first conference site and a second conference site establish a communication connection, with the first conference site as the signal acquiring and sending end and the second conference site as the signal receiving and displaying end. The embodiment describes in detail how an information superimposing apparatus arranged at the second conference site acquires the audio signal, the video signal, and the indication information, how it superimposes auxiliary information on the video signal, and how it displays the superimposed video signal. As shown in Fig. 2, the method of this embodiment includes the following steps:
Step 200, the second conference site receives an audio signal sent by the first conference site and at least one video signal containing a plurality of video objects;
In this step, after the first conference site locally acquires the current audio signal and at least one video signal corresponding to the plurality of video objects, it sends the audio signal and the video signal directly to the second conference site over the network. In practical applications, the video signal containing the plurality of video objects of the first conference site may be a single video signal or a plurality of video signals; that is, the correspondence between video signals and video objects may be one-to-one or one-to-many, depending on how the cameras and participants are arranged at the first conference site. For example, with three cameras at the first conference site, three video signals can be obtained; alternatively, a single wide-angle or panoramic camera can capture one video signal containing the complete image of the first conference site.
Step 201, the second conference site acquires indication information indicating the video area in which the video object corresponding to the audio signal is located in the video signal;
After receiving the audio signal and the video signal sent by the first conference site, the second conference site further acquires indication information indicating the video area in which the video object corresponding to the audio signal is located. This indication information is, specifically, image position information indicating where the participant currently speaking at the first conference site appears, and it allows the text information corresponding to the audio signal to be superimposed together with the corresponding video object in the corresponding video signal.
Specifically, since the video signals received by the second conference site in step 200 may be one or several, two cases arise. When there is a single video signal, the indication information only indicates the video position of the video object corresponding to the current audio signal, that is, the image position within that signal of the participant currently speaking at the first conference site. When there are several video signals, the indication information indicates both the video position of the speaking participant within the corresponding video signal and which of the video signals, referred to as the first video signal, currently corresponds to the audio signal. Based on the indication information, the second conference site therefore knows not only which video signal contains the video object of the participant currently speaking, but also the specific image position of that participant within it.
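The two cases above suggest that the indication information carries at most two pieces of data: which video signal contains the current speaker, and where in that signal the speaker appears. A minimal sketch, with field names that are assumptions for illustration only:

```python
from dataclasses import dataclass

@dataclass
class IndicationInfo:
    # Index of the video signal containing the current speaker.
    # With a single video signal this is always 0, and only the
    # video_area field carries information.
    signal_index: int
    # (x, y, w, h) of the video area within that signal where the
    # speaking participant's image is located.
    video_area: tuple

# Speaker appears in the second of several video signals, in a
# 240x320 region whose top-left corner is at (120, 80).
info = IndicationInfo(signal_index=1, video_area=(120, 80, 240, 320))
```

With a structure of this shape, the superimposing apparatus can select the target video signal by `signal_index` and position the auxiliary information relative to `video_area`.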
In this embodiment, the second conference site may acquire the indication information in several ways:
The second conference site may extract the indication information from the received audio and video signals. Specifically, this extraction can be performed in two ways:
In a first manner, the second conference site extracts from the received audio signal sound source direction information indicating the direction of the current audio signal within the first conference site, and then converts this sound source direction information into image position information, that is, the indication information, according to the correspondence between sound source directions and video directions. In practical applications, the audio signal transmitted between conference sites is usually a multi-channel signal. In that case the second conference site compares the energy of each channel of the multi-channel signal and determines the channel with the largest energy; the sound source direction corresponding to that channel is the sound source direction of the current audio signal. Using its stored correspondence between channels and horizontal directions, the second conference site thus determines the direction of the maximum-energy channel as the sound source direction of the current audio signal, and thereby extracts the sound source direction information.
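The energy comparison just described can be sketched as follows. This is a minimal illustration under stated assumptions: the channel layout, the direction labels, and the function name are all hypothetical, and energy is computed as the sum of squared sample amplitudes over one frame.

```python
def estimate_source_direction(channels, channel_directions):
    """channels: list of per-channel sample sequences for one audio frame.
    channel_directions: stored correspondence from channel index to a
    horizontal direction. Returns the direction of the loudest channel,
    taken as the sound source direction of the current audio signal."""
    # Energy of each channel: sum of squared sample amplitudes.
    energies = [sum(s * s for s in ch) for ch in channels]
    loudest = max(range(len(channels)), key=lambda i: energies[i])
    return channel_directions[loudest]

# Example: three channels corresponding to left / centre / right seating.
frame = [
    [0.01, -0.02, 0.01],   # left channel: quiet
    [0.50, -0.60, 0.55],   # centre channel: the active speaker
    [0.02, 0.01, -0.01],   # right channel: quiet
]
directions = {0: "left", 1: "centre", 2: "right"}
print(estimate_source_direction(frame, directions))  # centre
```

The resulting direction would then be mapped to image position information through the stored direction-to-video correspondence mentioned above.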
In a second manner, the second conference site may extract the image position information, that is, the indication information, directly from the one or more video signals sent by the first conference site. Specifically, after receiving the video signals, the second conference site detects the motion state of the lips of each video object, that is, each participant, contained in the images of the video signals, checking whether any participant's lips are opening and closing. If an opening and closing movement of a participant's lips is detected in the image of a certain video signal, that participant can be determined to be the one currently speaking. Using its stored correspondence between video signals and image positions, the second conference site can then determine the image position information of the speaking participant at the first conference site, thereby obtaining the indication information.
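The lip-motion criterion can be sketched as follows. The per-frame lip-opening distances are assumed to come from some face and landmark detector, which is outside the scope of this sketch; sustained frame-to-frame variation of the opening is taken as evidence that the participant is speaking. The threshold value is an illustrative assumption.

```python
def is_speaking(lip_openings, threshold=2.0):
    """lip_openings: per-frame vertical lip distances (in pixels) for one
    participant, as produced by an assumed face/landmark detector.
    Repeated opening and closing of the lips shows up as a large average
    frame-to-frame change in the distance."""
    if len(lip_openings) < 2:
        return False
    deltas = [abs(b - a) for a, b in zip(lip_openings, lip_openings[1:])]
    return sum(deltas) / len(deltas) > threshold

print(is_speaking([4, 12, 3, 14, 5, 11]))  # True  - lips opening and closing
print(is_speaking([5, 5, 6, 5, 5, 6]))     # False - lips essentially still
```

Running this test per participant and per video signal identifies both the signal containing the speaker and, via the stored signal-to-position correspondence, the speaker's image position.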
Furthermore, in the embodiment of the present invention, the second conference site may instead receive the indication information directly from the first conference site; that is, the step of extracting the indication information from the audio signal or video signal is performed at the first conference site. After extracting the indication information from the collected audio or video signal according to the methods described above, the first conference site sends it to the second conference site together with the collected audio signal and video signal. From the indication information, the second conference site then directly knows which video signal corresponds to the currently received audio signal and the video position, within that signal, of the video object corresponding to the audio signal.
It should be noted that, besides extraction from the audio or video signal as described above, the first conference site may determine the indication information locally in other ways. First, if a dedicated microphone is provided for each participant at the first conference site, then when any participant speaks, the microphone assigned to that participant records the sound source direction information of the current audio signal; the first conference site converts this into the corresponding image position information and sends it along with the audio and video signals. Second, if a microphone array is arranged at the first conference site, the array can acquire the audio signal and its direction information simultaneously. Third, even if the first conference site has neither a microphone array nor per-participant microphones, on-site staff may enter the indication information manually, and the first conference site can still send it to the second conference site. There may thus be many methods for determining the video signal corresponding to the local audio signal at the first conference site, and the embodiment of the present invention does not limit this.
It should be noted that, if several participants at the first conference site are speaking at a given moment, the second conference site may also receive several audio signals in step 200. In that case, in step 201, the second conference site acquires as many pieces of indication information as there are audio signals, each indicating the video signal corresponding to its audio signal and the video area of the corresponding video object within that signal.
Step 202, the second conference site acquires sign-language information corresponding to the audio signal and/or basic identity information of each participant at the first conference site;
Preferably, in this embodiment, to further facilitate communication in scenarios where deaf or hearing-impaired participants take part, the second conference site may additionally acquire the sign-language information corresponding to the current audio signal before superimposing the text information on the corresponding video signal. Likewise, the second conference site may acquire basic identity information for each participant, that is, each video object, at the first conference site, so that the sign-language information and the basic identity information are superimposed on the corresponding video signals together with the text information. The basic identity information is in text format and may contain each participant's name, title, and similar basic details.
Specifically, the sign-language information may be obtained by the second conference site converting the received audio signal locally, or by the first conference site converting the collected audio signal and carrying the result along with the audio signal sent to the second conference site. In either case, when sign-language information converted directly from the audio signal is superimposed on the corresponding video signal and shown on the display screen, a virtual signer performing the sign language is displayed. In practical applications, if direct conversion is not adopted, the sign-language information can instead be obtained by arranging, at the first conference site, an interpreter who translates the participants' speech into sign language, together with a dedicated camera for the interpreter, and transmitting the interpreter's video signal to the second conference site over the network. In this case, the interpreter's video signal constitutes the sign-language information corresponding to the current audio signal, and when it is superimposed on the corresponding video signal and displayed, a real signer is shown.
As for the basic identity information of each participant at the first conference site, if the second conference site acquires it before the superimposition processing, then during that processing the second conference site superimposes not only the text information and sign-language information corresponding to the audio signal on the specified video signal in the specified video area, but also the basic identity information of the video object specified in the indication information.
Step 203, the second conference site acquires the text information corresponding to the audio signal;
To superimpose the text information corresponding to the audio signal on the corresponding video signal, the second conference site must convert the audio signal into text. Specifically, after receiving the audio signal, the second conference site may perform speech recognition on it locally to generate the corresponding text information. The conversion may also be performed at the first conference site: after collecting the audio signal, the first conference site performs speech recognition locally and sends the resulting text information together with the audio signal; alternatively, the text information may be entered manually by a conference administrator at the first conference site. In practical applications, considering that the participants may speak different languages, the audio signal may be converted, whether by speech recognition or by manual entry and whether at the first or second conference site, into several pieces of text information in different languages, so that subtitles in different languages can be displayed on the display screens.
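The multi-language conversion step might be organized as below. This is only a sketch of the fan-out: the recognizer functions are stubs standing in for whatever speech-recognition engines (or manual entry) a deployment actually uses, and nothing here is mandated by the method itself.

```python
def audio_to_subtitles(audio_frame, recognizers):
    """recognizers: mapping from language code to a recognition function.
    Each function is a stand-in for a real speech-recognition engine (an
    assumption of this sketch). Returns one piece of text information per
    language, ready to be displayed as subtitles in that language."""
    return {lang: recognize(audio_frame) for lang, recognize in recognizers.items()}

# Stub recognizers for illustration only; a real system would plug in
# actual engines or manually entered text here.
subs = audio_to_subtitles(b"<audio frame>", {
    "en": lambda audio: "Good morning",
    "zh": lambda audio: "早上好",
})
print(subs["en"])  # Good morning
```

Each entry of the returned mapping can then be superimposed, or the display language selected per screen, according to the conference configuration.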
Step 204, the second conference site superimposes the text information, the sign-language information, and/or the basic identity information of the video object specified by the indication information on the video signal specified in the indication information;
After acquiring the text information corresponding to the audio signal, the indication information indicating the corresponding video object, and the sign-language information and/or the basic identity information, the second conference site, guided by the indication information, superimposes the text information, the sign-language information, and/or the basic identity information of the specified video object on the video signal specified in the indication information, within the specified video area. The various pieces of auxiliary information are thereby superimposed around the video object corresponding to the audio signal.
Note that if the second conference site acquired several audio signals in step 200 and, correspondingly, several pieces of indication information in step 201, then in this step 204 the text information corresponding to each audio signal should be superimposed, according to that audio signal's indication information, on the video signal and at the video position specified therein.
Step 205, the second conference site superimposes the corresponding basic identity information on each video signal other than the specified video signal;
Further preferably, in addition to superimposing auxiliary information such as the text information and sign-language information on the specified video signal in the specified video area, the second conference site, having acquired in step 202 the basic identity information of every participant at the first conference site, may in this step also superimpose, for every video object other than the one specified by the indication information, that object's basic identity information on the video signal containing it, in the corresponding video area. Thus, when the superimposed video signals are displayed on the display screens of the second conference site, the participants there can see, near the image of every participant at the first conference site, that participant's basic identity information.
Step 206, the second conference site displays the superimposed video signals on the corresponding display screens;
After the auxiliary information has been superimposed on the corresponding video signals, the second conference site displays the processed video signals on the corresponding display screens. Because the superimposition is carried out under the guidance of the indication information, the subtitle text information, the sign-language information, and the basic identity information corresponding to the audio signal are accurately superimposed at the video position of the participant currently speaking at the first conference site, so that when the superimposed video signals are displayed at the second conference site, the display position of the auxiliary information fully matches the image of the speaking participant.
Each participant at the second conference site can therefore see, on the display screen, the image of the participant speaking at the opposite-end conference site, and, around that image, the subtitle information corresponding to the speech and the speaker's basic identity information.
It should also be noted that, for the point-to-point connection mode between conference sites, although the acquisition of the auxiliary information and indication information and the superimposition processing described in the above steps are all performed at the second conference site, in practical applications these steps may equally be performed at the first conference site: the first conference site acquires the indication information, superimposes the auxiliary information on the video signal under its guidance, and sends the superimposed video signal directly to the second conference site.
Fig. 4 is a schematic diagram of the display effect of the superimposed video signals on multiple screens according to an embodiment of the present invention. Taking four participants at the first conference site as an example, when the indication information indicates that the audio signal corresponds to the video signal numbered 2, then in step 204 of this embodiment the second conference site superimposes on that video signal the text information generated from the audio signal, the sign-language information converted from the text information, and the basic identity information of the corresponding participant, and finally displays the processed video signals on the multiple screens of the second conference site. As shown in Fig. 4, on these screens the text information corresponding to the audio signal is preferably displayed below the corresponding image, the basic identity information above it, and the sign-language information to either side of it, ensuring consistency between the displayed auxiliary information and the image.
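The arrangement described for Fig. 4 can be expressed as a simple layout rule, sketched below. The band height and the choice of the right-hand side for the sign-language window are illustrative assumptions; the method itself leaves the exact positions to specific requirements.

```python
def auxiliary_layout(image_area, frame_w, band=40):
    """image_area: (x, y, w, h) of the speaking participant's image.
    Returns an (x, y, w, h) rectangle for each kind of auxiliary
    information: basic identity information above the image, subtitle
    text below it, and the sign-language window to its right side."""
    x, y, w, h = image_area
    return {
        # Identity band sits above the image, clamped to the frame top.
        "identity": (x, max(0, y - band), w, band),
        # Subtitle band sits directly below the image.
        "subtitle": (x, y + h, w, band),
        # Sign-language window on the right side, clamped to the frame edge.
        "sign_language": (min(x + w, frame_w - band), y, band, h),
    }

layout = auxiliary_layout((200, 100, 300, 400), frame_w=1280)
print(layout["subtitle"])  # (200, 500, 300, 40)
```

The compositor would draw each piece of auxiliary information into its rectangle, keeping every item anchored to the speaker's image.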
Note that if, in step 205, the second conference site also superimposes the corresponding basic identity information on the video signals other than the specified one, then in the effect diagram of Fig. 4 the basic identity information of each of the other three participants is displayed near their respective images. The specific display position of each kind of auxiliary information on the screen may be determined according to specific requirements, and the embodiment of the present invention does not limit this.
Step 207, the second conference site plays the audio signal according to the sound source direction information corresponding to the indication information.
To further ensure that the sound of the speaking participant at the opposite-end conference site, as played at the second conference site, has the same directionality as, and is thus consistent with, the displayed image of that participant, in the embodiment of the present invention the second conference site additionally processes the audio signal according to the sound source direction information corresponding to the indication information, playing the audio signal transmitted from the opposite end according to that direction.
Specifically, if the indication information obtained by the second conference site in step 201 was extracted from the audio signal, the sound source direction information is already available, since the indication information was obtained from it through the correspondence between sound source directions and video directions. If, instead, the second conference site obtained the indication information through lip motion detection on the video objects in the video signal, then in this step it converts the indication information back into the corresponding sound source direction information using that same correspondence, and plays the received audio signal accordingly, thereby ensuring consistency of sound and image at the second conference site.
It should be noted that, if the audio signal sent by the first conference site is a multi-channel signal, the multi-channel signal itself already carries the sound source direction information; when playing it, the second conference site can simply feed the channels to a corresponding number of loudspeakers at its end, so that the played sound has a sense of direction.
The method for superimposing auxiliary information on video signals of this embodiment is applied to a multi-image video conference system. Before the second conference site superimposes the text information corresponding to the audio signal onto the video signal, it acquires indication information indicating the video area, within the video signal, occupied by the video object corresponding to the current audio signal. During the superimposition, the text information corresponding to the current audio signal is superimposed, according to the indication information, onto the video signal in the video area where the corresponding video object is located. As a result, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring that the image and the caption appear in consistent directions.
Further, in this embodiment, before the second conference site superimposes the text information corresponding to the audio signal onto the video signal sent by the opposite-end conference site, it obtains the sign language information corresponding to the audio signal and/or the basic identity information of each participant at the opposite-end site. While the text information is superimposed onto the video signal of the participant currently speaking, the sign language information and/or the basic identity information are superimposed onto the corresponding video objects. In this way, not only is the basic information of each opposite-end participant displayed at the corresponding position of the local display screen, but sign language gestures consistent with the speaker's speech content are also displayed at the corresponding position, which further facilitates communication between the participants.
Fig. 5 is a flowchart of a third embodiment of the method for superimposing auxiliary information on a video signal according to an embodiment of the present invention. This embodiment takes the point-to-multipoint connection mode of an MCU as an example and describes how an information superimposing apparatus arranged in the MCU acquires the audio signal, the video signal, and the indication information, how it superimposes the auxiliary information onto the video signal, and how it sends the superimposed video signal to the conference sites for display. Taking the conference site shown in fig. 3 as an example, the three display screens shown in fig. 3 can respectively display the images of participants from 3 different conference sites; that is, 4 conference sites participate in the video conference at the same time.
As shown in fig. 5, the method of this embodiment mainly includes the following steps:
Step 300: the first conference site sends the collected audio signal and at least one video signal containing a plurality of video objects to the MCU;
In this embodiment, the first conference site may communicate with the other conference sites through the MCU. After the MCU receives the connection requests and establishes the connections between the sites, then for any connected site, once the MCU receives the audio signal and the video signal sent by that site, it may forward them to all the other connected sites; in this embodiment, the superimposition of the auxiliary information onto the video signal may also be performed by the MCU. Specifically, as in the previous embodiment, the video signal that the MCU receives from the first conference site includes a plurality of video objects of the first conference site.
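The point-to-multipoint forwarding role of the MCU can be sketched as a simple relay: whatever one connected site sends is delivered to every other connected site. This is an illustrative toy model only; the class and method names are assumptions, and a real MCU would of course handle media streams rather than in-memory lists.

```python
class MCU:
    """Toy point-to-multipoint relay: a payload received from one site
    is forwarded to every other connected site."""

    def __init__(self):
        self.sites = {}  # site name -> list of (sender, payload) delivered to it

    def connect(self, name):
        """Register a conference site with an empty inbox."""
        self.sites[name] = []

    def relay(self, source, payload):
        """Forward a payload from `source` to all other connected sites."""
        for name, inbox in self.sites.items():
            if name != source:
                inbox.append((source, payload))
```

Under this model, the first site's audio and video signals reach every second site in one `relay` call, which is where the MCU can also interpose the superimposition step described below.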
Step 301, the MCU acquires indication information for indicating a video area where a video object corresponding to the audio signal is located in the video signal;
the MCU may also extract the indication information from the received audio signal and video signal, or extract the indication information from the first meeting place, so as to directly send the indication information to the MCU, and the implementation method of the MCU or the first meeting place extracting the indication information from the audio signal or the video signal may specifically refer to the description of the above embodiment.
The number of video signals received by the MCU may likewise be one or more. When there is one video signal, the indication information merely indicates the video position, within that signal, of the video object corresponding to the current audio signal, that is, the image position within the video signal of the participant of the first conference site who is currently speaking. When there are a plurality of video signals, the indication information indicates both which of the plurality of video signals currently corresponds to the audio signal and the video position, within that signal, of the video object corresponding to the audio signal. Based on the indication information, the MCU can therefore determine not only which video signal contains the video object of the participant currently speaking at the first conference site, but also the specific image position of that participant within the corresponding video signal.
It should be noted that, if a plurality of participants at the first conference site are speaking at a certain time, the number of audio signals sent by the first conference site in step 300 may also be plural. In that case, in step 301, the MCU should acquire a plurality of pieces of indication information equal in number to the audio signals, each piece indicating the video signal corresponding to its audio signal and the video area, within that video signal, of the video object corresponding to that audio signal.
Step 302: the MCU acquires the sign language information corresponding to the audio signal and/or the basic identity information of each participant at the first conference site;
Preferably, to further facilitate communication between the participants at the conference sites, and considering the scenario in which deaf or mute persons participate at an opposite-end site, before the MCU superimposes the text information corresponding to the audio signal onto the corresponding video signal, the MCU may further obtain the sign language information corresponding to the current audio signal and the basic identity information of each participant at the first conference site, that is, of each video object included in the video signal. The basic identity information is in text format, and its content may be basic information such as the name and title of each participant. For the specific manner in which the MCU acquires the sign language information and the participants' basic identity information, reference may also be made to the description of the corresponding steps performed at the second conference site in the above embodiments.
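Since the basic identity information is text keyed to each video object, it can be pictured as a simple roster lookup. The sketch below is purely illustrative; the roster contents and the caption format are hypothetical, not taken from the patent.

```python
# Hypothetical roster: video object index -> identity fields in text format.
roster = {
    0: {"name": "Li", "title": "Engineer"},
    1: {"name": "Wang", "title": "Manager"},
}


def identity_caption(video_object_id: int) -> str:
    """Render the basic identity information of a video object as the text
    string to be superimposed near that participant's image; unknown
    participants get an empty caption."""
    entry = roster.get(video_object_id)
    return f'{entry["name"]} ({entry["title"]})' if entry else ""
```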
Step 303: the MCU acquires text information corresponding to the audio signal;
In this step, after receiving the audio signal, the MCU may perform speech recognition on it locally to generate the corresponding text information. Alternatively, the conversion may be performed at the first conference site: after collecting the audio signal, the first site performs speech recognition locally to generate the corresponding text information, and sends that text information to the MCU together with the audio signal. The text information may also be entered manually by a conference administrator at the first conference site. In practical applications, since participants may speak different languages, whether the text information is obtained at the first conference site or at the MCU, the audio signal may be converted by speech recognition into a plurality of pieces of text information in different languages, or the conference administrator may choose, when entering the text manually, to provide versions in several languages, so that caption information in different languages can be displayed on the display screens.
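The multi-language caption step can be sketched as a thin wrapper around a pluggable recognition engine. The patent does not prescribe any particular recogniser, so `recognizer` here is a stand-in callable supplied by the caller; the function name and signature are assumptions made for illustration.

```python
def captions_for(audio_frames, recognizer, languages=("en",)):
    """Produce one caption string per requested language.

    `recognizer(frames, lang)` is a placeholder for whatever speech
    recognition engine the first site or the MCU actually runs; it is
    invoked once per target language."""
    return {lang: recognizer(audio_frames, lang) for lang in languages}
```

For example, an MCU-side call might request English and Chinese captions for the same audio frames and superimpose each onto a separate caption line.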
Step 304: the MCU superimposes the text information, the sign language information, and/or the basic identity information of the specified video object onto the video signal specified in the indication information;
After acquiring the text information corresponding to the audio signal, the indication information indicating the video object corresponding to the audio signal, and the sign language information and/or the basic identity information, in this step the MCU, as directed by the indication information, superimposes the text information, the sign language information, and/or the basic identity information of the specified video object onto the video signal specified in the indication information, within the specified video area, so that the various kinds of auxiliary information are superimposed around the video object corresponding to the audio signal.
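One concrete sub-problem in this step is choosing where, around the indicated video area, the caption should land. A minimal geometric sketch, assuming pixel coordinates with the origin at the top-left and hypothetical defaults for caption height and margin, might look like this:

```python
def caption_position(region, frame_size, caption_height=24, margin=4):
    """Place a caption just below the speaker's bounding box; if it would
    fall off the bottom of the frame, place it above the box instead.

    region = (x, y, w, h) in pixels; frame_size = (width, height)."""
    x, y, w, h = region
    frame_w, frame_h = frame_size
    below = y + h + margin
    if below + caption_height <= frame_h:
        return (x, below)
    return (x, max(0, y - margin - caption_height))
```

This keeps the caption adjacent to the corresponding video object, which is exactly the image/caption direction consistency the embodiment is after; a real implementation would also clamp horizontally and account for text width.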
It should be noted that, if the MCU acquired a plurality of audio signals from the first conference site in step 300 and a plurality of corresponding pieces of indication information in step 301, then, following the indication information corresponding to each audio signal, the MCU should superimpose the text information of each audio signal onto the video signal specified by its indication information at the designated video position, preferably superimposing the text information together with the sign language information and/or the basic identity information of the corresponding participant at the same time.
Step 305: the MCU superimposes, on each video signal other than the designated one, the corresponding basic identity information;
Step 306: the MCU sends the audio signals and the processed video signals to the plurality of second conference sites connected with the first conference site;
Step 307: the second conference sites display the superimposed video signals on their corresponding display screens;
After superimposing the acquired auxiliary information onto the corresponding video signals, the MCU sends the audio signals and the processed video signals to the plurality of second conference sites that are in communication connection with the first conference site. For each second conference site, since every received video signal already carries the superimposed auxiliary information, no additional processing is needed: the received video signals can be displayed directly on the corresponding display screens, and every piece of displayed auxiliary information is consistent in orientation with its corresponding image.
Meanwhile, it should be noted that although, in the MCU point-to-multipoint connection mode of the video conference system, the acquisition of the auxiliary information and the indication information and the superimposition onto the video signal described in the above steps are all performed in the MCU, in practical applications these steps may also be performed at the first conference site or at the plurality of second conference sites connected to the first site through the MCU. That is, the first conference site may acquire the indication information and, as directed by it, superimpose the auxiliary information onto the video signal, after which the MCU sends the superimposed video signal to the plurality of second sites; or each second conference site, after receiving the unprocessed audio and video signals, may acquire the indication information and the auxiliary information and perform the superimposition itself. In any of these implementations, the second conference site obtains the effects described in this embodiment once it displays the superimposed video signal.
Step 308: the second conference site plays the audio signal according to the sound source direction information corresponding to the indication information.
Further, to ensure that the sound of the speaking participant at the opposite-end conference site, as played at the second conference site, is consistent in direction with that participant's displayed image, in the embodiment of the present invention the second conference site further processes the audio signal according to the sound source direction information corresponding to the indication information, so that the audio signal transmitted by the opposite-end site is played according to that sound source direction information.
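One common way to realise such directional playback on a two-loudspeaker setup is constant-power panning. The patent does not specify a panning law, so the following is only an illustrative sketch; the function name and the 90-degree default field are assumptions.

```python
import math


def pan_gains(azimuth_deg: float, field_deg: float = 90.0):
    """Constant-power stereo panning: map a sound source azimuth in
    [-field/2, +field/2] degrees (negative = left) to (left, right)
    loudspeaker gains whose squared sum is always 1."""
    half = field_deg / 2.0
    a = max(-half, min(half, azimuth_deg))          # clamp to the field of view
    theta = (a + half) / field_deg * (math.pi / 2)  # 0 .. pi/2 across the field
    return (math.cos(theta), math.sin(theta))
```

Scaling the audio samples by these gains before feeding the two loudspeakers makes the perceived sound direction track the speaker's on-screen position.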
The method for superimposing auxiliary information on video signals of this embodiment is applied to a multi-image video conference system. Before the MCU superimposes the text information corresponding to the audio signal onto the video signal, it acquires indication information indicating the video area, within the video signal, occupied by the video object corresponding to the current audio signal. During the superimposition, the text information corresponding to the current audio signal is superimposed, according to the indication information, onto the video signal in the video area where the corresponding video object is located. As a result, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring that the image and the caption appear in consistent directions.
Further, in this embodiment, before the MCU superimposes the text information corresponding to the audio signal onto the video signal sent by the opposite-end conference site, it obtains the sign language information corresponding to the audio signal and/or the basic identity information of each participant at the opposite-end site. While the text information is superimposed onto the video signal of the participant currently speaking, the sign language information and/or the basic identity information are superimposed onto the corresponding video objects. In this way, not only is the basic information of each opposite-end participant displayed at the corresponding position of the local display screen, but sign language gestures consistent with the speaker's speech content are also displayed at the corresponding position, which further facilitates communication between the participants.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 6 is a schematic structural diagram of a first apparatus for superimposing auxiliary information on a video signal according to an embodiment of the present invention. As shown in fig. 6, the auxiliary information superimposing apparatus of a video signal of the present embodiment includes at least: a signal acquisition module 11, an indication information acquisition module 12 and a signal superposition module 13.
The signal acquiring module 11 is configured to acquire an audio signal of a first conference site and at least one video signal of the first site containing a plurality of video objects. The indication information acquiring module 12 is configured to acquire indication information indicating the video area in which the video object corresponding to the acquired audio signal is located, among the plurality of video objects of the at least one video signal acquired by the signal acquiring module 11. The signal superimposing module 13 is configured to superimpose, according to the indication information acquired by the indication information acquiring module 12, the text information corresponding to the audio signal of the first site onto the video signal acquired by the signal acquiring module 11, so that the text information is displayed in the video area indicated by the indication information.
Specifically, the auxiliary information superimposing apparatus of this embodiment may be provided in a conference terminal or in an MCU. If provided in a conference terminal, the apparatus may be arranged at the first conference site and superimpose the information onto the video signal before the first site sends the audio and video signals to the second site; alternatively, it may be arranged at the second conference site and superimpose the information onto the video signal after the second site receives the audio and video signals sent by the first site. If the apparatus is provided in the MCU, it performs the corresponding information superimposition on the video signal after the MCU receives the audio and video signals sent from any conference site.
Specifically, the specific working processes related to all the modules in this embodiment may refer to the related contents disclosed in the related embodiments related to the method for superimposing auxiliary information on a video signal, and are not described herein again.
The auxiliary information superimposing apparatus of this embodiment is applied to a multi-image video conference scenario. Before superimposing the text information corresponding to the audio signal onto the video signal, the apparatus acquires indication information indicating the video area, within the video signal, occupied by the video object corresponding to the current audio signal. During the superimposition, the text information corresponding to the current audio signal is superimposed, according to the indication information, onto the video signal in the video area where the corresponding video object is located. As a result, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring that the image and the caption appear in consistent directions.
Fig. 7 is a schematic structural diagram of a second apparatus for superimposing auxiliary information on a video signal according to an embodiment of the present invention. As shown in fig. 7, on the basis of the previous embodiment, the signal acquiring module 11 may acquire one or more video signals. When it acquires a plurality of video signals, the video area indicated by the indication information acquired by the indication information acquiring module 12 is the video position of the video object corresponding to the audio signal within a first video signal, the first video signal being the one among the plurality of video signals that corresponds to the audio signal. When the signal acquiring module 11 acquires a single video signal, the indicated video area is the video position, within that signal, of the video object corresponding to the audio signal of the first conference site.
The indication information obtaining module 12 may at least include any one of the following sub-modules: the first information acquisition sub-module 121 or the second information acquisition sub-module 122.
The first information obtaining submodule 121 is configured to, when the audio signal of the first conference site is a multi-channel signal, take the direction corresponding to the channel with the largest energy among the multiple channels as the sound source direction of the video object corresponding to the audio signal, thereby generating the sound source direction information of the audio signal, and to convert that sound source direction information, using the correspondence between the sound source direction and the video direction, into indication information indicating the video area in which the corresponding video object is located. The second information obtaining submodule 122 is configured to detect the lip movements of the participants in the video signal of the first conference site, determine that a participant whose lips are moving is the video object corresponding to the audio signal, and determine the indication information of the video area in which that video object is located.
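The max-energy channel selection performed by the first information obtaining submodule can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: it assumes each channel is a sequence of PCM samples and measures energy as the sum of squared samples.

```python
def dominant_channel(channels):
    """Return the index of the channel with the largest energy (sum of
    squared samples); that channel's direction is then taken as the sound
    source direction of the speaking video object."""
    energies = [sum(s * s for s in ch) for ch in channels]
    return max(range(len(channels)), key=energies.__getitem__)
```

The returned channel index would then be mapped, via the sound-source-direction/video-direction correspondence, to the video area of the corresponding video object.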
Further, in this embodiment, the apparatus may further include an auxiliary information acquiring module 14, configured to acquire the sign language information corresponding to the audio signal and/or the basic identity information of each participant at the first conference site before the signal superimposing module 13 superimposes the text information corresponding to the audio signal onto the video signal according to the indication information. Correspondingly, the signal superimposing module 13 may be further configured to superimpose onto the video signal the sign language information and/or the basic identity information acquired by the auxiliary information acquiring module 14, so that the sign language information and/or the basic identity information of the video object are displayed in the video area indicated by the indication information.
Further, the apparatus of this embodiment may further include a signal display module 15, configured to display the superimposed video signal on the corresponding display screen after the signal superimposing module 13 has superimposed the text information corresponding to the audio signal of the first conference site onto the video signal according to the indication information.
Furthermore, the apparatus of this embodiment may further include either a first signal playing module 161 or a second signal playing module 162. The first signal playing module 161 is configured to, when the indication information was obtained by converting the sound source direction information of the audio signal through the correspondence between the sound source direction and the video direction, play the audio signal according to that sound source direction information after the signal superimposing module 13 has performed the superimposition. The second signal playing module 162 is configured to, when the indication information was obtained through lip movement detection, derive the sound source direction information of the audio signal from the indication information using the correspondence between the sound source direction and the video direction after the superimposition, and play the audio signal according to that sound source direction information.
Specifically, the specific working processes related to all the modules in this embodiment may also refer to the related contents disclosed in the related embodiments related to the method for superimposing auxiliary information on a video signal, and are not described herein again.
The auxiliary information superimposing apparatus of this embodiment is applied to a multi-image video conference scenario. Before superimposing the text information corresponding to the audio signal onto the video signal, the apparatus acquires indication information indicating the video area, within the video signal, occupied by the video object corresponding to the current audio signal. During the superimposition, the text information corresponding to the current audio signal is superimposed, according to the indication information, onto the video signal in the video area where the corresponding video object is located. As a result, when the superimposed video signal is displayed on the display screen of the corresponding site terminal, the text information corresponding to the audio signal is guaranteed to be displayed around the image of the corresponding video object, ensuring that the image and the caption appear in consistent directions.
Further, in this embodiment, before the plurality of video signals sent by the opposite-end conference site are displayed at the local site, the sign language information corresponding to the audio signal and the basic identity information of each opposite-end participant corresponding to each video signal are acquired. While the text information is superimposed onto the video signal of the participant currently speaking, the sign language information and the basic identity information are superimposed onto the corresponding video signals. In this way, not only is the basic information of each opposite-end participant displayed at the corresponding position of the local display screen, but sign language gestures for the speaker's speech content are also displayed at the corresponding position, further facilitating smooth communication between the participants.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.