CN117640861A - Director control method and apparatus - Google Patents

Director control method and apparatus

Info

Publication number: CN117640861A
Authority: CN (China)
Prior art keywords: video, conference terminal, conference, video picture, stream
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202210979362.4A
Other languages: Chinese (zh)
Inventors: 颜国雄, 徐海, 陈显义
Assignee (original and current): Huawei Technologies Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by: Huawei Technologies Co Ltd
Priority applications: CN202210979362.4A; PCT/CN2023/082003 (published as WO2024036945A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/268 Signal distribution or switching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a director control method and apparatus. The method is applied to a first conference terminal located at a first conference site, and includes the following steps: the first conference terminal acquires a media stream; the first conference terminal acquires an identifier of a first object, where the identifier is extracted from the media stream; if the first object is an object in the first conference site and a first video picture does not include the first object, the first conference terminal generates a second video picture that includes the first object, where the first video picture and the second video picture are director pictures of the first conference site. Because the identifier of the first object is extracted from the media stream, when the first object is an object in the first conference site, the first object is an object expected to appear in the director picture of the first conference site. The first conference terminal can then automatically generate the second video picture including the first object, thereby improving the degree of intelligence of director control.

Description

Director control method and apparatus
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to a director control method and apparatus.
Background
With the continuous development of video technology, video technology is increasingly combined with artificial intelligence technology. Intelligent director technology is an important result of this combination and is now widely applied. With intelligent directing, multiple cameras typically shoot the scene at the same time; the multi-angle footage they produce presents the scene more comprehensively. Based on intelligent director technology, the director device can automatically capture pictures of interest and play them.
However, existing intelligent director technology still falls far short of human intelligence in image understanding, dialogue understanding, and the like. As a result, the director device may select video pictures inaccurately. For example, in a conference scenario, when the presenter calls on another person, the director device is expected to switch automatically to the video of the person being called. In practice, the switch is triggered only by an action of that person, for example raising a hand; otherwise the current video picture is kept. Insufficiently intelligent director control is therefore a problem that existing intelligent director technology urgently needs to solve.
Disclosure of Invention
The present application provides a director control method and apparatus for improving the degree of intelligence of director control.
In a first aspect, the present application provides a director control method. The method can be executed by the director control apparatus provided in this application, and the director control apparatus may be the electronic device provided in this application. The method is applied to a first conference terminal located at a first conference site, and includes the following steps:
the first conference terminal acquires a media stream; the first conference terminal acquires an identifier of a first object, where the identifier is extracted from the media stream; if the first object is an object in the first conference site and the first video picture does not include the first object, the first conference terminal generates a second video picture, where the second video picture includes the first object and the first video picture and the second video picture are director pictures of the first conference site.
In this director control method, the identifier of the first object is extracted from the media stream, so when the first object is an object in the first conference site, the first object is an object expected to appear in the director picture of the first conference site. When the first video picture does not include the first object, the first conference terminal of the first conference site can automatically generate a second video picture that includes the first object and supplement the first object into the director picture of the first conference site, thereby improving the degree of intelligence of director control.
The media stream in the director control method is not specifically limited here. In one possible design, the media stream is a video stream and/or a voice stream, and the identifier of the first object is determined according to an image recognition result and/or a voice recognition result, where the image recognition result is obtained by performing image recognition on the video picture corresponding to the video stream, and the voice recognition result is obtained by performing voice recognition on the voice stream. The identifier of the first object can thus be determined from either kind of recognition result, which increases flexibility.
The object in the director control method is not specifically limited here either. The object may be a person or a thing, for example a blackboard, a banner, or a trophy.
As the description of the method shows, generating the second video picture may require certain preconditions, for example that the first object is an object in the first conference site and that the first video picture does not include the first object; other preconditions are also possible. In one possible design, the identifier of the first object is in a stored object information set, where the object information set includes information of a plurality of objects and the information of each object includes the identifier of that object.
The first conference terminal may determine that the identifier of the first object is in the stored object information set in a variety of ways, for example:
In one possible implementation, the first conference terminal sends the media stream to a server; the first conference terminal then receives indication information from the server, where the indication information indicates that the identifier extracted from the media stream is the identifier of the first object and that the identifier of the first object is in the object information set.
In another possible implementation, the first conference terminal itself extracts the identifier of the first object from the media stream and finds, in the object information set, the information of an object that includes the identifier of the first object.
How the objects in the object information set are selected is also not specifically limited. In one possible design, the first conference site is any one of a plurality of conference sites of the conference that the first conference terminal has joined, and the object information set includes information of the objects in those conference sites. For example, when the objects in the object information set are all persons, the set may cover the participants in every conference site of the conference. Other selections are also possible; for example, the set may cover all objects of one organization, such as all employees of a company when the objects are all persons.
In this way, once the identifier of the first object is determined to be in the stored object information set, the first object is confirmed to be an object in that set, which avoids subsequent errors caused by a wrongly extracted identifier and thereby increases the accuracy of identifier extraction. A minimal data-structure sketch is given below.
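As a concrete illustration of the design above, the following Python sketch shows one way an object information set and its lookup could be organized; the class names, fields, and values are assumptions made for the example, not part of the claimed method.
```python
from dataclasses import dataclass

@dataclass
class ObjectInfo:
    """One record of the object information set; the identifier is the key field."""
    identifier: str         # e.g. a participant name such as "Zhang San"
    biometric: bytes = b""  # optional biometric template (face or voiceprint features)
    site_id: str = ""       # conference site the object belongs to

class ObjectInfoSet:
    """Stored object information set covering the objects of the conference sites."""
    def __init__(self, records):
        # Several objects may share one identifier (e.g. two people with the
        # same name), so each identifier maps to a list of records.
        self._by_id = {}
        for rec in records:
            self._by_id.setdefault(rec.identifier, []).append(rec)

    def lookup(self, identifier):
        """Return every stored record whose identifier matches; empty if none."""
        return self._by_id.get(identifier, [])

# The terminal proceeds only when the extracted identifier is in the set,
# which filters out wrongly extracted identifiers.
info_set = ObjectInfoSet([ObjectInfo("Zhang San", site_id="site-1")])
assert info_set.lookup("Zhang San")   # non-empty list: the identifier is valid
```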
In the director control method, the first conference terminal may generate the second video picture in several ways, specifically as follows:
In one possible case, the first conference terminal acquires a video picture of the first conference site that includes all objects of the first conference site, and the second video picture is that video picture of the first conference site.
In another possible case, the first conference terminal acquires a close-up video picture of the first object and generates the second video picture from the close-up video picture. In this way, the first object is displayed more prominently in the second video picture.
The close-up video picture of the first object may contain only the first object, or it may also include other objects, as long as the first object has the largest display area in the close-up video picture.
The first conference terminal may acquire the close-up video picture of the first object in several ways, specifically as follows:
In one possible implementation, the first conference terminal acquires a video picture of the first conference site that includes all objects in the first conference site, and crops the close-up video picture of the first object out of that video picture.
In another possible implementation, the first conference terminal acquires position information of the first object in the first conference site, and acquires the close-up video picture according to the position information.
In this way, the position of the first object can be located from its position information, and the close-up video picture can be acquired more accurately.
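As an illustration of the cropping implementation above, here is a minimal sketch that cuts a close-up out of the full site picture given the object's position information as a bounding box; the margin value and frame sizes are assumptions for the example.
```python
import numpy as np

def crop_close_up(frame, box, margin=0.2):
    """Cut a close-up of one object out of the full conference-site picture.

    frame: H x W x 3 picture of the first conference site.
    box:   (x, y, w, h) position information of the object in pixels.
    A relative margin keeps some context around the object.
    """
    img_h, img_w = frame.shape[:2]
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return frame[y0:y1, x0:x1].copy()

# Example: a 1080p site picture and a hypothetical detected position.
site_picture = np.zeros((1080, 1920, 3), dtype=np.uint8)
close_up = crop_close_up(site_picture, (800, 300, 200, 400))
print(close_up.shape)   # the close-up region, enlarged by the margin
```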
The first conference terminal may acquire the position information of the first object in the first conference site in several ways, specifically as follows:
In one possible design, before the first conference terminal acquires the position information of the first object in the first conference site, the method further includes: the first conference terminal acquires the position information of every object in the first conference site.
The position information of each object in the first conference site can thus be obtained accurately in advance, so that the position information of the first object is available more quickly when it is needed.
In one possible design, the first conference terminal updates the position information of each object in the first conference site according to a preset period. For example, when Zhang San walks from the leftmost corner of the conference room to the rightmost corner, Zhang San's position information is updated.
Because the position information of every object in the first conference site is updated according to the preset period, each object can be tracked over time, which makes the position information of the first object more accurate.
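A minimal sketch of such periodic updating follows; the detector callable stands in for whatever person or object detection the terminal actually runs, and the period and positions are illustrative.
```python
import threading
import time

class PositionTracker:
    """Keeps per-object position info fresh by re-detecting on a preset period."""

    def __init__(self, detect_positions, period_s=1.0):
        self._detect = detect_positions  # callable: () -> {identifier: (x, y, w, h)}
        self._period = period_s          # the preset update period
        self.positions = {}              # latest position info of every object
        self._stop = threading.Event()

    def _loop(self):
        while not self._stop.is_set():
            # Re-running detection keeps a moving participant (e.g. Zhang San
            # walking across the room) tracked at his current position.
            self.positions = self._detect()
            time.sleep(self._period)

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self):
        self._stop.set()

# Usage with a stub detector in place of real detection:
tracker = PositionTracker(lambda: {"Zhang San": (800, 300, 200, 400)}, period_s=0.1)
tracker.start()
time.sleep(0.3)
print(tracker.positions)  # position info of the objects in the first conference site
tracker.stop()
```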
The first conference terminal may acquire the close-up video picture according to the position information in several ways, specifically as follows:
In one possible implementation, a camera is provided on the first conference terminal, and the first conference terminal controls that camera according to the position information to acquire the close-up video picture.
In another possible implementation, the first conference terminal sends shooting parameters to a camera device, where the shooting parameters are derived from the position information, and the camera device acquires the close-up video picture according to the shooting parameters; the first conference terminal then receives the close-up video picture from the camera device.
In yet another possible implementation, the first conference terminal sends the position information itself to the camera device; the camera device uses the position information to derive the shooting parameters and uses the shooting parameters to acquire the close-up video picture; the first conference terminal then receives the close-up video picture from the camera device.
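The sketch below shows one plausible way shooting parameters could be derived from normalized position information, in the style of a pan-tilt-zoom camera; the field-of-view values, the zoom heuristic, and the JSON message format are invented for the example and are not specified by the application.
```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ShootingParams:
    pan_deg: float   # horizontal angle toward the object
    tilt_deg: float  # vertical angle toward the object
    zoom: float      # zoom factor so the object fills most of the close-up

def params_from_position(x_norm, y_norm, size_norm, fov_h=70.0, fov_v=40.0):
    """Map a position normalized to [0, 1] in the site picture to camera parameters."""
    pan = (x_norm - 0.5) * fov_h          # 0.5 is the optical center
    tilt = (0.5 - y_norm) * fov_v
    zoom = min(10.0, max(1.0, 0.5 / max(size_norm, 1e-3)))
    return ShootingParams(pan, tilt, zoom)

# The terminal derives the parameters itself (second implementation above) and
# sends them to the camera device; the wire format here is hypothetical.
params = params_from_position(x_norm=0.7, y_norm=0.4, size_norm=0.1)
message = json.dumps({"type": "capture_close_up", **asdict(params)})
print(message)
```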
The first conference terminal may also generate the second video picture from the close-up video picture in several ways:
In one possible implementation, the second video picture is simply the close-up video picture of the first object.
In another possible implementation, the first conference terminal generates the second video picture from both the first video picture and the close-up video picture. For example, the first video picture and the close-up video picture may be spliced together, or the close-up video picture may be embedded in the first video picture.
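Both compositions are easy to sketch. The example below treats pictures as numpy arrays; the corner placement, the scale, and the nearest-neighbour resize are arbitrary choices made for the illustration.
```python
import numpy as np

def embed_close_up(first, close_up, scale=0.3):
    """Embed the close-up into a corner of the first video picture (picture-in-picture)."""
    out = first.copy()
    h, w = first.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    # Nearest-neighbour resize via index sampling keeps the sketch dependency-free.
    ys = np.arange(ch) * close_up.shape[0] // ch
    xs = np.arange(cw) * close_up.shape[1] // cw
    out[h - ch:, w - cw:] = close_up[ys][:, xs]   # bottom-right corner
    return out

def splice(first, close_up):
    """Alternative composition: place the two pictures side by side."""
    h = min(first.shape[0], close_up.shape[0])
    return np.hstack([first[:h], close_up[:h]])

first_picture = np.zeros((1080, 1920, 3), dtype=np.uint8)
close_up_picture = np.full((400, 240, 3), 255, dtype=np.uint8)
second_picture = embed_close_up(first_picture, close_up_picture)
print(second_picture.shape)
```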
In one possible scenario, the first conference terminal may determine that the first object is within the first conference site as follows:
the first conference terminal performs target recognition on the video picture of the first conference site to obtain the set of objects in the first conference site, and then determines, according to that object set, that the first object is in the first conference site.
Target recognition on the video picture of the first conference site can be implemented in several ways. In one possible implementation, the first object is a participant of the first conference site, and the target recognition performed by the first conference terminal includes face recognition on the video picture of the first conference site.
In this manner, when the first object is a participant of the first conference site, face recognition can determine the first object uniquely, so target recognition on the video picture of the first conference site is more accurate.
In another possible implementation, each participant in the first conference site has a corresponding nameplate, and the target recognition performed by the first conference terminal includes character recognition on the nameplates in the video picture of the first conference site.
It should be noted that the object set in the first conference site may take several forms. In one possible case, it is the set of the texts of all nameplates in the first conference site; the first object is in the object set when the identifier of the first object is in that text set. In another possible case, it is the set of the biometric information of all objects in the first conference site; the first object is in the object set when the biometric information of the first object is in that biometric set. Because biometric information uniquely determines an object, indicating objects by their biometric information and checking whether the first object's biometric information is in the set determines more accurately that the first object is in the first conference site.
Determining, according to the object set of the first conference site, that the first object is in the first conference site can also be implemented in several ways, specifically as follows:
In one possible implementation, if the first conference terminal determines that the identifier corresponds to a unique object, that this object is the first object, and that the first object is in the object set of the first conference site, it determines that the first object is in the first conference site. Alternatively, if the identifier corresponds to a plurality of objects that include the first object, and the first object is in the object set of the first conference site, the first conference terminal acquires target biometric information corresponding to the identifier; if the target biometric information matches the biometric information of the first object, it determines that the first object is in the first conference site. Thus, when the identifier corresponds to a single object, the first object's presence in the first conference site can be confirmed directly; when the identifier corresponds to several objects, matching the target biometric information against the first object's biometric information confirms it. Whether the identifier corresponds to one object or several, the first object can therefore be determined accurately to be within the first conference site.
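The decision just described can be condensed into a small function. The dictionary shapes and the equality-based biometric match below are simplifying assumptions; a real system would compare feature vectors against a similarity threshold.
```python
def first_object_in_site(identifier, site_objects, id_to_templates, target_bio=None,
                         bio_match=lambda a, b: a == b):
    """Decide whether the first object is within the first conference site.

    site_objects:    object set recognized in the site's video picture, as
                     {identifier: biometric info of the recognized object}.
    id_to_templates: stored biometric templates per identifier; more than one
                     entry means the identifier is ambiguous.
    target_bio:      target biometric info corresponding to the identifier,
                     e.g. extracted from the video stream of the media stream.
    """
    if identifier not in site_objects:
        return False                  # not in the recognized object set at all
    templates = id_to_templates.get(identifier, [])
    if len(templates) <= 1:
        return True                   # identifier corresponds to a unique object
    # Ambiguous identifier: the target biometric info must match the
    # biometric info of the object recognized in the site.
    return target_bio is not None and bio_match(target_bio, site_objects[identifier])

# Unique identifier: presence in the site's object set is enough.
print(first_object_in_site("Zhang San",
                           site_objects={"Zhang San": b"bio-a"},
                           id_to_templates={"Zhang San": [b"bio-a"]}))
```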
It should be noted that the target biometric information corresponding to the identifier may also be extracted from the media stream. In one possible case, the media stream includes a voice stream and a video stream, the identifier is extracted from the voice stream, and the target biometric information is extracted from the video stream.
There are also other ways for the first conference terminal to determine that the first object is within the first conference site. For example, the first conference site corresponds to a preset object set in which the identifiers of the objects are not duplicated; if the first conference terminal determines that the first object is in the preset object set corresponding to the first conference site, it determines that the first object is in the first conference site.
The first conference terminal may acquire not only video pictures of the first conference site but also video pictures of other conference sites. One possible scenario is as follows:
the first conference terminal acquires a video stream from a second conference terminal located at a second conference site, where the picture of the video stream from the second conference terminal includes a second object in the second conference site; the first conference terminal then generates a third video picture from the video stream of the second video picture and the video stream from the second conference terminal, where the third video picture includes the first object and the second object. In this way the video pictures of multiple conference sites can be fused, making the director content more comprehensive.
The layout of the third video picture is not limited here. In one possible implementation, the third video picture includes a plurality of sub-pictures, among them a first sub-picture and a second sub-picture that contain the first object and the second object respectively; the sub-pictures are arranged in rows or columns, and the distance between any two adjacent sub-pictures is the same. The sub-pictures may also be arranged without rows and columns, or without a fixed distance between adjacent sub-pictures, according to actual requirements.
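For the row arrangement with identical spacing, the sub-picture rectangles can be computed as follows; the canvas size and gap are example values only.
```python
def layout_row(frame_w, frame_h, n_subs, gap=16):
    """Place n sub-pictures in one row with identical gaps between neighbours.

    Returns an (x, y, w, h) rectangle for each sub-picture inside a
    frame_w x frame_h canvas; the gap also pads the outer edges.
    """
    sub_w = (frame_w - gap * (n_subs + 1)) // n_subs
    sub_h = frame_h - 2 * gap
    return [(gap + i * (sub_w + gap), gap, sub_w, sub_h) for i in range(n_subs)]

# Two sub-pictures (first object and second object) in a 1920x1080 third picture:
print(layout_row(1920, 1080, 2))   # equal distance between adjacent sub-pictures
```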
In another possible scenario, the first conference terminal obtains the video stream of the third video picture from a server.
The server can also implement the following functions:
the server acquires the video stream from the first conference terminal, encodes or decodes it, and sends the encoded or decoded video stream to the second conference terminal; the server likewise acquires the video stream from the second conference terminal, encodes or decodes it, and sends the encoded or decoded video stream to the first conference terminal. The server thereby switches the video streams of the first conference terminal and the second conference terminal, so that each conference terminal can obtain the video streams of the other conference terminals.
It should be noted that the description above covers the method flow when the first video picture does not include the first object. The first video picture may instead include the first object, in which case the flow may be as follows:
if the first conference terminal determines that the first object is an object in the first conference site and the first video picture includes the first object, the first conference terminal sets a display duration for the first object and continuously acquires the close-up video picture of the first object throughout the display duration. Because the close-up video picture is acquired continuously during the display duration, the first object is guaranteed to remain in the video picture of the first conference terminal for that duration.
In addition, whether or not the first video picture includes the first object, the first conference terminal may send a video stream to other devices, as follows:
if the first conference terminal determines that the first object is an object in the first conference site and the first video picture includes the first object, the first conference terminal sends a video stream including the first video picture to a second conference terminal or a server;
if the first conference terminal determines that the first object is an object in the first conference site and the first video picture does not include the first object, the first conference terminal sends the video stream of the second video picture to the second conference terminal or the server.
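Putting the two branches together, a hypothetical terminal-side dispatch could look like the following sketch; the Terminal class, its method names, and the display duration are stand-ins invented for the example.
```python
import time

class Terminal:
    """Hypothetical first-conference-terminal interface; all methods are stubs."""
    def first_picture_includes(self, obj): return False
    def capture_close_up(self, obj): pass
    def generate_second_picture(self, obj): return f"second video picture<{obj}>"
    def send_stream(self, picture): print("sending:", picture)

def on_first_object_in_site(t, first_obj, display_s=5.0):
    if t.first_picture_includes(first_obj):
        # Already shown: keep acquiring the close-up for the display duration so
        # the first object stays in the director picture, then send the first picture.
        end = time.monotonic() + display_s
        while time.monotonic() < end:
            t.capture_close_up(first_obj)
            time.sleep(0.1)
        t.send_stream("first video picture")
    else:
        # Not shown: generate the second video picture and send that instead.
        t.send_stream(t.generate_second_picture(first_obj))

on_first_object_in_site(Terminal(), "Zhang San")
```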
In a second aspect, the present application provides a director control system. The system includes a first conference terminal located at a first conference site and a second conference terminal located at a second conference site. The first conference terminal is configured to acquire a media stream, and to acquire an identifier of a first object, where the identifier is extracted from the media stream; if the first object is an object in the first conference site and the first video picture does not include the first object, the first conference terminal generates a second video picture, where the second video picture includes the first object and the first video picture and the second video picture are director pictures of the first conference site. The second conference terminal is configured to acquire the video stream of the second video picture.
In one possible design, the system further includes a server;
the server is configured to acquire the video stream of the second video picture and a video stream from the second conference terminal, where the picture of the video stream from the second conference terminal includes a second object of the second conference site; and to generate a third video picture from the video stream of the second video picture and the video stream from the second conference terminal, where the third video picture includes the first object and the second object.
In one possible design, the server is further configured to obtain a video stream from the first conference terminal, encode or decode the video stream from the first conference terminal, and send the encoded video stream or the decoded video stream to the second conference terminal; and/or the server is further configured to acquire a video stream from the second conference terminal, encode or decode the video stream from the second conference terminal, and send the encoded video stream or the decoded video stream to the first conference terminal.
In one possible design, the server is configured to acquire a media stream and extract the identifier of the first object from the media stream; determine that the identifier of the first object is in a stored object information set, where the object information set includes information of a plurality of objects and the information of each object includes the identifier of that object; and send indication information to the first conference terminal, where the indication information indicates that the identifier extracted from the media stream is the identifier of the first object and that the identifier of the first object is in the object information set.
It should be noted that the servers in the system may be deployed independently or in a distributed manner, and in a distributed deployment the servers may be split by function. For example, the servers may be deployed as a first server and a second server, where the first server forwards and decodes or encodes video streams, and the second server extracts the identifier from the media stream and determines whether the identifier is in the object information set.
In one possible design, the first conference terminal is a master control terminal and the second conference terminal is a slave control terminal;
the second conference terminal is further configured to acquire the media stream and an identifier of a second object, where the identifier of the second object is extracted from the media stream and the second object is an object in the second conference site; and to acquire the video stream of the close-up video picture of the second object and send that video stream to the first conference terminal.
In a third aspect, the present application provides a director control apparatus, where the apparatus is a first conference terminal located at a first conference site. The apparatus includes: an acquisition module configured to acquire a media stream, and to acquire an identifier of a first object, where the identifier is extracted from the media stream; and a processing module configured to generate a second video picture if the first object is an object in the first conference site and the first video picture does not include the first object, where the second video picture includes the first object and the first video picture and the second video picture are director pictures of the first conference site.
In one possible design, the identifier of the first object is in a stored object information set, where the object information set includes information of a plurality of objects and the information of each object includes the identifier of that object.
In one possible design, the first conference site is any one of a plurality of conference sites of the conference that the first conference terminal has joined, and the object information set includes information of the objects in those conference sites.
In one possible design, the media stream is a video stream and/or a voice stream, and the identifier of the first object is determined according to an image recognition result and/or a voice recognition result, where the image recognition result is obtained by performing image recognition on the video picture corresponding to the video stream, and the voice recognition result is obtained by performing voice recognition on the voice stream.
In one possible design, the acquisition module is further configured to acquire a close-up video picture of the first object, and the processing module is specifically configured to generate the second video picture according to the close-up video picture.
In one possible design, the acquisition module is specifically configured to: acquire position information of the first object in the first conference site; and acquire the close-up video picture according to the position information.
In one possible design, the processing module is further configured to: perform target recognition on the video picture of the first conference site to obtain the object set in the first conference site; and determine, according to the object set in the first conference site, that the first object is in the first conference site.
In one possible design, the processing module is specifically configured to: if the identifier corresponds to a unique object that is the first object, and the first object is in the object set of the first conference site, determine that the first object is in the first conference site; or, if the identifier corresponds to a plurality of objects that include the first object, and the first object is in the object set of the first conference site, acquire target biometric information corresponding to the identifier, and if the target biometric information matches the biometric information of the first object, determine that the first object is in the first conference site.
In one possible design, the acquisition module is further configured to: acquire a video stream from a second conference terminal, where the second conference terminal is located at a second conference site and the picture of the video stream from the second conference terminal includes a second object in the second conference site; and generate a third video picture from the video stream of the second video picture and the video stream from the second conference terminal, where the third video picture includes the first object and the second object.
In one possible design, the third video picture includes a plurality of sub-pictures including a first sub-picture and a second sub-picture, the first sub-picture and the second sub-picture including the first object and the second object, respectively; the sub-pictures are arranged in rows or columns, and the distances between two adjacent sub-pictures are the same.
In a fourth aspect, there is provided an electronic device comprising: one or more processors; one or more memories; wherein the one or more memories store one or more computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of the first aspects above.
In a fifth aspect, there is provided a computer readable storage medium comprising computer instructions which, when run on a computer, cause the computer to perform the method of any one of the first aspects above.
In a sixth aspect, the present application provides a chip comprising a processor coupled to a memory for reading and executing a software program stored in the memory to implement the method of any one of the first aspects.
In a seventh aspect, the present application provides a computer program product which, when read and executed by a computer, causes the computer to perform the method of any one of the first aspects above.
The advantages of the second to seventh aspects follow from the advantages of the first aspect described above and are not repeated here.
Drawings
Fig. 1 is a schematic architecture diagram of a multicast control system according to an embodiment of the present application;
fig. 2A is a schematic block diagram of an AI server in an multicast control system according to an embodiment of the present application;
fig. 2B is a schematic block diagram of a media server in a multicast control system according to an embodiment of the present application;
fig. 3A is a schematic diagram of an interaction flow of each device in the multicast control system according to the embodiment of the present application;
fig. 3B is a schematic step flow diagram of a multicast control method according to an embodiment of the present application;
fig. 4A to fig. 4B are schematic views showing a display effect of a third video frame in the method for controlling a multicast according to the embodiments of the present application;
fig. 5 is a schematic diagram of video streaming transmission in a server full-adaptation mode in the method for controlling multicast according to the embodiment of the present application;
fig. 6 is a schematic step flow diagram of another method for controlling multicast according to an embodiment of the present application;
Fig. 7 is a schematic step flow diagram of another method for controlling multicast according to an embodiment of the present application;
fig. 8 is a schematic diagram of a display effect of a third video frame in another method for controlling a multicast according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a multicast control device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The terminology used in the following embodiments is for describing particular embodiments only and is not intended to limit the application. As used in the specification and the appended claims, the singular forms "a," "an," and "the" are intended to include expressions such as "one or more," unless the context clearly indicates otherwise. It should also be understood that in embodiments of the present application, "one or more" means one or more than two (including two); "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The term "plurality" in the embodiments of the present application means two or more, and for this reason, "plurality" may be also understood as "at least two" in the embodiments of the present application. "at least one" may be understood as one or more, for example as one, two or more. For example, including at least one means including one, two or more, and not limiting what is included. For example, at least one of A, B and C is included, then A, B, C, A and B, A and C, B and C, or A and B and C may be included. Likewise, the understanding of the description of "at least one" and the like is similar. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/", unless otherwise specified, generally indicates that the associated object is an "or" relationship.
Unless stated to the contrary, the embodiments of the present application refer to ordinal terms such as "first," "second," etc., for distinguishing between multiple objects and not for defining a sequence, timing, priority, or importance of the multiple objects.
Application scenario
Fig. 1 is a schematic diagram of the architecture of a director control system according to an embodiment of the present application.
The system architecture shown in fig. 1 includes a conference terminal 10 and servers (an AI server 20 and a media server 30).
The conference terminal 10 may include user equipment (UE), a wireless terminal device, a mobile terminal device, a user terminal device (user terminal), a computer device, and the like. The conference terminal 10 may be a split terminal device or an integrated terminal device. In a split terminal device, the terminal host and the other devices are separate components, where the other devices may include at least one of the following: a video input device (e.g., a camera), a video output device (e.g., a display screen), an audio input device (e.g., a microphone (MIC)), an audio output device (e.g., a speaker), and so on. An integrated terminal device means that the terminal device integrates the processor (central processing unit, CPU), the video input device, the video output device, the audio input device, the audio output device, and so on.
The conference terminal 10 is located at a conference site; for example, the first conference terminal 10 is located at a first conference site. The first conference site refers to the capture range of the video capture devices associated with the first conference terminal 10, for example a conference room. The capture range of a video capture device is the set of spatial ranges that the device can capture. A video capture device associated with the first conference terminal 10 is one capable of data interaction with the first conference terminal 10, for example a video capture device provided on the first conference terminal 10 and/or a video capture device provided separately from the first conference terminal 10 but able to communicate with it. The video capture device is, for example, a video camera or a still camera. For example, if the video capture devices associated with the first conference terminal 10 include a first video capture device provided on the first conference terminal 10 and a second video capture device provided independently of it, the first conference site includes the capture range of the first video capture device and the capture range of the second video capture device. It should be noted that one conference site may contain several conference terminals 10; this application takes one conference terminal 10 per site as an example. The first conference site may contain one or more objects, such as meeting participants, a blackboard, or a trophy.
The server may be a physical server, a virtual server running on a computer device, or the like, and the number of servers may be one or more. For example, a server may be a single server or a server node in a server cluster. The servers may be deployed centrally or in a distributed manner, and in a distributed deployment they may be split by function. The architecture shown in fig. 1 takes servers deployed in a distributed manner by function as an example, and includes an artificial intelligence (AI) server 20 and a media server 30. The AI server 20 implements the AI functions, and the media server 30 transports media streams. In some cases the director control system may not include the AI server 20 or the media server 30. For example, when the conference terminals 10 themselves have the AI functions, the director control system need not include the AI server 20; when the conference terminals 10 can send media streams to one another directly, the director control system need not include the media server 30.
The functions of the conference terminal 10 and the server are described in detail below.
Taking the first conference terminal 10 as an example, the first conference terminal 10 is configured to acquire a media stream.
The media stream may include one or more of a video stream, a voice stream, image information, text information, bullet-screen information, information of a shared interface, or screen-casting information. The first conference terminal 10 may acquire a media stream from the first conference site, or acquire media streams captured by other conference terminals 10, for example from the second conference terminal 10.
The first conference terminal 10 is also configured to send media streams to other conference terminals 10 or to a server. For example, the first conference terminal 10 may send a media stream directly to the second conference terminal 10, or send it to a server, which then forwards it to the second conference terminal 10.
The first conference terminal 10 is also used to generate video pictures. A video picture generated by the first conference terminal 10 may be played directly, or not played at all, with only the video stream of the generated picture sent to other conference terminals 10 or servers. The video picture may be the director picture of the first conference site, that is, the video picture of the first conference site that the first conference terminal 10 outputs to the other conference terminals 10. The first conference terminal 10 may add an identifier to the video stream of the director picture to indicate that the video picture corresponding to that video stream is the director picture of the first conference site. The director picture of the first conference site presents the key objects of the first conference site; which object or objects of the first conference site are key objects may be determined by the director policy settings of the first conference terminal 10. For example, the director policy of the first conference terminal 10 may be configured to include a default mode and a control mode. In the default mode, the first conference terminal 10 generates a first video picture as the director picture of the first conference site, generated according to the specific configuration of the default mode; the key objects may be configured as one or more default objects, the first video picture then includes those default objects, and it may be a close-up video picture of them. When the trigger condition of the default mode is met, the default mode switches to the control mode: when a first object in the first conference site is mentioned in the media stream (e.g., in the voice stream) and the first video picture does not include the first object, the first conference terminal 10 triggers the control mode and the key object becomes the first object. The first conference terminal 10 may then generate a second video picture as the director picture of the first conference site, which includes the first object, for example a close-up video picture of the first object. A sketch of this mode switch follows.
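The default/control switch can be pictured as a small state machine; the class and trigger below are a hypothetical reading of the policy described above, not an implementation prescribed by the application.
```python
from enum import Enum, auto

class DirectorMode(Enum):
    DEFAULT = auto()  # director picture shows the configured default objects
    CONTROL = auto()  # director picture follows the object mentioned in the stream

class DirectorPolicy:
    """Minimal sketch of the default-mode/control-mode policy."""

    def __init__(self, default_objects):
        self.mode = DirectorMode.DEFAULT
        self.key_objects = set(default_objects)

    def on_object_mentioned(self, obj, in_current_picture):
        # An object of the site mentioned in the media stream but absent from
        # the current director picture triggers the control mode.
        if not in_current_picture:
            self.mode = DirectorMode.CONTROL
            self.key_objects = {obj}

policy = DirectorPolicy(default_objects={"presenter"})
policy.on_object_mentioned("Zhang San", in_current_picture=False)
assert policy.mode is DirectorMode.CONTROL and policy.key_objects == {"Zhang San"}
```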
In the control mode, the first conference terminal 10 may also generate a combined director picture, that is, a director picture containing objects of several conference sites. For example, the first conference terminal 10 may also generate a third video picture including the first object and a second object, where the second object is located at a second conference site; the third video picture may be generated from the director picture of the first conference site and the director picture of the second conference site received from the second conference terminal 10.
From the above description it is clear that one possible reason for the first conference terminal 10 to switch from the default mode to the control mode is that the first object is mentioned in the media stream, so determining that the first object is mentioned in the media stream is the key to director control. In the director control system provided in this application, the first conference terminal 10 may determine by itself whether the first object is mentioned in the media stream, or determine it by interacting with a server. The server functions are described below for a distributed deployment into the AI server 20 and the media server 30.
The AI server 20 is configured to acquire the media stream and extract the identifier of the first object from the media stream. There are various ways of extracting the identifier: as noted above, the media stream may include one or more of a video stream, a voice stream, image information, text information, bullet-screen information, information of a shared interface, or screen-casting information, and any of these may be used to extract the identifier of the first object. Take the media stream being a video stream and/or a voice stream as an example. The identifier of the first object is determined according to an image recognition result and/or a voice recognition result, where the image recognition result is obtained by performing image recognition on the video picture corresponding to the video stream, and the voice recognition result is obtained by performing voice recognition on the voice stream. For example, suppose the identifier of the first object is the person name "Zhang San". The media stream may be a video introducing the person Zhang San, from which Zhang San's biometric information is obtained, and the identifier "Zhang San" is then obtained from the correspondence between biometric information and person names. The media stream may instead be a segment of a voice stream such as "Zhang San, what is your opinion", or it may be both. Other cases of the media stream are also possible; for example, the media stream may include the text information "Zhang San" on the video picture, and the AI server 20 may perform semantic recognition on the text information and extract the identifier "Zhang San" of the first object from it.
The identifier-extraction function of the AI server 20 may be implemented by deployed function modules. As shown in fig. 2A, multiple types of recognition modules may be deployed in the AI server 20, for example a biometric recognition module 201, a voice recognition module 202, and a semantic recognition module 203. The biometric recognition module 201 performs biometric recognition related to the first object on the video stream to obtain biometric information and the corresponding identifier; the voice recognition module 202 performs voice recognition on the voice stream; and the semantic recognition module 203 extracts the identifier "Zhang San" of the first object from text information. These modules in the AI server 20 may be implemented in software or in hardware (for example by a chip), or partly in software and partly in hardware. A toy version of this pipeline is sketched below.
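The stand-in functions below merely mirror the module split of fig. 2A; real modules would run face recognition, speech recognition, and NLP models, so everything here is an illustrative assumption.
```python
def biometric_recognition(frames, known_ids):       # stands in for module 201
    return set()                                    # no faces in this toy input

def speech_recognition(audio):                      # stands in for module 202
    return audio                                    # pretend audio is its transcript

def semantic_recognition(text, known_ids):          # stands in for module 203
    return {name for name in known_ids if name in text}

def extract_identifiers(media_stream, known_ids):
    """Collect identifiers from whichever components the media stream contains."""
    ids = set()
    if "video" in media_stream:
        ids |= biometric_recognition(media_stream["video"], known_ids)
    if "voice" in media_stream:
        transcript = speech_recognition(media_stream["voice"])
        ids |= semantic_recognition(transcript, known_ids)
    if "text" in media_stream:
        ids |= semantic_recognition(media_stream["text"], known_ids)
    return ids

print(extract_identifiers({"voice": "Zhang San, what is your opinion"},
                          known_ids={"Zhang San"}))   # -> {'Zhang San'}
```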
The AI server 20 is configured to determine whether the identifier of the first object is in a stored object information set, where the object information set includes information of a plurality of objects and the information of each object includes the identifier of that object. The specific form of the object information set is not limited. For example, the object information set may be a data table in a database, where the information of each object is a data record in the table and the identifier of the object is a field of the record. When the objects in the object information set are all persons, the identifier may be a name; the other fields are not limited and may include, for example, the biometric information of the object.
How the objects in the object information set are selected is likewise not limited. For example, the first conference site is any one of a plurality of conference sites of the conference that the first conference terminal 10 has joined, and the object information set includes information of the objects in those conference sites; when the objects in the set are all persons, the set may cover the participants of every conference site of the conference. Alternatively, the set may cover all objects of one organization, for example all employees of a company.
The AI server 20 may also be used to send indication information to the first conference terminal 10, where the indication information indicates the identifier of the first object extracted from the media stream and that the identifier of the first object is in the object information set. It should be noted that determining whether the identifier of the first object is in the stored object information set need not go through the AI server 20: the first conference terminal 10 may itself extract the identifier from the media stream and find, in the object information set, the information of the object that includes the identifier of the first object. In other words, the first conference terminal may determine that the first object is mentioned in the media stream either by interacting with the AI server 20 or by itself.
If the first conference terminal determines that the first object is mentioned in the media stream, it can then determine whether to generate a second video picture including the first object. For example, the first conference terminal 10 is also used to determine whether the first object is within the first conference site and whether the first video picture includes the first object. The first conference terminal 10 generates the second video picture after determining that the first object is within the first conference site and the first video picture does not include the first object. The specific process is described in detail later.
After the first conference terminal 10 generates the second video picture, so that the other conference terminals 10 can also obtain the second video picture, the first conference terminal 10 may also be used to send the video stream of the second video picture to other conference terminals 10 or to the media server 30.
The media server 30 is used to decode or encode media streams and to forward them. The media server 30 may decode or encode an acquired media stream and then forward the result, or forward the acquired media stream directly. The media server 30 may also composite acquired media streams before forwarding them. Taking the media stream being a video stream as an example, several typical modes in which the media server 30 forwards media streams are described below.
Full switching mode:
the media server 30 acquires the video stream of each conference terminal 10 and sends, to each conference terminal 10, the video streams of all conference terminals 10 other than that terminal itself. In this mode, the video stream of each conference terminal 10 may be encoded or decoded and then sent, or sent directly. For example:
the media server 30 acquires the video stream from the first conference terminal 10, encodes or decodes it, and sends the encoded or decoded video stream to the second conference terminal 10; the media server 30 likewise acquires the video stream from the second conference terminal 10, encodes or decodes it, and sends the encoded or decoded video stream to the first conference terminal 10.
In the full switching mode, the first conference terminal 10 may also be used to generate a third video picture from the second video picture and the video stream from the second conference terminal 10. For example, the first conference terminal 10 may acquire the video stream of a fourth video picture and the video stream of a fifth video picture, and generate the third video picture from the video stream of the second video picture, the video stream of the fourth video picture, and the video stream of the fifth video picture. The video stream of the fourth video picture comes from the second conference terminal 10, the fourth video picture includes a second object, and the second conference terminal 10 and the second object are located at the second conference site; the video stream of the fifth video picture comes from a third conference terminal 10, the fifth video picture includes a third object, and the third conference terminal 10 and the third object are located at a third conference site. The third video picture includes the first object, the second object, and the third object.
Full adaptation mode:
the media server 30 acquires the video streams of the conference terminals 10, generates a video stream of the combined guide frame, and transmits the video stream of the combined guide frame to the conference terminals 10. In this mode, the video streams of the conference terminals 10 may be encoded or decoded and then transmitted, or may be transmitted directly. For example:
the media server 30 obtains a video stream from the second conference terminal 10, and a second object in the second conference site is included in a picture of the video stream from the second conference terminal 10; the media server 30 generates a third video frame from the video stream of the second video frame and the video stream from the second conference terminal 10, the third video frame including the first object and the second object; the media server 30 transmits a video stream of the third video picture to the first conference terminal 10 and the second conference terminal 10.
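The compositing step of the full adaptation mode can be sketched as follows; this is only an assumed horizontal-tiling composition with decoded frames represented as numpy arrays, not the patent's actual implementation.

```python
# A minimal sketch of the full adaptation mode, assuming decoded frames are
# numpy arrays of identical height. The tiling and all names are assumptions.
import numpy as np

def compose_guide_frame(frames: list) -> np.ndarray:
    """Splice per-site frames side by side into one combined guide picture."""
    return np.hstack(frames)

def full_adaptation(frames_by_terminal: dict) -> dict:
    combined = compose_guide_frame(list(frames_by_terminal.values()))
    # Every terminal receives the same combined guide picture.
    return {terminal: combined for terminal in frames_by_terminal}

frames = {t: np.zeros((720, 640, 3), dtype=np.uint8) for t in ("T1", "T2", "T3")}
assert full_adaptation(frames)["T1"].shape == (720, 1920, 3)
```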
Hybrid mode:
the media server 30 acquires the video streams of the conference terminals 10 and generates a video stream of a combined guide picture. The media server 30 transmits to one part of the conference terminals 10 the video streams of the other conference terminals 10, and transmits the video stream of the combined guide picture to the other part of the conference terminals 10. For example:
The media server 30 sends a video stream of the third video picture to the first conference terminal 10 and a video stream of the second video picture to the second conference terminal 10.
The functions of acquiring and transmitting media streams in the media server 30 may be implemented by deploying functional modules. The modules deployed in the media server 30 may be implemented in software or in hardware, or partly in software and partly in hardware. For example, as shown in fig. 2B, when the media server 30 is a multipoint control unit (MCU), a multipoint controller (MC) 301 and a multipoint processor (MP) 302 are deployed in the MCU, where the MC 301 is used for signaling exchange and signaling control, and the MP 302 is used for acquiring and transmitting media streams.
The functions of the conference terminals 10 and the server may be embodied in the interaction flow of the devices in the director control system. Fig. 3A is a schematic diagram of the interaction flow of the devices in the director control system provided in an embodiment of this application. Fig. 3A shows only one possible interaction flow; other possible interaction flows may be obtained as variations of the flow in fig. 3A.
Step 301: the first conference terminal obtains the media stream.
Step 302: the first conference terminal sends a media stream to the AI server.
Step 303: the AI server extracts an identification of a first object in the media stream and determines whether the identification of the first object is in a stored set of object information.
Step 304: the AI server sends indication information to the first conference terminal.
Steps 302-304 are optional steps. For example, the identification of the first object in the media stream may instead be extracted by the first conference terminal itself, which then determines whether the identification is in the stored object information set.
Step 305: the first conference terminal determines that the first object is in the first conference site, and the first object is not included in the first video picture.
It should be noted that step 305 is an optional step, and only one possible scenario is described. For example, the first conference terminal may determine that the first object is not within the first conference site, and may also determine that the first object is within the first conference site and the first object is included in the first video frame.
Step 306: the first conference terminal generates a second video picture including the first object.
Step 307: the second conference terminal sends a video stream of a fourth video picture to the media server, wherein the fourth video picture comprises a second object.
Step 308: the third conference terminal sends a video stream of a fifth video picture to the media server, wherein the fifth video picture comprises a third object.
Step 309: the media server sends a video stream of the fourth video picture and a video stream of the fifth video picture to the first conference terminal.
Step 310: the first conference terminal generates a third video picture according to the video streams of the second video picture, the fourth video picture and the fifth video picture. Wherein the third video frame includes a first object, a second object, and a third object. Steps 307 to 310 are optional steps, and in steps 307 to 310, the operation mode of the media server is a full exchange mode. Accordingly, in other working modes of the media server, there may be different interaction flows, and the interaction flows corresponding to the other working modes may refer to the descriptions of steps 307 to 310.
Fig. 3A illustrates the director control method provided by an embodiment of this application from the perspective of multi-device interaction. To describe the implementation of some of the steps shown in fig. 3A in more detail, the director control method is described below from the perspective of the first conference terminal 10. Fig. 3B is a flowchart of the steps of the director control method provided in an embodiment of this application. The steps shown in fig. 3B may be performed by a conference terminal 10 shown in fig. 1, for example by the first conference terminal, as follows:
Step 401: the first conference terminal obtains the media stream.
Step 402: the first conference terminal determines whether an identification of the first object is in the set of object information.
The identification of the first object is extracted from the media stream.
If yes, go to step 403; otherwise, ending the flow.
Step 403: the first conference terminal determines whether the first object is within the first conference site.
If yes, go to step 404; otherwise, ending the flow.
Step 404: the first conference terminal determines whether the first video picture includes a first object.
The first video picture is a guide picture of the first meeting place.
If yes, go to step 405-406; otherwise, steps 407 to 408 are performed.
Step 405: and within the display duration, the first conference terminal acquires the close-up video picture of the first object.
Step 406: the first conference terminal sends a video stream of a close-up video picture of the first object to the second conference terminal.
Step 407: the first conference terminal generates a second video picture.
Step 408: the first conference terminal sends a video stream of the second video picture to the second conference terminal.
Step 409: the first conference terminal obtains a video stream of the fourth video picture.
Step 410: the first conference terminal generates a third video picture.
The media stream obtained in step 401 is not particularly limited here; it may be a video stream and/or a voice stream. The identification of the first object may be determined according to an image recognition result of the video stream and/or a voice recognition result of the voice stream. Neither the first object nor the identification is particularly limited. The first object may be a person or a thing, such as a blackboard, a banner, or a trophy. Taking a person as an example, the identification of the first object may be the person's name, position, job number, and so on. For example, the first object's name is Zhang San, his position is marketing department manager, and his job number is S1001.
The image recognition result is obtained by performing image recognition on the video picture corresponding to the video stream. For example, if the video stream is a personal introduction video of Zhang San, biometric recognition may be performed on the person in the corresponding video picture to obtain biometric information, such as facial feature information or iris feature information. The voice recognition result is obtained by performing voice recognition on the voice stream. For example, if the voice stream is "Zhang San, your opinion, please", the name Zhang San can be extracted from it.
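A minimal sketch of identifier extraction from a voice stream is given below; the transcription itself is assumed to come from whatever speech recognizer the deployment uses, and all names here are illustrative.

```python
# A sketch of extracting an identification from a speech transcript by
# matching it against known names. The transcript is assumed to be produced
# by an external speech recognizer (not shown).
import re
from typing import Optional

def extract_identifier_from_voice(transcript: str, known_names: set) -> Optional[str]:
    """Return the first known name mentioned in the speech transcript."""
    for name in known_names:
        if re.search(re.escape(name), transcript):
            return name
    return None

transcript = "Zhang San, your opinion, please"
print(extract_identifier_from_voice(transcript, {"Zhang San", "Li Si"}))  # Zhang San
```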
Step 402 is an optional step. For example, the first conference terminal may perform step 403 directly after acquiring the identification of the first object, without determining whether the identification is in the stored object information set.
The object information set in step 402 may be stored locally at the first conference terminal. The set comprises information of a plurality of objects, and the information of each object comprises the identification of the object, and possibly other attribute information, for example the object's biometric information. How the objects in the object information set are selected is also not particularly limited. In one possible design, the first conference site is any one of a plurality of conference sites of the conference to which the first conference terminal is connected, and the object information set comprises information of the objects in these conference sites; for example, when the objects in the set are all persons, the object information set may contain the participants at all conference sites of the conference. Other selections are also possible, for example all objects of one organization: when the objects in the set are all persons, the set may contain all employees of a company.
For example, the object information set contains the information of all participants belonging to company A. Company A has 20 employees, 8 of whom are participants; the object information set may then take the form shown in Table 1:
table 1: object information set (identification configured to allow repetition)
Identification of objects Biometric information
Zhang San Biometric information 1
Li Si Biometric information 2
Wang Wu Biometric information 3
Zhang San Biometric information 4
Zhao Liu Biometric information 5
Xiao Hong Biometric information 6
Little bright Biometric information 7
Xiaoming (Ming) Biometric information 8
Table 2: object information set (identification is configured not to allow repetition)
The identifiers in the object information set may be configured to allow repetition or not to allow repetition. In the example shown in Table 1, identifiers are allowed to repeat, so two objects may both be identified as Zhang San. Identifiers may instead be configured not to allow repetition: even if two objects share the same name, they can be distinguished by numbering or the like. As shown in Table 2, biometric information 1 corresponds to Zhang San 1 and biometric information 4 corresponds to Zhang San 2. In this case, the voice stream may be "Zhang San 1, your opinion, please".
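By way of illustration, the two configurations of Tables 1 and 2 and the lookup of step 402 can be sketched as follows; the class and all names are assumptions, not the patent's data structures.

```python
# A sketch of an object information set that can either allow or reject
# duplicate identifiers, with the step-402 lookup. All names are illustrative.
from dataclasses import dataclass

@dataclass
class ObjectInfo:
    identifier: str   # e.g. a name such as "Zhang San" or "Zhang San 1"
    biometric: str    # placeholder for facial/iris feature data

class ObjectInfoSet:
    def __init__(self, allow_duplicate_ids: bool):
        self.allow_duplicate_ids = allow_duplicate_ids
        self.entries = []

    def add(self, info: ObjectInfo) -> None:
        if not self.allow_duplicate_ids and any(
                e.identifier == info.identifier for e in self.entries):
            raise ValueError(f"identifier {info.identifier!r} already present")
        self.entries.append(info)

    def lookup(self, identifier: str) -> list:
        """Step 402: find all stored objects whose identifier matches."""
        return [e for e in self.entries if e.identifier == identifier]

table1 = ObjectInfoSet(allow_duplicate_ids=True)
table1.add(ObjectInfo("Zhang San", "Biometric information 1"))
table1.add(ObjectInfo("Zhang San", "Biometric information 4"))
print(len(table1.lookup("Zhang San")))  # 2 candidates -> needs disambiguation
```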
The specific implementation of step 402 may be as follows:
in one possible implementation, the object information set may be stored in a server. The first conference terminal sends the media stream to the server; the server extracts the identification of the first object from the media stream and searches the object information set for information of an object that includes this identification. The first conference terminal then receives indication information from the server, where the indication information indicates that the identification extracted from the media stream is the identification of the first object and that this identification is in the object information set.
The form of the indication information is not limited here. For example, when the media stream is a voice stream, the indication information may be "Zhang San". When the media stream includes both a voice stream and a video stream, the indication information may be "Zhang San" together with facial feature information 1.
In another possible implementation, the set of object information may be stored in the first conference terminal. The first conference terminal extracts the identification of the first object from the media stream; the first conference terminal finds information of an object including an identification of the first object in the object information set.
The specific implementation of step 403 may be as follows:
the first conference terminal carries out target identification on the video picture of the first conference site to obtain an object set in the first conference site; and the first conference terminal determines that the first object is in the first conference place according to the object set in the first conference place.
There are various ways for the first conference terminal to perform target recognition on the video picture of the first conference site. In one possible implementation, the first object is a participant of the first conference site, and the target recognition comprises: the first conference terminal performs face recognition on the video picture of the first conference site.
For example, the first conference terminal may detect two human figures in the video picture of the first conference site, perform face recognition on both, and determine that the objects in the first conference site are the object corresponding to facial feature information 3 and the object corresponding to facial feature information 4. In this implementation, each recognized object is a uniquely determined object.
In another possible implementation, each participant in the first conference site has a corresponding nameplate, and the target recognition comprises: the first conference terminal performs character recognition on the nameplates in the video picture of the first conference site.
For example, the first conference terminal performs character recognition on the nameplates in the video picture of the first conference site and recognizes the texts "Zhang San" and "Li Si". In this implementation, a recognized object may not be a uniquely determined object.
It should be noted that the object set in the first conference site may take several forms. In one possible case, no two objects in the object information set share the same identification. The object set in the first conference site is then the set of texts on all nameplates in the first conference site, and the first object being in the object set of the first conference site means that the identification of the first object is in this text set.

Taking the object information set shown in Table 2 as an example, suppose the object set in the first conference site is {Zhang San 1, Li Si}. If the identification of the first object is Zhang San 1, the first object is in the object set of the first conference site.

In another possible case, objects with the same identification may exist in the object information set. The object set in the first conference site is then the set of biometric information of all objects in the first conference site, and the first object being in the object set of the first conference site means that the biometric information of the first object is in this biometric information set.

Taking the object information set shown in Table 1 as an example, the first conference terminal recognizes all persons in the site and obtains the biometric information set {biometric information 1 (Zhang San), biometric information 2 (Li Si)}. When the biometric information of the first object is obtained as biometric information 1, it can be determined that this biometric information is in the biometric information set.
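The target recognition that builds this in-site object set can be sketched as follows; the two recognizers are hypothetical stubs standing in for whatever face recognition and OCR components the deployment actually uses.

```python
# A sketch of step 403's target recognition, building the object set of the
# first conference site from a venue frame. recognize_faces and
# read_nameplates are stubs for real recognition components.
def recognize_faces(frame) -> set:
    # Stub: a real system would run face detection and feature matching here.
    return {"biometric information 1", "biometric information 2"}

def read_nameplates(frame) -> set:
    # Stub: a real system would run text detection/OCR on the nameplates here.
    return {"Zhang San", "Li Si"}

def build_venue_object_set(frame, mode: str) -> set:
    if mode == "face":
        return recognize_faces(frame)    # set of biometric information
    if mode == "nameplate":
        return read_nameplates(frame)    # set of nameplate texts
    raise ValueError(f"unknown recognition mode: {mode}")

print(build_venue_object_set(None, "nameplate"))  # {'Zhang San', 'Li Si'}
```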
There are various implementations for determining, according to the object set in the first conference site, that the first object is in the first conference site, for example as follows:
in a possible implementation manner, if the first conference terminal determines that the unique object corresponding to the identifier is the first object and the first object is in the object set in the first conference site, the first object is determined to be in the first conference site.
Taking the object information set shown in Table 1 as an example, when the identification of the first object is Li Si, this identification corresponds to the unique biometric information 2 in the object information set, so the identification corresponds to a unique object, the first object. Since Li Si is in the object set of the first conference site, the Li Si in the first conference site is the first object; that is, the first object is in the first conference site.
In another possible implementation manner, if the first conference terminal determines that the identifier corresponds to a plurality of objects, where the plurality of objects includes the first object, and the first object is in an object set in the first conference site, the target biometric information corresponding to the identifier is obtained; and if the first conference terminal determines that the target biological characteristic information is matched with the biological characteristic information of the first object, determining that the first object is in the first conference site.
Taking the object information set shown in Table 1 as an example, when the identification of the first object is Zhang San, the corresponding biometric information in the object information set includes biometric information 1 and biometric information 4, and the object set of the first conference site includes biometric information 1. The target biometric information of Zhang San then needs to be acquired to further determine whether the first object is the Zhang San corresponding to biometric information 1. For example, when the acquired target biometric information is biometric information 1', and biometric information 1' matches biometric information 1, it can be determined that the first object is the Zhang San in the first conference site; that is, the Zhang San corresponding to biometric information 1 is in the first conference site.
It should be noted that the target biometric information corresponding to the identification of the first object may also be extracted from the media stream. For example, when the media stream includes a voice stream and a video stream, the identification may be extracted from the voice stream and the target biometric information from the video stream.
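The two decision branches above can be sketched as follows; the matching function is a stub for a real feature-distance comparison, and all names are illustrative assumptions.

```python
# A sketch of the step-403 decision logic: unique-identifier membership check,
# or disambiguation via the target biometric information when several objects
# share the identification.
from typing import Optional

def biometric_match(a: str, b: str) -> bool:
    # Stub: a real system compares feature vectors against a threshold; here
    # "biometric information 1'" is treated as matching "biometric information 1".
    return a.rstrip("'") == b.rstrip("'")

def is_in_first_venue(candidates: list,          # biometrics stored under the identification
                      venue_biometrics: set,     # biometrics recognized in the venue
                      target_biometric: Optional[str]) -> bool:
    if len(candidates) == 1:
        # The identification corresponds to a unique object: membership suffices.
        return candidates[0] in venue_biometrics
    # Several objects share the identification: disambiguate with the target
    # biometric information extracted from the media stream.
    if target_biometric is None:
        return False
    return any(c in venue_biometrics and biometric_match(target_biometric, c)
               for c in candidates)

# "Zhang San" maps to biometric information 1 and 4; only 1 is in the venue.
print(is_in_first_venue(["biometric information 1", "biometric information 4"],
                        {"biometric information 1", "biometric information 2"},
                        "biometric information 1'"))  # True
```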
The specific implementation of step 403 above is given by way of example only; there are many other ways for the first conference terminal to determine that the first object is in the first conference site. For example, the first conference site may correspond to a preset object set in which no identification is repeated; if the first conference terminal determines that the first object is in the preset object set corresponding to the first conference site, it determines that the first object is in the first conference site.
Step 404: the first conference terminal determines whether the first video picture includes a first object.
The first video picture is a guide picture of the first conference site, that is, the video picture of the first conference site that the first conference terminal outputs to the other conference terminals. It can be used to present the key objects in the first conference site and to guide viewers at other conference sites in watching the first conference site. The key objects in the first conference site may be determined according to the director policy of the first conference terminal. For example, the director policy may be: when no identification of a first object is acquired, or the first object is not in the first conference site, the key object of the first conference site is a default object in the first conference site (for example, the person in charge of a department); when the identification of the first object is acquired and the first object is in the first conference site, the key object of the first conference site is the first object. Clearly, when the identification of the first object is extracted from the media stream and the first object is in the first conference site, the first conference terminal needs to output a video picture that includes the first object. If the first object is not in the guide picture (the first video picture) of the first conference site, the first object needs to be supplemented into the guide picture of the first conference site.
For example, zhang San (corresponding to biometric information 1) and Lifour are located in the first session, the first object is identified as "Zhang San", the first video is a close-up video of Lifour, and Zhang San is not in the first video, then it is necessary to supplement Zhang San into the director's view of the first session.
There are a number of possible scenarios for step 405.
In one possible case, the first conference terminal acquires a video picture of the first conference site that includes a plurality of objects in the first conference site, and the first conference terminal intercepts the close-up video picture of the first object from this video picture. In this case, the sharpness of the close-up video picture of the first object may not meet the set requirement, for example the number of pixels of the close-up video picture is smaller than a set number.
In another possible case, the first conference terminal acquires the position information of the first object at the first conference site, and the first conference terminal acquires the close-up video picture of the first object according to the position information.
In a possible implementation manner in this case, a camera is set on the first conference terminal, and the first conference terminal controls the camera to acquire the close-up video picture according to the position information.
For example, the first conference terminal may control and adjust the pose of the camera according to the position information so that the camera faces the head of Zhang San (corresponding to biometric information 1), focus on the head of Zhang San, and take the video shot at this time as the close-up video picture of the first object. Clearly, in this case the sharpness of the close-up video picture of the first object is more likely to meet the set requirement.
The above implementation of capturing a close-up video picture of the first object may also be applied in step 407. In addition, other implementations of the situation are possible, which will be described in more detail later in step 407.
Step 405 ensures that a video stream of the close-up video picture of the first object is obtained throughout the display duration, thereby improving the accuracy of the director control.
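The two close-up strategies of step 405 can be contrasted with a small sketch; the frame representation, the pixel-count threshold, and all names are illustrative assumptions, with the "set number of pixels" criterion mirroring the sharpness requirement above.

```python
# A sketch of the crop-based close-up and its sharpness check; when the crop
# is too small, camera control (see the PTZ sketch in step 407) is the
# fallback. Threshold and sizes are assumptions.
import numpy as np

MIN_PIXELS = 320 * 240  # assumed sharpness requirement (set number of pixels)

def crop_close_up(venue_frame: np.ndarray, bbox: tuple) -> np.ndarray:
    """Cut the first object's region out of the whole-venue picture."""
    x, y, w, h = bbox
    return venue_frame[y:y + h, x:x + w]

venue = np.zeros((1080, 1920, 3), dtype=np.uint8)
close_up = crop_close_up(venue, (600, 200, 200, 150))
if close_up.shape[0] * close_up.shape[1] < MIN_PIXELS:
    # Cropping is too coarse: steer a camera at the object's position instead.
    print("crop below sharpness requirement; use camera control")
```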
Step 406 is an optional step, for example, the first conference terminal may also send a video stream of the second video frame to the media server, and then send the video stream of the second video frame to the second conference terminal through the media server.
The implementation steps of step 407 may specifically be as follows:
Step 4071: and the first conference terminal acquires the position information of the first object at the first conference site.
Step 4072: and the first conference terminal acquires the close-up video picture of the first object according to the position information.
Step 4073: the first conference terminal generates the second video picture according to the close-up video picture of the first object.
The second video picture is a guide picture of the first meeting place.
In step 4071, the first conference terminal may acquire the position information of the first object at the first conference site in various manners, which may be specifically as follows:
in one possible design, before the first conference terminal obtains the location information of the first object at the first conference site, the method further includes: and the first conference terminal acquires the position information of each object in the first conference place.
For example, the first conference terminal may acquire a video frame of the first conference site, and perform face recognition on objects in the first conference site, so as to acquire location information of each object in the first conference site. The location information of each object in the first conference place may be updated, and in one possible design, the first conference terminal updates the location information of each object in the first conference place according to a preset period, so as to track the location information of each object in the first conference place in real time.
In step 4072, the close-up video picture of the first object may contain only the first object, or it may also include other objects. For example, when other objects are included, the first object is the object with the largest display area in the close-up video picture.
The specific implementation of step 4072 may be as follows:
in a possible implementation manner, a camera is set on the first conference terminal, and the first conference terminal controls the camera to acquire the close-up video picture according to the position information.
For example, the first conference terminal may control and adjust the pose of the camera according to the position information so that the camera faces the head of Zhang San (corresponding to biometric information 1), focus on the head of Zhang San, and take the video shot at this time as the close-up video picture of the first object.
The above implementation describes the case where a camera is provided on the first conference terminal. Alternatively, the first conference terminal may not be provided with a camera, and may acquire the close-up video picture of the first object through data transmission with a camera device.
In another possible implementation manner, the first conference terminal may send an image capturing parameter to an image capturing device, where the image capturing parameter is obtained according to the location information, and the close-up video frame is obtained by the image capturing device according to the image capturing parameter; the first conference terminal receives the close-up video picture from the camera device.
For example, the imaging parameters may include pose data, focal length multiple, and the like, and the imaging device may adjust the pose of the imaging device according to the pose data, so that the lens direction of the imaging device faces the first object, and may further amplify the focal length multiple, so that the first object is amplified in the video frame, thereby obtaining a close-up video frame of the first object. In this implementation, the first conference terminal obtains the shooting parameters and sends the shooting parameters to the shooting device, and the shooting device obtains the close-up video picture of the first object.
In another possible implementation manner, the first conference terminal may also send the location information to a camera device, where the location information is used by the camera device to obtain a shooting parameter, and the shooting parameter is used by the camera device to obtain the close-up video picture; the first conference terminal receives the close-up video picture from the camera device.
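A minimal sketch of deriving imaging parameters from the position information is given below, under simple pinhole-style geometric assumptions; the parameterization (pan, tilt, zoom) and all names are assumptions, not the patent's actual camera protocol.

```python
# A sketch of computing PTZ-style imaging parameters from the object's
# position, with the camera assumed at the origin. Geometry is illustrative.
import math
from dataclasses import dataclass

@dataclass
class CameraParams:
    pan_deg: float
    tilt_deg: float
    zoom: float

def params_from_position(x: float, y: float, z: float,
                         target_width_m: float = 0.6) -> CameraParams:
    """Position (x, y, z) of the object's head in metres, camera at origin."""
    distance = math.sqrt(x * x + y * y + z * z)
    pan = math.degrees(math.atan2(x, z))
    tilt = math.degrees(math.atan2(y, math.hypot(x, z)))
    zoom = max(1.0, distance / target_width_m)  # enlarge until the head fills the frame
    return CameraParams(pan, tilt, zoom)

print(params_from_position(1.0, 0.3, 4.0))
```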
Steps 4071 to 4072 are optional steps; the first conference terminal may also acquire the close-up video picture of the first object without relying on the position information of the first object. In one possible implementation, the first conference terminal acquires a video picture of the first conference site that includes all objects in the first conference site, and intercepts the close-up video picture of the first object from this video picture.
The specific implementation of step 4073 may be as follows:
in a possible implementation manner, the second video frame is a close-up video frame of the first object.
In another possible implementation manner, the first conference terminal generates the second video picture according to the first video picture and the close-up video picture. For example, the first video frame and the close-up video frame may be spliced to obtain the second video frame, or the close-up video frame may be embedded in the first video frame.
For example, if the first video picture is a close-up video picture of Li Si, it may be spliced with the close-up video picture of Zhang San to obtain the second video picture, so that the second video picture includes both Zhang San and Li Si.
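The two composition options of step 4073, splicing and picture-in-picture embedding, can be sketched as follows; frames are represented as numpy arrays, and all sizes and names are illustrative assumptions.

```python
# A sketch of generating the second video picture from the first video
# picture and the close-up: splice side by side, or embed as a sub-picture.
import numpy as np

def splice(first: np.ndarray, close_up: np.ndarray) -> np.ndarray:
    h = min(first.shape[0], close_up.shape[0])
    return np.hstack([first[:h], close_up[:h]])

def embed(first: np.ndarray, close_up: np.ndarray, pos=(20, 20)) -> np.ndarray:
    out = first.copy()
    y, x = pos
    h, w = close_up.shape[:2]
    out[y:y + h, x:x + w] = close_up   # overlay the close-up inside the frame
    return out

first = np.zeros((720, 1280, 3), dtype=np.uint8)
close_up = np.full((180, 320, 3), 255, dtype=np.uint8)
second_spliced = splice(first, close_up)   # (180, 1600, 3)
second_pip = embed(first, close_up)        # (720, 1280, 3)
```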
Step 4073 is an optional step and the generation of the second video picture may not be based on the close-up video picture of the first object. For example, the first conference terminal may further obtain a video frame of the first conference site, where the video frame of the first conference site includes all objects of the first conference site, and the second video frame is a video frame of the first conference site.
In step 408, the first conference terminal sends the video stream of the second video frame to the media server, and the media server may send the video stream of the second video frame to the second conference terminal.
Step 408 is an optional step, for example, the first conference terminal may not send the video stream of the second video picture to the second conference terminal through the media server, and the first conference terminal may send the video stream of the second video picture directly to the second conference terminal.
The second conference terminal in step 409 is located at a second conference site, and a fourth video frame includes a second object in the second conference site, where the fourth video frame is a video frame corresponding to a video stream from the second conference terminal.
In one possible scenario, the first conference terminal may be set as the master terminal and the second conference terminal as a slave terminal. When the identification of the second object is extracted from the media stream, the second conference terminal may acquire a video stream of a close-up video picture of the second object, and the guide picture of the second conference site is switched to this close-up video picture. It should be noted that the second conference terminal may extract the identification of the second object from the media stream by itself and acquire the video stream of the close-up video picture of the second object. Alternatively, the second conference terminal may receive the identification of the second object from the first conference terminal, so that the first conference terminal at the master end can, through this identification, instruct the second conference terminal to acquire the video stream of the close-up video picture of the second object.
It should be noted that in step 409 the first conference terminal may also obtain video streams from other conference terminals. Step 409 describes only the video stream of the fourth video picture from the second conference terminal; the video streams that the first conference terminal acquires from other conference terminals are not limited. For example, the first conference terminal may also obtain a video stream of a fifth video picture from a third conference terminal, where the fifth video picture includes a third object, and the third object and the third conference terminal are located at a third conference site.
The specific implementation of step 410 may be as follows:
the first conference terminal generates a third video picture according to the video stream of the second video picture and the video stream of a fourth video picture from the second conference terminal, wherein the third video picture comprises the first object and the second object.
For example, the first conference terminal may splice the second video frame and the fourth video frame to obtain the third video frame.
It should be noted that, step 410 only takes the video stream of the second video frame and the video stream of the fourth video frame as an example to generate the third video frame. The generation of the third video picture may also be based on video streams of video pictures of other more conference terminals. For example, the first conference terminal may further generate a third video picture according to the video stream of the second video picture, the video stream of the fourth video picture, and the video stream of the fifth video picture.
The layout of the third video frame is not limited herein. In a possible implementation manner, the third video picture includes a plurality of sub-pictures, where the plurality of sub-pictures includes a first sub-picture and a second sub-picture, and the first sub-picture and the second sub-picture include the first object and the second object, respectively; the sub-pictures are arranged in rows or columns, and the distances between two adjacent sub-pictures are the same.
The distance between sub-pictures can be set the same as the distance between a sub-picture and the edge of the third video picture, achieving an equal-division effect. The first sub-picture may be a reduced picture of the second video picture, the second sub-picture a reduced picture of the fourth video picture, and the third sub-picture a reduced picture of the fifth video picture. Fig. 4A shows the display effect of the third video picture with the sub-pictures arranged in a row; in fig. 4A, the third video picture includes a first sub-picture, a second sub-picture, and a third sub-picture. In another possible implementation, the distances between the objects in the third video picture may be set the same, and may also be set the same as the distance between an object and the edge of the third video picture. Fig. 4B shows the display effect with the objects in the third video picture arranged in a row; Li Si and Zhang San are the objects in the first sub-picture, Wang Wu and Zhao Liu are the objects in the second sub-picture, and Xiao Hong and Xiao Ming are the objects in the third sub-picture.
In the above implementations, the sub-pictures may alternatively not be arranged in rows or columns, and the distance between adjacent sub-pictures need not be equal; the layout may be set according to actual requirements.
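The equal-spacing row layout described above can be computed with a short sketch; the widths below are illustrative assumptions.

```python
# A sketch of the equal-spacing row layout: n sub-pictures in one row, with
# the gap between neighbours equal to the gap to the picture edges.
def row_layout(total_width: int, sub_width: int, n: int) -> list:
    """Return the x offset of each sub-picture."""
    gap = (total_width - n * sub_width) // (n + 1)
    if gap < 0:
        raise ValueError("sub-pictures do not fit in the third video picture")
    return [gap + i * (sub_width + gap) for i in range(n)]

# Three 560-px sub-pictures in a 1920-px third video picture:
print(row_layout(1920, 560, 3))  # [60, 680, 1300]
```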
Steps 409 to 410 are optional steps and describe only one case of transmitting video streams through a server. Transmitting video streams through a server can be divided into several cases, specifically as follows:
steps 409 to 410 may correspond to a full exchange mode of the server:
the server acquires the video stream from the first conference terminal, encodes or decodes it, and transmits the encoded or decoded video stream to the second conference terminal. The server acquires the video stream from the second conference terminal, encodes or decodes it, and transmits the encoded or decoded video stream to the first conference terminal. The combined guide picture in the full switching mode may correspond to the third video picture in fig. 1.
In this mode, each conference terminal may generate its own guide picture as required according to its configured director policy; for example, after acquiring the video stream from the second conference terminal, the first conference terminal may generate the third video picture. After the second conference terminal receives the video stream of the second video picture, since the object corresponding to the identification in the media stream is not an object in the second conference site, the second conference terminal may use the second video picture as the guide picture of the second conference site, so that the first object appears in the guide picture of the second conference site.
It should be noted that, in the above process, only the server forwards the encoded video stream or the decoded video stream is taken as an example, and the server may directly forward the obtained video stream. For example, the server obtains a video stream from the first conference terminal and sends the video stream from the first conference terminal to the second conference terminal.
The server may also be in a full adaptation mode:
the server acquires the video stream of the second video picture and the video stream from the second conference terminal, and generates the video stream of a third video picture according to the video stream of the second video picture and the video stream of a fourth video picture from the second conference terminal; the server then sends the video stream of the third video picture to the first conference terminal, the second conference terminal, and the third conference terminal, as shown in fig. 5. The third video picture in fig. 5 is generated by the media server 30; specifically, the media server 30 may generate the third video picture from the video stream of the second video picture, the video stream of the fourth video picture, and the video stream of the fifth video picture.
The server may also be in a mixed mode, and may forward the acquired video stream to a part of the conference terminals and send the video stream of the third video frame to another part of the conference terminals.
The following describes the director control method provided by this application with reference to fig. 6. The director control system shown in fig. 6 may include a first conference terminal, a second conference terminal, and a third conference terminal, located at a first conference site, a second conference site, and a third conference site respectively. The second object is located at the second conference site, and the third object is located at the third conference site. The director policy of each conference terminal is as follows: if the object corresponding to the identification extracted from the media stream is located in the conference site where the conference terminal is located, the guide picture of that conference site includes the mentioned object; if no media stream is acquired, or the object corresponding to the extracted identification is not in that conference site, the guide picture of that conference site may be played in a default mode, for example as a close-up video picture of a default object. For example, when the media stream is a voice stream, the guide picture of each conference terminal can be controlled by voice; when there is no voice control, the conference terminal plays in the default mode.
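The director policy just described reduces to a simple decision rule, sketched below; the function and all names are illustrative assumptions rather than the patent's implementation.

```python
# A sketch of the director policy: show the mentioned object if it is in this
# venue, otherwise fall back to the default picture.
from typing import Optional

def choose_guide_object(mentioned: Optional[str],
                        venue_objects: set,
                        default_object: str) -> str:
    if mentioned is not None and mentioned in venue_objects:
        return mentioned       # key object: the object mentioned in the media stream
    return default_object      # e.g. a close-up of the venue's default object

print(choose_guide_object("Xiao Hong", {"Zhao Liu", "Xiao Hong"}, "Zhao Liu"))  # Xiao Hong
print(choose_guide_object(None, {"Zhao Liu", "Xiao Hong"}, "Zhao Liu"))         # Zhao Liu
```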
Step 601: the first conference terminal sends a media stream to the server.
For example, the media stream is a voice stream obtained when the presenter Li Si in the first conference site says "Xiao Hong, Xiao Ming, your opinions, please". The second object (Xiao Hong) is located at the second conference site, and the third object (Xiao Ming) is located at the third conference site.
Step 602: and the server sends the media streams to the second conference terminal and the third conference terminal.
Step 603: the first conference terminal generates a second video picture.
The second video picture includes the first object (Zhang San, corresponding to biometric information 1). For example, the second video picture includes Zhang San and Li Si.
Step 604: the second conference terminal generates a fourth video picture.
The fourth video picture includes the second object (Xiao Hong). For example, the fourth video picture includes Zhao Liu and Xiao Hong.
Step 605: the third conference terminal generates a fifth video picture.
The fifth video picture includes the third object (Xiao Ming). For example, the fifth video picture includes Xiao Liang and Xiao Ming.
It should be noted that the order of steps 603, 604, and 605 is not limited. In addition, steps 604 and 605 are similar in substance to the process by which the first conference terminal generates the second video picture; for the specific process, refer to the flow shown in fig. 3B, which is not repeated here.
Step 606: the server acquires the video stream of the second video picture, the video stream of the fourth video picture and the video stream of the fifth video picture.
Step 607: the server generates a video stream of the third video picture.
The third video picture includes a first object, a second object, and a third object. For example, the third video picture includes a reduced picture of the second video picture, a reduced picture of the fourth video picture, and a reduced picture of the fifth video picture.
Step 608: the server sends video streams of the third video pictures to the first conference terminal, the second conference terminal and the third conference terminal.
Steps 607 to 608 correspond to the full adaptation mode of the server, and steps 607 to 608 are optional steps. For example, the server may also employ the full switching mode; for the specific steps, refer to the description of the full switching mode above.
The following describes the director control method provided by this application with reference to fig. 7, taking an award ceremony scenario as an example. Fig. 7 may be applied to the director control system corresponding to fig. 6.
Step 701: the first conference terminal obtains the media stream.
The chairman is located in the first conference site, and the media stream is a voice stream of "Congratulations to Li Si (the first object), Xiao Hong (the second object), and Xiao Ming (the third object) on winning the outstanding employee award".
Step 702: the first conference terminal sends media streams to the second conference terminal and the third conference terminal.
Step 703: the first conference terminal generates a close-up video picture of Li Si.
Step 704: the second conference terminal generates a close-up video picture of Xiao Hong.
Step 705: the third conference terminal generates a close-up video picture of Xiao Ming.
It should be noted that the order of steps 703, 704, and 705 is not limited. In addition, steps 704 and 705 are similar in substance to the process by which the first conference terminal generates its close-up video picture; for the specific process, refer to the flow shown in fig. 3B, which is not repeated here.
Step 706: the server obtains the video stream of the close-up video picture of Li Si, the video stream of the close-up video picture of Xiao Hong, and the video stream of the close-up video picture of Xiao Ming.
Step 707: the server generates a video stream of combined video pictures.
The combined video picture includes a reduced picture of the close-up video picture of Li Si, a reduced picture of the close-up video picture of Xiao Hong, and a reduced picture of the close-up video picture of Xiao Ming, as shown in fig. 8.
Step 708: the server transmits the video stream of the combined video picture to the first conference terminal, the second conference terminal and the third conference terminal.
Steps 707 to 708 correspond to the full adaptation mode of the server, and steps 707 to 708 are optional steps. For example, the server may also employ the full switching mode; for the specific steps, refer to the description of the full switching mode above.
Fig. 9 is a schematic structural diagram of a director control apparatus provided by an embodiment of this application. The apparatus is a first conference terminal, and the apparatus is located at a first conference site. The apparatus comprises: an acquisition module 901, configured to acquire a media stream and to acquire an identification of a first object, wherein the identification is extracted from the media stream; and a processing module 902, configured to generate a second video picture if the first object is an object in the first conference site and a first video picture does not include the first object, wherein the second video picture includes the first object, and the first video picture and the second video picture are guide pictures of the first conference site.
In one possible design, the identification of the first object is in a stored set of object information, the set of object information comprising information for a plurality of objects, the information for each of the objects comprising the identification of the object.
In one possible design, the first conference site is any one of a plurality of conference sites, the plurality of conference sites are conference sites of a conference to which the first conference terminal is connected, and the object information set includes information of objects in the plurality of conference sites.
In one possible design, the media stream is a video stream and/or a voice stream, the identification of the first object is determined according to an image recognition result and/or a voice recognition result, the image recognition result is obtained by performing image recognition on a video picture corresponding to the video stream, and the voice recognition result is obtained by performing voice recognition on the voice stream.
In one possible design, the acquisition module 901 is further configured to: acquiring a close-up video picture of the first object; the processing module 902 is specifically configured to: and generating the second video picture according to the close-up video picture.
In one possible design, the obtaining module 901 is specifically configured to: acquiring position information of the first object at the first meeting place; and the first conference terminal acquires the close-up video picture according to the position information.
In one possible design, the processing module 902 is further configured to: performing target recognition on the video picture of the first meeting place to obtain an object set in the first meeting place; and determining that the first object is in the first conference place according to the object set in the first conference place.
In one possible design, the processing module 902 is specifically configured to: if the unique object corresponding to the identifier is determined to be the first object and the first object is in the object set in the first meeting place, determining that the first object is in the first meeting place; or if the identification corresponds to a plurality of objects, wherein the objects comprise the first object and the first object is in an object set in the first meeting place, acquiring target biological characteristic information corresponding to the identification; and if the target biological characteristic information is matched with the biological characteristic information of the first object, determining that the first object is in the first meeting place.
In one possible design, the acquisition module 901 is further configured to acquire a video stream from a second conference terminal, where the second conference terminal is located at a second conference site and a picture of the video stream from the second conference terminal includes a second object in the second conference site; and the processing module 902 is further configured to generate a third video picture according to the video stream of the second video picture and the video stream from the second conference terminal, where the third video picture includes the first object and the second object.
In one possible design, the third video picture includes a plurality of sub-pictures including a first sub-picture and a second sub-picture, the first sub-picture and the second sub-picture including the first object and the second object, respectively; the sub-pictures are arranged in rows or columns, and the distances between two adjacent sub-pictures are the same.
The embodiment of the application also provides an electronic device, which may have a structure as shown in fig. 10, and may be a computer device or a chip system capable of supporting the computer device to implement the method.
The electronic device shown in fig. 10 may comprise at least one processor 1001, where the at least one processor 1001 is configured to couple to a memory and to read and execute instructions in the memory, to implement the steps of the director control method provided in the embodiments of this application. Optionally, the electronic device may further include a communication interface 1002 for supporting the electronic device in receiving or transmitting signaling or data; the communication interface 1002 may be used to enable interaction with other electronic devices. The processor 1001 may be used to enable the electronic device to perform the steps in the methods shown in figs. 3B and 6-7. Optionally, the electronic device may further comprise a memory 1003 in which computer instructions are stored; the memory 1003 may be coupled to the processor 1001 and/or the communication interface 1002 to support the processor 1001 in calling the computer instructions in the memory 1003 to implement the steps in the methods shown in figs. 3B and 6-7. In addition, the memory 1003 may be used to store data related to the method embodiments of this application, for example the data and instructions necessary to support interaction by the communication interface 1002, and/or configuration information necessary for the electronic device to perform the methods of the embodiments of this application.
Embodiments of the present application also provide a computer readable storage medium, where computer instructions are stored, where the computer instructions, when executed by a computer, may cause the computer to perform the method involved in any one of the possible designs of the method embodiments and the method embodiments described above. In the embodiment of the present application, the computer readable storage medium is not limited, and may be, for example, RAM (random-access memory), ROM (read-only memory), or the like.
The present application also provides a chip, which may include a processor coupled to a memory for reading and executing a software program stored in the memory for performing the method involved in any one of the possible implementations of the method embodiments, the method embodiments described above, wherein "coupled" means that the two components are directly or indirectly combined with each other, which combination may be fixed or movable.
The present application also provides a computer program product which, when read and executed by a computer, causes the computer to perform the method as referred to in any one of the possible implementations of the method embodiments described above.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, optical fiber), or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The steps of a method or algorithm described in the embodiments of the present application may be embodied directly in hardware, in a software element executed by a processor, or in a combination of the two. The software elements may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. In an example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may reside in a terminal device. In the alternative, the processor and the storage medium may reside in different components in a terminal device.
These computer instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (27)

1. A director control method, applied to a first conference terminal, wherein the first conference terminal is located at a first conference site, the method comprising:
the first conference terminal acquires a media stream;
the first conference terminal obtains an identifier of a first object, wherein the identifier is extracted from the media stream;
if the first object is an object in the first conference site and the first object is not included in a first video picture, the first conference terminal generates a second video picture, wherein the second video picture includes the first object, and the first video picture and the second video picture are guide pictures of the first conference site.
2. The method of claim 1, wherein the identification of the first object is in a stored set of object information, the set of object information comprising information for a plurality of objects, the information for each of the objects comprising the identification of the object.
3. The method of claim 2, wherein the first conference site is any one of a plurality of conference sites, the plurality of conference sites being conference sites of a conference to which the first conference terminal is connected, and the object information set comprises information of objects in the plurality of conference sites.
4. A method according to any one of claims 1 to 3, wherein the media stream is a video stream and/or a voice stream, and the identification of the first object is determined according to an image recognition result and/or a voice recognition result, the image recognition result is obtained by performing image recognition on a video picture corresponding to the video stream, and the voice recognition result is obtained by performing voice recognition on the voice stream.
5. The method of any one of claims 1 to 4, wherein the first conference terminal generating a second video picture comprises:
the first conference terminal acquires a close-up video picture of the first object;
and the first conference terminal generates the second video picture according to the close-up video picture.
6. The method of claim 5, wherein the method further comprises:
the first conference terminal obtains the position information of the first object at the first conference site;
the first conference terminal obtaining a close-up video picture of the first object includes:
and the first conference terminal acquires the close-up video picture according to the position information.
7. The method of any one of claims 1 to 6, further comprising:
the first conference terminal performs target recognition on the video picture of the first conference site to obtain an object set in the first conference site;
and the first conference terminal determines, according to the object set in the first conference site, that the first object is in the first conference site.
8. The method of claim 7, wherein the first conference terminal determining, according to the object set in the first conference site, that the first object is in the first conference site comprises:
if the first conference terminal determines that the unique object corresponding to the identifier is the first object and the first object is in the object set in the first conference site, determining that the first object is in the first conference site; or
if the first conference terminal determines that the identifier corresponds to a plurality of objects, wherein the plurality of objects include the first object and the first object is in the object set in the first conference site, acquiring target biometric information corresponding to the identifier; and if the first conference terminal determines that the target biometric information matches the biometric information of the first object, determining that the first object is in the first conference site.
9. The method of any one of claims 1 to 8, further comprising:
the first conference terminal acquires a video stream from a second conference terminal, wherein the second conference terminal is located at a second conference site, and a picture of the video stream from the second conference terminal includes a second object in the second conference site;
and the first conference terminal generates a third video picture according to a video stream of the second video picture and the video stream from the second conference terminal, wherein the third video picture includes the first object and the second object.
10. The method of claim 9, wherein the third video picture comprises a plurality of sub-pictures, the plurality of sub-pictures comprising a first sub-picture and a second sub-picture, wherein the first sub-picture and the second sub-picture include the first object and the second object, respectively; and the sub-pictures are arranged in a row or a column, and the spacing between every two adjacent sub-pictures is the same.
11. A broadcast-directing control system, comprising a first conference terminal and a second conference terminal, wherein the first conference terminal is located at a first conference site, and the second conference terminal is located at a second conference site;
the first conference terminal is configured to acquire a media stream;
the first conference terminal is further configured to obtain an identifier of a first object, wherein the identifier is extracted from the media stream; and if the first object is an object in the first conference site and the first object is not included in a first video picture, generate a second video picture, wherein the second video picture includes the first object, and the first video picture and the second video picture are guide pictures of the first conference site;
and the second conference terminal is configured to acquire a video stream of the second video picture.
12. The system of claim 11, wherein the system further comprises a server;
the server is configured to acquire the video stream of the second video picture and a video stream from the second conference terminal, wherein a picture of the video stream from the second conference terminal includes a second object in the second conference site; and generate a third video picture according to the video stream of the second video picture and the video stream from the second conference terminal, wherein the third video picture includes the first object and the second object.
13. The system of claim 11 or 12, wherein the system further comprises a server;
the server is further configured to acquire a video stream from the first conference terminal, encode or decode the video stream from the first conference terminal, and send the encoded or decoded video stream to the second conference terminal; and/or
the server is further configured to acquire a video stream from the second conference terminal, encode or decode the video stream from the second conference terminal, and send the encoded or decoded video stream to the first conference terminal.
14. The system of any one of claims 11 to 13, wherein the system further comprises a server;
the server is configured to acquire the media stream and extract the identifier of the first object from the media stream; determine that the identifier of the first object is in a stored object information set, wherein the object information set comprises information of a plurality of objects, and the information of each object comprises an identifier of the object; and send indication information to the first conference terminal, wherein the indication information indicates that the identifier extracted from the media stream is the identifier of the first object and that the identifier of the first object is in the object information set.
15. The system of any one of claims 11 to 14, wherein the first conference terminal is a master terminal and the second conference terminal is a slave terminal;
the second conference terminal is further configured to acquire the media stream and an identifier of a second object, wherein the identifier of the second object is extracted from the media stream, and the second object is an object in the second conference site; and acquire a video stream of a close-up video picture of the second object, and send the video stream of the close-up video picture of the second object to the first conference terminal.
16. A broadcast-directing control apparatus, wherein the apparatus is a first conference terminal located at a first conference site, the apparatus comprising:
an acquisition module, configured to acquire a media stream, and acquire an identifier of a first object, wherein the identifier is extracted from the media stream;
and a processing module, configured to generate a second video picture if the first object is an object in the first conference site and the first object is not included in a first video picture, wherein the second video picture includes the first object, and the first video picture and the second video picture are guide pictures of the first conference site.
17. The apparatus of claim 16, wherein the identifier of the first object is in a stored object information set, the object information set comprising information of a plurality of objects, and the information of each object comprising an identifier of the object.
18. The apparatus of claim 17, wherein the first conference site is any one of a plurality of conference sites, the plurality of conference sites are sites of a conference that the first conference terminal has joined, and the object information set comprises information of objects in the plurality of conference sites.
19. The apparatus of any one of claims 16 to 18, wherein the media stream is a video stream and/or a voice stream, and the identifier of the first object is determined according to an image recognition result and/or a voice recognition result, wherein the image recognition result is obtained by performing image recognition on a video picture corresponding to the video stream, and the voice recognition result is obtained by performing voice recognition on the voice stream.
20. The apparatus of any one of claims 16 to 19, wherein the acquisition module is further configured to:
acquire a close-up video picture of the first object;
and the processing module is specifically configured to:
generate the second video picture according to the close-up video picture.
21. The apparatus of claim 20, wherein the acquisition module is specifically configured to:
acquire position information of the first object in the first conference site;
and acquire the close-up video picture according to the position information.
22. The apparatus of any one of claims 16 to 21, wherein the processing module is further configured to:
perform target recognition on the video picture of the first conference site to obtain an object set in the first conference site; and determine, according to the object set in the first conference site, that the first object is in the first conference site.
23. The apparatus of claim 22, wherein the processing module is specifically configured to:
if it is determined that the unique object corresponding to the identifier is the first object and the first object is in the object set in the first conference site, determine that the first object is in the first conference site; or
if the identifier corresponds to a plurality of objects, wherein the plurality of objects include the first object and the first object is in the object set in the first conference site, acquire target biometric information corresponding to the identifier; and if the target biometric information matches the biometric information of the first object, determine that the first object is in the first conference site.
24. The apparatus of any one of claims 16 to 23, wherein the acquisition module is further configured to:
acquire a video stream from a second conference terminal, wherein the second conference terminal is located at a second conference site, and a picture of the video stream from the second conference terminal includes a second object in the second conference site;
and the processing module is further configured to generate a third video picture according to a video stream of the second video picture and the video stream from the second conference terminal, wherein the third video picture includes the first object and the second object.
25. The apparatus of claim 24, wherein the third video picture comprises a plurality of sub-pictures, the plurality of sub-pictures comprising a first sub-picture and a second sub-picture, wherein the first sub-picture and the second sub-picture include the first object and the second object, respectively; and the sub-pictures are arranged in a row or a column, and the spacing between every two adjacent sub-pictures is the same.
26. An electronic device, comprising: one or more processors; and one or more memories, wherein the one or more memories store one or more computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1 to 10.
27. A computer-readable storage medium, comprising computer instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10.
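To make the claimed condition concrete, here is a minimal Python sketch of the decision in claim 1, assuming the terminal tracks objects by string identifiers; all names are hypothetical and not taken from the disclosure:

```python
def maybe_generate_second_picture(extracted_id: str,
                                  site_objects: set[str],
                                  first_picture_objects: set[str]) -> set[str] | None:
    """Claim-1 condition: the identified object is at the first conference
    site but missing from the first video picture, so a second guide
    picture containing that object is generated."""
    if extracted_id in site_objects and extracted_id not in first_picture_objects:
        return {extracted_id}  # second video picture featuring the first object
    return None

# Example: "alice" was named in the media stream and is present at the site,
# but the current guide picture only shows "bob", so a new picture is produced.
assert maybe_generate_second_picture("alice", {"alice", "bob"}, {"bob"}) == {"alice"}
```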
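Claim 4 lets the identifier come from image recognition, voice recognition, or both. A sketch of one possible fusion, assuming hypothetical inputs (recognize_face, roster, transcript) supplied by whatever recognition back end the terminal integrates:

```python
from typing import Callable, Iterable, Optional

def extract_identifier(frame: bytes,
                       transcript: str,
                       recognize_face: Callable[[bytes], Optional[str]],
                       roster: Iterable[str]) -> Optional[str]:
    """Fuse image and voice recognition into one object identifier.

    recognize_face maps a decoded video picture to an identifier (or None);
    transcript is the voice-recognition result for the voice stream. Both
    are assumptions about the surrounding system, not a fixed API.
    """
    identifier = recognize_face(frame)      # image recognition result
    if identifier:
        return identifier
    for candidate in roster:                # voice recognition result:
        if candidate in transcript:         # e.g. "let's hear from Alice"
            return candidate
    return None
```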
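Claim 6 turns position information into a close-up. One plausible reading, sketched with an assumed single pan-tilt-zoom camera at the origin facing along +z; the geometry is illustrative only:

```python
import math

def aim_for_closeup(x_m: float, z_m: float, closeup_fov_deg: float = 8.0) -> dict:
    """Derive camera parameters for a close-up from position information.

    x_m / z_m: the first object's lateral and forward offsets in metres,
    relative to the assumed camera position. Returns a pan angle and a
    narrow field of view standing in for the zoom level.
    """
    pan_deg = math.degrees(math.atan2(x_m, z_m))  # rotate toward the object
    return {"pan_deg": round(pan_deg, 1), "fov_deg": closeup_fov_deg}

# An object 1 m to the right and 3 m ahead needs roughly an 18-degree pan.
print(aim_for_closeup(1.0, 3.0))  # {'pan_deg': 18.4, 'fov_deg': 8.0}
```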
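Claim 8 falls back to biometric matching when one identifier maps to several objects. A sketch assuming hypothetical helpers fetch_biometric (stored template lookup) and matches (comparison metric):

```python
from typing import Callable, Optional

def confirm_first_object(identifier: str,
                         candidates: list[str],
                         site_objects: set[str],
                         fetch_biometric: Callable[[str], bytes],
                         matches: Callable[[bytes, str], bool]) -> Optional[str]:
    """Decide whether the first object is in the first conference site.

    Single candidate: presence in the site's object set suffices.
    Multiple candidates: the target biometric information stored for the
    identifier must also match the candidate's own biometric information.
    """
    if len(candidates) == 1:
        return candidates[0] if candidates[0] in site_objects else None
    template = fetch_biometric(identifier)       # target biometric information
    for obj in candidates:
        if obj in site_objects and matches(template, obj):
            return obj
    return None
```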
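The equal-spacing layout of claims 10 and 25 can be computed by solving for a uniform gap. The sketch below additionally makes the margins to the canvas edges equal, which is one permissible reading of the constraint:

```python
def layout_row(n: int, canvas_w: int, sub_w: int) -> list[int]:
    """Return the x offset of each of n sub-pictures laid out in one row.

    The gap between adjacent sub-pictures (and, in this variant, between
    the outer sub-pictures and the canvas edges) is identical.
    """
    gap = (canvas_w - n * sub_w) / (n + 1)
    if gap < 0:
        raise ValueError("sub-pictures do not fit on the canvas")
    return [round(gap * (i + 1) + sub_w * i) for i in range(n)]

# Three 480-px sub-pictures on a 1920-px canvas: offsets 120, 720, 1320,
# i.e. a constant 120-px gap between adjacent sub-pictures.
print(layout_row(3, 1920, 480))
```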
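In the master/slave arrangement of claim 15, the slave-site terminal extracts identifiers itself and forwards a close-up stream to the master. A loop sketched with hypothetical capture and transport helpers:

```python
from typing import Callable, Iterable

def slave_terminal_loop(extracted_ids: Iterable[str],
                        site_objects: set[str],
                        capture_closeup: Callable[[str], bytes],
                        send_to_master: Callable[[bytes], None]) -> None:
    """Second (slave) conference terminal behaviour per claim 15.

    extracted_ids: identifiers pulled from the media stream;
    capture_closeup / send_to_master stand in for the camera pipeline and
    the link to the first (master) conference terminal.
    """
    for identifier in extracted_ids:
        if identifier in site_objects:            # the second object is here
            stream = capture_closeup(identifier)  # close-up video picture
            send_to_master(stream)                # forward to the master site
```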
CN202210979362.4A 2022-08-16 2022-08-16 Guide broadcast control method and device Pending CN117640861A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210979362.4A CN117640861A (en) 2022-08-16 2022-08-16 Guide broadcast control method and device
PCT/CN2023/082003 WO2024036945A1 (en) 2022-08-16 2023-03-16 Broadcast-directing control method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210979362.4A CN117640861A (en) 2022-08-16 2022-08-16 Guide broadcast control method and device

Publications (1)

Publication Number Publication Date
CN117640861A (en) 2024-03-01

Family

ID=89940520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210979362.4A Pending CN117640861A (en) 2022-08-16 2022-08-16 Guide broadcast control method and device

Country Status (2)

Country Link
CN (1) CN117640861A (en)
WO (1) WO2024036945A1 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5201050B2 (en) * 2009-03-27 2013-06-05 ブラザー工業株式会社 Conference support device, conference support method, conference system, conference support program
JP5223824B2 (en) * 2009-09-15 2013-06-26 コニカミノルタビジネステクノロジーズ株式会社 Image transmission apparatus, image transmission method, and image transmission program
US8917847B2 (en) * 2012-06-12 2014-12-23 Cisco Technology, Inc. Monitoring and notification mechanism for participants in a breakout session in an online meeting
US10468051B2 (en) * 2015-05-09 2019-11-05 Sugarcrm Inc. Meeting assistant
JP7046546B2 (en) * 2017-09-28 2022-04-04 株式会社野村総合研究所 Conference support system and conference support program
US20190251961A1 (en) * 2018-02-15 2019-08-15 Lenovo (Singapore) Pte. Ltd. Transcription of audio communication to identify command to device
JP7152191B2 (en) * 2018-05-30 2022-10-12 シャープ株式会社 Operation support device, operation support system, and operation support method
KR20190095181A (en) * 2019-07-25 2019-08-14 엘지전자 주식회사 Video conference system using artificial intelligence
KR20220061763A (en) * 2020-11-06 2022-05-13 삼성전자주식회사 Electronic device providing video conference and method for providing video conference thereof
CN113596601A (en) * 2021-01-19 2021-11-02 腾讯科技(深圳)有限公司 Video picture positioning method, related device, equipment and storage medium

Also Published As

Publication number Publication date
WO2024036945A1 (en) 2024-02-22

Similar Documents

Publication Publication Date Title
US9641585B2 (en) Automated video editing based on activity in video conference
US9774896B2 (en) Network synchronized camera settings
CN105657329B (en) Video conferencing system, processing unit and video-meeting method
US8154578B2 (en) Multi-camera residential communication system
US8274544B2 (en) Automated videography systems
US8063929B2 (en) Managing scene transitions for video communication
US8253770B2 (en) Residential video communication system
CN109413359B (en) Camera tracking method, device and equipment
US20100245532A1 (en) Automated videography based communications
US11076127B1 (en) System and method for automatically framing conversations in a meeting or a video conference
US20140063176A1 (en) Adjusting video layout
EP2352290A1 (en) Method and apparatus for matching audio and video signals during a videoconference
US20230283888A1 (en) Processing method and electronic device
Wu et al. MoVieUp: Automatic mobile video mashup
CN111246224A (en) Video live broadcast method and video live broadcast system
US20150092013A1 (en) Method and a device for transmitting at least a portion of a signal during a video conference session
JP2022054192A (en) Remote conference system, server, photography device, audio output method, and program
CN117640861A (en) Guide broadcast control method and device
US20220264156A1 (en) Context dependent focus in a video feed
KR101542416B1 (en) Method and apparatus for providing multi angle video broadcasting service
EP4075794A1 (en) Region of interest based adjustment of camera parameters in a teleconferencing environment
JP2017092675A (en) Information processing apparatus, conference system, information processing method, and program
KR20120126101A (en) Method for automatically tagging media content, media server and application server for realizing such a method
US12041347B2 (en) Autonomous video conferencing system with virtual director assistance
US20240119731A1 (en) Video framing based on tracked characteristics of meeting participants

Legal Events

Date Code Title Description
PB01 Publication