WO2019033968A1 - Camera tracking method, apparatus, and device

Camera tracking method, apparatus, and device (摄像跟踪方法、装置及设备)

Info

Publication number
WO2019033968A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaking
camera
time period
preset time
valid
Prior art date
Application number
PCT/CN2018/099340
Other languages
English (en)
French (fr)
Inventor
郑志伟 (Zhiwei Zheng)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP18846441.6A, granted as EP3657781B1
Priority to US 16/791,268, granted as US10873666B2


Classifications

    • H04M3/567 Multimedia conference systems
    • H04N7/15 Conference systems
    • G10L25/78 Detection of presence or absence of voice signals
    • H04N23/60 Control of cameras or camera modules
    • H04N23/611 Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • H04N23/67 Focus control based on electronic image sensor signals
    • H04N7/142 Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04N7/147 Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
    • H04N7/152 Multipoint control units therefor

Definitions

  • the present application relates to the field of camera technologies, and in particular, to a camera tracking method, apparatus, and device.
  • the application of video conferencing is becoming more and more extensive.
  • in a video conference, a navigation camera is usually set up in the conference room; the video information of the participants in the conference room is obtained in real time through the navigation camera, and the obtained video information is transmitted to the other party or parties to the conference.
  • during the conference, the navigation camera can use sound source localization technology to automatically switch its lens to the participant who is currently speaking.
  • specifically, a sound source detecting device locates the sound source, and the navigation camera is adjusted according to the direction of the sound, so that the navigation camera can capture the participant who is currently speaking.
  • however, every time the sound source changes, the lens of the navigation camera switches accordingly, resulting in frequent and unnecessary switching of the camera shot.
  • the present application provides a camera tracking method, apparatus, and device, which avoid frequent and unnecessary switching of the camera.
  • the present application provides a camera tracking method. The navigation system includes at least a first camera, a plurality of microphones (MICs), at least one navigation camera, and a guiding device, where the first camera is configured to collect local video, the plurality of MICs are configured to collect local audio, the navigation camera is configured to transmit the navigation video stream to other sites, and the guiding device is configured to control the navigation state of the navigation camera.
  • the guiding device determines the historical speaking information in a preset time period according to the first video information collected by the first camera in the preset time period and the first audio information collected by the plurality of MICs in the preset time period; determines the current speaking object according to the second video information collected by the first camera at the current time and the second audio information collected by the plurality of MICs at the current time; and controls the navigation state of the at least one navigation camera according to the historical speaking information, the current speaking object, and the speaking object captured by the at least one navigation camera at the current time.
  • in this process, the guiding device controls the navigation state of the at least one navigation camera according to the historical speaking information in the preset time period, the current speaking object, and the speaking object captured by the at least one navigation camera at the current time. Because the historical speaking information can reflect information such as the speaking mode of the participants in the site and the importance of each participant, the guiding device can control the camera more accurately and avoid frequent and unnecessary switching of the camera.
  • optionally, the navigation state of the navigation camera may include an imaging angle or a focal length.
  • the at least one navigation camera includes a first navigation camera and a second navigation camera.
  • the guiding device may control the navigation state of the at least one navigation camera in the following feasible implementation:
  • if the current speaking object is the same as the speaking object captured by the first navigation camera at the current time, the guiding device keeps the imaging angle and the focal length of the first navigation camera unchanged, and the navigation video captured by the first navigation camera continues to be sent to the other sites at the current time.
  • if the current speaking object is different from the speaking object captured by the first navigation camera at the current time, the guiding device adjusts the imaging angle or focal length of at least one of the first navigation camera and the second navigation camera according to the historical speaking information.
  • optionally, the guiding device determines the speaking mode in the preset time period according to the historical speaking information, and adjusts the imaging angle or focal length of at least one of the first navigation camera and the second navigation camera according to the speaking mode in the preset time period, where the speaking mode includes at least one of a single-person speaking mode, a two-person debate mode, and a multi-person discussion mode.
  • the guiding device may determine the speaking mode in the preset time period according to the historical speaking information in the following feasible implementation:
  • the guiding device determines the number of valid speakers in the preset time period according to the historical speaking information.
  • specifically, the guiding device determines the number of valid utterances of each speaking object according to the priority of the speaking object in the preset time period and the duration of each of its utterances; a speaking object whose number of valid utterances is greater than or equal to 1 is determined to be a valid speaking object, and the number of valid speaking objects is determined as the number of valid speakers.
  • the guiding device determines the speaking mode within the preset time period according to the number of valid speakers within the preset time period.
  • when the number of valid speakers is 1, the guiding device determines that the speaking mode in the preset time period is the single-person speaking mode; when the number of valid speakers is 2 and the two valid speaking objects speak alternately, the guiding device determines that the speaking mode in the preset time period is the single-person speaking mode or the two-person debate mode; when the number of valid speakers is greater than 2, the guiding device determines the speaking mode in the preset time period to be the single-person speaking mode or the multi-person discussion mode according to the priorities of the at least two valid speaking objects in the preset time period.
  • for example, if the at least two valid speaking objects include an important speaking object, the guiding device determines that the speaking mode in the preset time period is the single-person speaking mode; if the at least two valid speaking objects do not include an important speaking object, the guiding device determines that the speaking mode in the preset time period is the multi-person discussion mode.
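The mode-selection logic described above can be sketched as follows. This is an illustrative Python sketch, not the patent's literal algorithm; the minimum-duration threshold, the data layout, and the function names are assumptions for illustration.

```python
# Illustrative sketch of classifying the speaking mode from historical
# speaking information. Threshold and field names are assumptions.

MIN_VALID_SECONDS = 2.0  # assumed minimum duration for a valid utterance

def count_valid_speakers(history):
    """history: dict mapping speaker id -> list of utterance durations (s).
    A speaker with at least one valid utterance is a valid speaking object."""
    return [s for s, durations in history.items()
            if sum(1 for d in durations if d >= MIN_VALID_SECONDS) >= 1]

def speaking_mode(history, important_speakers, alternating=False):
    """Return 'single', 'debate', or 'discussion' for the preset period."""
    valid = count_valid_speakers(history)
    if len(valid) <= 1:
        return "single"
    if len(valid) == 2:
        # two valid speakers taking turns suggests a two-person debate
        return "debate" if alternating else "single"
    # more than two valid speakers: an important speaker keeps the
    # single-person mode, otherwise treat it as an open discussion
    if any(s in important_speakers for s in valid):
        return "single"
    return "discussion"
```

For example, three speakers with no important speaker among them would yield the multi-person discussion mode, while the presence of an important speaker falls back to the single-person mode.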
  • optionally, the control process of the first navigation camera or the second navigation camera differs according to the speaking mode, and may include at least the following three feasible implementations:
  • first feasible implementation: the speaking mode is the single-person speaking mode.
  • the guiding device determines the target speaking object among the valid speaking objects in the preset time period, and adjusts the imaging angle or focal length of the second navigation camera so that the face image of the target speaking object is located at the imaging target position of the second navigation camera.
  • optionally, the target speaking object may be determined in the following feasible implementations: when the number of valid speakers in the preset time period is 1, the valid speaking object in the preset time period is determined as the target speaking object; when the number of valid speakers in the preset time period is 2, the target speaking object is determined from the two valid speaking objects according to their priorities; when the number of valid speakers in the preset time period is greater than 2, the important speaking object speaking in the preset time period is determined as the target speaking object.
  • second feasible implementation: the speaking mode is the two-person debate mode.
  • if the distance between the two valid speaking objects in the preset time period is less than a preset distance, the guiding device adjusts the imaging angle or focal length of the second navigation camera so that the face images corresponding to the two valid speaking objects are both located at the imaging target position of the second navigation camera; if the distance between the two valid speaking objects in the preset time period is greater than or equal to the preset distance, the guiding device adjusts the imaging angle or focal length of at least one of the first navigation camera and the second navigation camera so that the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first navigation camera and the face image corresponding to the other valid speaking object is located at the imaging target position of the second navigation camera.
  • third feasible implementation: the speaking mode is the multi-person discussion mode.
  • if the distance between the at least two valid speaking objects in the preset time period is less than the preset distance, the guiding device adjusts the imaging angle or focal length of the second navigation camera so that the face images corresponding to the at least two valid speaking objects are located at the imaging target position of the second navigation camera; if the distance between the at least two valid speaking objects in the preset time period is greater than or equal to the preset distance, the guiding device adjusts the imaging angle or focal length of the second navigation camera so that the second navigation camera captures panoramic video.
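The three feasible implementations can be summarized as a framing decision that depends on the speaking mode and on how far apart the valid speaking objects are. The sketch below is illustrative only; the threshold value, shot names, and function signature are assumptions, not part of the patent.

```python
# Hedged sketch of the three feasible implementations: given the speaking
# mode and the positions of the valid speaking objects, decide how the two
# navigation cameras are framed.

PRESET_DISTANCE = 1.5  # assumed separation threshold in meters

def plan_shots(mode, speakers, target=None):
    """speakers: list of (id, x_position_m) for the valid speaking objects.
    Returns a dict mapping camera name -> shot description."""
    if mode == "single":
        # single-person mode: close-up of the target speaking object
        return {"camera2": f"close-up:{target}"}
    span = max(x for _, x in speakers) - min(x for _, x in speakers)
    if mode == "debate":
        if span < PRESET_DISTANCE:
            # both debaters fit in one frame
            return {"camera2": "both-speakers"}
        # far apart: one close-up per camera
        a, b = speakers[0][0], speakers[1][0]
        return {"camera1": f"close-up:{a}", "camera2": f"close-up:{b}"}
    if mode == "discussion":
        # clustered group fits one shot; otherwise fall back to panorama
        return {"camera2": "group-shot" if span < PRESET_DISTANCE
                else "panorama"}
    raise ValueError(f"unknown mode: {mode}")
```

Keeping the first navigation camera untouched whenever possible is what avoids unnecessary shot switching for the far site.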
  • optionally, after the guiding device adjusts the imaging angle or focal length of the second navigation camera, the guiding device sends the video stream captured by the second navigation camera to the terminal device, so that the terminal device sends the video stream captured by the second navigation camera to the other sites.
  • optionally, after the guiding device adjusts the imaging angle or focal length of at least one of the first navigation camera and the second navigation camera, the guiding device sends the video stream captured by the first navigation camera and the video stream captured by the second navigation camera to the terminal device, so that the terminal device sends both video streams to the other sites.
  • optionally, the guiding device determines the historical speaking information in the preset time period according to the first video information collected by the first camera in the preset time period and the first audio information collected by the plurality of MICs in the preset time period as follows:
  • the guiding device determines the speaking object corresponding to each moment according to the video information and the audio information corresponding to that moment in the preset time period;
  • the guiding device collects statistics on the speaking objects corresponding to each moment to obtain the historical speaking information.
  • the historical speaking information includes at least one of the following: the number of speaking objects in the preset time period, the total speaking duration of each speaking object, the number of times each speaking object spoke, the content of each speaking object's speech, the duration of each speech, the time of each speech, and the priority of each speaking object.
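Aggregating the per-moment speaker detections into historical speaking information can be sketched as follows. The record layout, merging rule, and names are assumptions for illustration only.

```python
# Sketch of aggregating per-moment speaker detections into historical
# speaking information (times spoken, total duration, per-utterance spans,
# priority). Consecutive samples with the same speaker form one utterance.
from collections import defaultdict

def build_history(samples, priorities):
    """samples: list of (timestamp_s, speaker_id or None), one per moment.
    priorities: dict speaker_id -> priority (assumed given externally)."""
    utterances = defaultdict(list)  # speaker -> [(start, end), ...]
    current, start, last = None, None, None
    for t, spk in samples:
        if spk != current:
            if current is not None:
                utterances[current].append((start, last))
            current, start = spk, t
        last = t
    if current is not None:
        utterances[current].append((start, last))
    return {
        spk: {
            "times": len(spans),
            "total_duration": sum(e - s for s, e in spans),
            "utterances": spans,
            "priority": priorities.get(spk, 0),
        }
        for spk, spans in utterances.items()
    }
```

The resulting per-speaker records carry exactly the kinds of fields the bullet above lists (speech count, durations, times, priority), which the mode-determination step can then consume.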
  • optionally, determining the speaking object corresponding to a first moment according to the video information and the audio information at the first moment includes:
  • the guiding device determines a horizontal angle and a vertical angle corresponding to each face image according to the video information at the first moment;
  • the guiding device determines a horizontal angle and a vertical angle corresponding to the sound source at the first moment according to the audio information corresponding to the first moment;
  • the guiding device determines the speaking object corresponding to the first moment according to the horizontal angle and the vertical angle corresponding to each face image, and the horizontal angle and the vertical angle corresponding to the sound source.
  • the guiding device determines a horizontal angle and a vertical angle corresponding to each face image according to the video information at the first moment, including:
  • the guiding device acquires two-dimensional coordinates of each of the face images in the two camera lenses in the binocular camera according to the video information at the first moment;
  • the guiding device determines the depth of each face image according to the distance between the two camera lenses of the binocular camera and the two-dimensional coordinates of each face image in the two camera lenses, where the depth of a face image is the distance between the face and the binocular camera;
  • the guiding device determines the three-dimensional coordinates of each face image in the binocular coordinate system according to the depth of each face image, and the binocular coordinate system is a three-dimensional coordinate system with an imaging lens of the binocular camera as the origin;
  • the guiding device determines the horizontal angle and the vertical angle corresponding to each face image according to the three-dimensional coordinates of each face image in the binocular coordinate system.
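The binocular-geometry steps above can be sketched with standard stereo triangulation. This is a minimal sketch under assumed calibration values (focal length in pixels, baseline, principal point); the constants and function name are illustrative and not from the patent.

```python
import math

# Assumed calibration parameters for illustration.
F_PX = 800.0           # focal length in pixels
BASELINE_M = 0.06      # distance between the two lenses, meters
CX, CY = 640.0, 360.0  # principal point of the left lens

def face_angles(x_left, y_left, x_right):
    """Return (depth, horizontal angle, vertical angle) for one face.

    x_left/y_left: pixel coordinates of the face in the left lens;
    x_right: x coordinate of the same face in the right lens.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("face must have positive disparity")
    # Depth from stereo disparity: Z = f * B / d
    z = F_PX * BASELINE_M / disparity
    # Back-project to 3-D coordinates in the left-lens coordinate system
    x = (x_left - CX) * z / F_PX
    y = (y_left - CY) * z / F_PX
    # Horizontal/vertical angles of the face as seen from the camera
    horizontal = math.degrees(math.atan2(x, z))
    vertical = math.degrees(math.atan2(y, z))
    return z, horizontal, vertical
```

A face 50 pixels of disparity apart between the two lenses at this assumed calibration sits roughly a meter away, with its direction given by the two angles.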
  • optionally, determining the speaking object corresponding to the first moment according to the horizontal angle and the vertical angle corresponding to each face image and the horizontal angle and the vertical angle corresponding to the sound source includes:
  • the guiding device determines the distance between the face corresponding to each face image and the sound source according to the horizontal angle and the vertical angle corresponding to each face image, and the horizontal angle and the vertical angle corresponding to the sound source;
  • the guiding device determines the speaking object corresponding to the first moment according to the distance between the face corresponding to each face image and the sound source.
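The matching step above — picking the face whose direction is closest to the localized sound source — can be sketched as follows. The Euclidean angular metric and the names are assumptions; the patent only requires some distance between each face direction and the sound-source direction.

```python
import math

# Sketch of matching the localized sound source to the nearest face by
# angular distance.

def pick_speaker(faces, src_h, src_v):
    """faces: list of (face_id, horizontal_deg, vertical_deg).
    src_h/src_v: sound-source angles estimated from the MIC array.
    Returns the face_id whose direction is closest to the sound source."""
    best_id, best_dist = None, float("inf")
    for face_id, h, v in faces:
        dist = math.hypot(h - src_h, v - src_v)  # angular distance
        if dist < best_dist:
            best_id, best_dist = face_id, dist
    return best_id
```

A practical system would likely also reject matches whose distance exceeds some tolerance, so that off-camera sound sources do not get attributed to a visible face.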
  • the application provides a camera tracking device, including a first determining module, a second determining module, and a control module, where
  • the first determining module is configured to determine the preset time period according to the first video information collected by the first camera in the preset time period and the first audio information collected by the plurality of microphones MIC in the preset time period. Historical speech information within, the first camera is used to collect local video;
  • the second determining module is configured to determine a current speaking object according to the second video information collected by the first camera at the current time and the second audio information collected by the multiple MICs at the current time;
  • the control module is configured to control the navigation state of the at least one navigation camera according to the historical speaking information, the current speaking object, and the speaking object captured by the at least one navigation camera at the current time, where the navigation camera is configured to send the navigation video stream to the other sites.
  • the navigation state of the navigation camera includes a camera angle or a focal length
  • the at least one navigation camera includes a first navigation camera and a second navigation camera
  • the control module is specifically configured to: when the current speaking object is the same as the speaking object captured by the first navigation camera at the current time, keep the imaging angle and the focal length of the first navigation camera unchanged, and continue to send the navigation video captured by the first navigation camera to the other sites at the current time;
  • optionally, the control module includes a determining unit and an adjusting unit, where
  • the determining unit is configured to determine, according to the historical speaking information, the speaking mode in the preset time period when the current speaking object is different from the speaking object captured by the first navigation camera at the current time, where the speaking mode includes at least one of a single-person speaking mode, a two-person debate mode, and a multi-person discussion mode;
  • the adjusting unit is configured to adjust an imaging angle or a focal length of at least one of the first navigation camera and the second navigation camera according to a speaking mode in the preset time period.
  • the determining unit is specifically configured to: determine the number of valid utterances of each speaking object according to the priority of each speaking object in the preset time period and the duration of each of its utterances; determine a speaking object whose number of valid utterances is greater than or equal to 1 as a valid speaking object; and determine the number of valid speaking objects as the number of valid speakers.
  • the determining unit is specifically configured to: when the number of valid speakers is 1, determine that the speaking mode in the preset time period is the single-person speaking mode; when the number of valid speakers is 2 and the two valid speaking objects speak alternately, determine that the speaking mode in the preset time period is the single-person speaking mode or the two-person debate mode; when the number of valid speakers is greater than 2, determine the speaking mode in the preset time period according to the priorities of the at least two valid speaking objects in the preset time period.
  • the determining unit is specifically configured to: if the at least two valid speaking objects include an important speaking object, determine that the speaking mode in the preset time period is the single-person speaking mode; if the at least two valid speaking objects do not include an important speaking object, determine that the speaking mode in the preset time period is the multi-person discussion mode.
  • optionally, the speaking mode is the single-person speaking mode, and the adjusting unit is specifically configured to: determine the target speaking object among the valid speaking objects in the preset time period, and adjust the imaging angle or focal length of the second navigation camera so that the face image of the target speaking object is located at the imaging target position of the second navigation camera; when the number of valid speakers in the preset time period is greater than 2, the important speaking object speaking in the preset time period is determined as the target speaking object.
  • optionally, the speaking mode is the two-person debate mode, and the adjusting unit is specifically configured to: if the distance between the two valid speaking objects in the preset time period is less than a preset distance, adjust the imaging angle or focal length of the second navigation camera so that the face images corresponding to the two valid speaking objects are located at the imaging target position of the second navigation camera; if the distance between the two valid speaking objects in the preset time period is greater than or equal to the preset distance, adjust the imaging angle or focal length of at least one of the first navigation camera and the second navigation camera so that the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first navigation camera and the face image corresponding to the other valid speaking object is located at the imaging target position of the second navigation camera.
  • optionally, the speaking mode is the multi-person discussion mode, and the adjusting unit is specifically configured to: if the distance between the at least two valid speaking objects in the preset time period is less than the preset distance, adjust the imaging angle or focal length of the second navigation camera so that the face images corresponding to the at least two valid speaking objects are located at the imaging target position of the second navigation camera; if the distance between the at least two valid speaking objects in the preset time period is greater than or equal to the preset distance, adjust the imaging angle or focal length of the second navigation camera so that the second navigation camera captures panoramic video.
  • optionally, the device further includes a sending module, where
  • the sending module is configured to: after the adjusting unit adjusts an imaging angle or a focal length of the second navigation camera, send a video stream captured by the second navigation camera to the terminal device, so that the terminal device The video stream captured by the second navigation camera is sent to other sites.
  • the sending module is further configured to: after the adjusting unit adjusts the imaging angle or focal length of at least one of the first navigation camera and the second navigation camera, send the video stream captured by the first navigation camera and the video stream captured by the second navigation camera to the terminal device, so that the terminal device sends the video stream captured by the first navigation camera and the video stream captured by the second navigation camera to the other sites.
  • the first determining module is specifically configured to: determine the speaking object corresponding to each moment according to the video information and the audio information corresponding to that moment in the preset time period, and collect statistics on the speaking objects corresponding to each moment to obtain the historical speaking information, where the historical speaking information includes at least one of the following: the number of speaking objects in the preset time period, the total speaking duration of each speaking object, the number of times each speaking object spoke, the content of each speaking object's speech, the duration of each speech, the time of each speech, and the priority of each speaking object.
  • optionally, the first determining module is specifically configured to: determine the horizontal angle and the vertical angle corresponding to each face image according to the video information at the first moment; determine the horizontal angle and the vertical angle corresponding to the sound source according to the audio information at the first moment; and determine the speaking object corresponding to the first moment accordingly.
  • optionally, the first camera is a binocular camera, and the first determining module is specifically configured to: acquire the two-dimensional coordinates of each face image in the two camera lenses of the binocular camera according to the video information at the first moment; determine the depth of each face image according to the distance between the two camera lenses and the two-dimensional coordinates of each face image in the two lenses, where the depth of a face image is the distance between the face and the binocular camera; determine the three-dimensional coordinates of each face image in the binocular coordinate system according to its depth; and determine the horizontal angle and the vertical angle corresponding to each face image according to its three-dimensional coordinates in the binocular coordinate system.
  • optionally, the first determining module is specifically configured to: determine the distance between the face corresponding to each face image and the sound source according to the horizontal and vertical angles corresponding to each face image and the horizontal and vertical angles corresponding to the sound source, and determine the speaking object corresponding to the first moment according to those distances.
  • the application provides a navigation device, including a processor, a memory, and a communication bus, where the communication bus is configured to implement connections between the components, the memory is configured to store program instructions, and the processor is configured to read the program instructions in the memory and perform the method of any of the above aspects according to the program instructions in the memory.
  • the present application provides a computer-readable storage medium storing computer-executable instructions; when at least one processor of a storage device executes the computer-executable instructions, the storage device performs the camera tracking method provided by the various possible designs described above.
  • the application provides a computer program product comprising computer-executable instructions stored in a computer-readable storage medium.
  • At least one processor of the storage device can read the computer-executable instructions from the computer-readable storage medium, and the at least one processor executes the computer-executable instructions to cause the storage device to implement the camera tracking method provided by various possible designs in the foregoing method embodiments.
  • the present application provides a chip system including a processor for supporting a navigation device to implement the functions involved in the above aspects, for example, processing information involved in the above method.
  • the chip system further includes a memory for storing program instructions and data necessary for the navigation device.
  • the chip system may consist of a chip, or may include a chip and other discrete devices.
  • according to the camera tracking method, apparatus, and device provided by the present application, during the video directing process the guiding device controls the navigation state of the at least one navigation camera according to the historical speaking information in the preset time period, the current speaking object, and the speaking object captured by the at least one navigation camera at the current time. Because the historical speaking information can reflect information such as the speaking mode of the participants in the site and the importance of each participant, the guiding device can control the camera more accurately and avoid frequent and unnecessary switching of the camera.
  • FIG. 1 is a schematic diagram of an application scenario of the camera tracking method provided by the present application;
  • FIG. 2 is a schematic flowchart of the camera tracking method provided by the present application;
  • FIG. 3 is a schematic flowchart of a method for determining a speaking object provided by the present application;
  • FIG. 4 is a physical coordinate system provided by the present application;
  • FIG. 5 is a schematic flowchart of a method for controlling a navigation camera provided by the present application;
  • FIG. 6A is a first schematic diagram of a video picture provided by the present application;
  • FIG. 6B is a second schematic diagram of a video picture provided by the present application;
  • FIG. 6C is a third schematic diagram of a video picture provided by the present application;
  • FIG. 6D is a fourth schematic diagram of a video picture provided by the present application;
  • FIG. 6E is a fifth schematic diagram of a video picture provided by the present application;
  • FIG. 7 is a first schematic structural diagram of the camera tracking device provided by the present application;
  • FIG. 8 is a second schematic structural diagram of the camera tracking device provided by the present application;
  • FIG. 9 is a schematic structural diagram of a navigation device provided by the present application.
  • FIG. 1 is a schematic diagram of an application scenario of a camera tracking method provided by the present application.
  • as shown in FIG. 1, a navigation system is set up in the local conference site; the navigation system can track and shoot objects in the site and transmit the captured video stream to other sites in real time.
  • the navigation system includes a camera stand 101, a guide camera 102, a first camera 103, a microphone (Microphone, MIC for short) array 104, a navigation device 105, and a terminal device 106.
  • the navigation camera 102, the first camera 103, and the MIC array 104 are respectively disposed on the camera stand 101.
  • the number of navigation cameras 102 may be one or more. The video stream captured by the navigation camera 102 is used for transmission to the other sites. It should be noted that when there are multiple navigation cameras 102, at any given time the video streams captured by only some of the navigation cameras 102 may be transmitted to the other sites.
  • the first camera 103 can be a binocular camera, and the first camera 103 can capture the picture in the entire site.
  • the video captured by the first camera 103 is only used for local processing, and is not sent to other sites.
  • the audio collected by multiple MICs in the MIC array 104 is only used for local processing and is not used for sending to other sites.
  • the number of the navigation cameras included in the navigation system is not specifically limited.
  • the number of MICs included in the MIC array is not specifically limited in this application.
  • the first camera 103 can collect video information in the conference site in real time, and transmit the collected video information to the navigation device 105.
  • the MICs in the MIC array 104 can collect audio in the conference site in real time, and transmit the collected audio information to the navigation device 105.
  • the guiding device 105 can determine the object that needs to be photographed at the current time according to the obtained video information and audio information, and control the shooting angle or focal length of the guiding camera according to the position of the object that needs to be photographed at the current time.
  • the navigation camera 102 transmits the collected video stream to the navigation device in real time.
  • the guiding device 105 further determines which video streams captured by the guide cameras need to be sent to other sites, and transmits the determined video streams to the terminal device 106, so that the terminal device 106 sends the received video streams to the other sites.
  • when the terminal device 106 is a video playback device, the terminal device can also play the received video stream locally.
  • in the process of controlling the guide camera 102, the guiding device 105 can determine the current speaking object according to the information collected by the first camera 103 and the MIC array 104 at the current time. The guiding device 105 further determines historical speaking information according to the information collected by the first camera 103 and the MIC array 104 during a historical period, and controls the guide camera according to the current speaking object and the historical speaking information. Since the historical speaking information can reflect information such as the speaking pattern of the participants in the site and the importance of each participant, the guiding device 105 can control the guide camera more precisely according to the current speaking object and the historical speaking information, avoiding frequent and unnecessary switching of the camera.
  • FIG. 2 is a schematic flowchart diagram of a camera tracking method provided by the present application. Referring to FIG. 2, the method may include:
  • S201: The guiding device determines the historical speaking information in a preset time period according to the first video information collected by the first camera in the preset time period and the first audio information collected by multiple MICs in the preset time period, where the first camera is used for collecting local video.
  • the guiding device may refer to the guiding device 105 in the embodiment of FIG. 1.
  • the first camera may refer to the first camera 103 in the embodiment of FIG. 1.
  • the plurality of MICs refer to the MIC array 104 in the embodiment of FIG. 1, and details are not repeated here.
  • the preset time period may be a time period before the current time.
  • the preset time period may be 30 seconds, 1 minute, 5 minutes, 10 minutes, and the like before the current time.
  • the duration of the preset time period is not specifically limited in the present application.
  • the preset time periods corresponding to different moments may also be different.
  • for example, at the beginning of the conference, the duration of the preset time period may be shorter, and after the conference has lasted for a longer time, the duration of the preset time period may be longer.
  • for example, within the first 5 minutes after the start of the conference, the duration of the preset time period may be 1 minute, and after the first 5 minutes of the conference, the duration of the preset time period may be 5 minutes.
  • the duration of the preset time period can be set according to actual needs.
  • optionally, the video information and audio information corresponding to each moment in the preset time period may be acquired, the speaking object corresponding to each moment may be determined according to that video information and audio information, and the speaking objects corresponding to the moments may be counted to obtain the historical speaking information.
  • the historical speaking information may be at least one of the following: the number of speaking objects in the preset time period, the speaking duration of each speaking object, the speech content of each speaking object, the number of speeches of each speaking object, the duration of each speech, the time of each speech, and the priority of each speaking object.
  • the content included in the historical speech information may be determined according to actual needs, which is not specifically limited in this application.
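The per-speaker statistics listed above can be pictured as a simple record type. The following sketch is illustrative only; the field names and the summary layout are assumptions, not part of the published method:

```python
from dataclasses import dataclass, field

@dataclass
class SpeakerHistory:
    """Aggregated speech statistics for one participant over the preset period."""
    speaker_id: str
    priority: int = 0                                    # higher value = more important
    utterances: list = field(default_factory=list)       # (start_time, duration) pairs

    @property
    def speech_count(self):
        return len(self.utterances)

    @property
    def total_duration(self):
        return sum(d for _, d in self.utterances)

def summarize(histories):
    """Build a period-level summary: number of speakers plus per-speaker stats."""
    return {
        "num_speakers": len(histories),
        "per_speaker": {h.speaker_id: (h.speech_count, h.total_duration)
                        for h in histories},
    }
```

Such a summary is what the later FIG. 5 flow consumes when it counts valid speakers and picks a speaking mode.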
  • S202: The guiding device determines the current speaking object according to the second video information collected by the first camera at the current time and the second audio information collected by the multiple MICs at the current time.
  • the current speaking object may be determined by the following feasible implementation: acquire the horizontal angle and vertical angle corresponding to each face image in the video information captured by the first camera at the current time; acquire the audio information collected by the multiple MICs at the current time, and determine the horizontal angle and vertical angle corresponding to the sound source at the current time according to that audio information; then determine the current speaking object according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source.
  • S203: The guiding device controls the guiding state of at least one guide camera according to the historical speaking information, the current speaking object, and the speaking object captured by the at least one guide camera at the current time, where the guide camera is configured to send a guide video stream to other sites.
  • the number of the at least one guide camera may be one or more.
  • when there is one guide camera, that guide camera captures in real time, and its video stream is sent to other sites.
  • in this case, the speaking object captured by the at least one guide camera at the current time is the speaking object captured by that one guide camera at the current time.
  • when there are multiple guide cameras, all of them capture in real time.
  • the video streams captured by some of the guide cameras may be sent to other sites, or the video streams captured by all of the guide cameras may be sent to other sites.
  • in this case, the speaking object captured at the current time is the speaking object captured by the guide camera whose video stream is being sent to other sites at the current time.
  • the navigation state of the navigation camera includes a camera angle of the camera or a focal length of the camera.
  • optionally, when the at least one guide camera includes a first guide camera and a second guide camera, and at the current moment the guide video captured by the first guide camera is being sent to other sites, the guiding state of the at least one guide camera may be controlled through the following feasible implementations:
  • the guiding device keeps the imaging angle and the focal length of the first guiding camera unchanged.
  • the guiding device can also keep the imaging angle and the focal length of the second navigation camera unchanged.
  • the guiding device may also estimate the next speaking object according to the historical speaking object and the current speaking object, and adjust the imaging angle and the focal length of the second guiding camera according to the position of the next speaking object, so that the estimated next speaking object corresponds to The face image is located at the camera target position of the second guide camera.
  • the camera target position of the second guide camera may be a center position of the photographing lens of the second guide camera, or may be a center-up position of the photographing lens of the second guide camera.
  • the camera target position can be set according to actual needs, which is not specifically limited in this application.
  • the guiding device adjusts the imaging angle or focal length of at least one of the first guide camera and the second guide camera according to the historical speaking information.
  • for example, the guiding device can adjust only the imaging angle or focal length of the second guide camera, the guiding device can adjust the imaging angles and focal lengths of both the first guide camera and the second guide camera at the same time, or the guiding device can adjust only the imaging angle or focal length of the first guide camera.
  • in the camera tracking method provided by the present application, during video guiding, the guiding device controls the guiding state of the at least one guide camera according to the determined historical speaking information in the preset time period, the current speaking object, and the speaking object captured by the at least one guide camera at the current time.
  • since the historical speaking information can reflect information such as the speaking pattern of the participants in the site and the importance of each participant, the guiding device can control the guide camera more accurately according to the historical speaking information, the current speaking object, and the speaking object captured by the at least one guide camera at the current time, thereby avoiding frequent and unnecessary switching of the camera.
  • in the process of determining the historical speaking information (S201), the speaking object corresponding to each moment in the preset time period needs to be determined, and in S202, the speaking object corresponding to the current time needs to be determined.
  • the process of determining the speaking object corresponding to each moment in the preset time period is similar to the process of determining the current speaking object corresponding to the current time; therefore, the process of determining the speaking object corresponding to a first moment (any moment in the preset time period, or the current time) is described in detail below as an example. For details, refer to the embodiment shown in FIG. 3. It should be noted that in the embodiment shown in FIG. 3, the first camera is taken to be a binocular camera as an example.
  • FIG. 3 is a schematic flowchart of a method for determining a speaking object provided by the present application. Referring to FIG. 3, the method may include:
  • S301: The guiding device determines, according to the video information at the first moment, a horizontal angle and a vertical angle corresponding to each face image.
  • Figure 4 is a physical coordinate system provided by the present application.
  • the camera bracket includes a horizontal bracket M and a vertical bracket N.
  • the horizontal bracket M and the vertical bracket N are perpendicular to each other, and the intersection between the horizontal bracket M and the vertical bracket N is the midpoint O1 of the horizontal bracket M.
  • the midpoint of the horizontal bracket M is O1
  • the midpoint of the vertical bracket N is O2.
  • the first camera disposed on the horizontal bracket M includes an imaging lens A1 and an imaging lens A2, and the imaging lens A1 and the imaging lens A2 are symmetrically disposed with respect to O1.
  • the binocular coordinate system is a three-dimensional coordinate system that may use the imaging lens A1 as the coordinate origin or the imaging lens A2 as the coordinate origin.
  • Step A: Obtain the two-dimensional coordinates of each face image in each of the two imaging lenses of the binocular camera.
  • the two-dimensional coordinates of a face image in an imaging lens are the two-dimensional coordinates of the face image in the picture captured by that imaging lens.
  • a binocular camera usually has two imaging lenses separated by a certain distance, so that when the two imaging lenses capture the same object, the two-dimensional coordinates of the object differ between the two imaging lenses.
  • the two camera lenses of the binocular camera are respectively the camera lens A1 and the camera lens A2.
  • the picture captured of the object P by the imaging lens A1 may be as shown in picture P1,
  • where the face image of the object P is located on the left side of picture P1;
  • the picture captured of the object P by the imaging lens A2 may be as shown in picture P2,
  • where the face image of the object P is located on the right side of picture P2.
  • it can be seen that the two-dimensional coordinates of the face image of the object P in the imaging lens A1 differ from its two-dimensional coordinates in the imaging lens A2.
  • Step B: Determine the depth of each face image according to the distance between the two imaging lenses of the binocular camera and the two-dimensional coordinates of each face image in the two imaging lenses.
  • the depth of the face image is the distance between the face and the binocular camera.
  • the distance between the two camera lenses is the distance between the camera lens A1 and the camera lens A2.
  • the depth of the face image is the distance between the object P and the horizontal support M.
  • a perpendicular may be drawn from the object P to the line on which the horizontal bracket M lies, and the distance between the object P and the foot of the perpendicular is the depth s of the face image.
  • the depth of the face image in step B can be determined according to formulas in the prior art, and details are not described here.
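The "prior art" depth formula referred to in step B is, for a rectified stereo pair, usually the classic disparity relation depth = focal length × baseline / disparity. A minimal sketch under that assumption (the patent does not publish the formula itself):

```python
def stereo_depth(x_left, x_right, focal_px, baseline_m):
    """Depth of a point from its horizontal disparity in a rectified stereo pair.

    x_left / x_right: horizontal pixel coordinate of the same face in the
    pictures of the two imaging lenses (A1 and A2 in FIG. 4);
    focal_px: focal length expressed in pixels;
    baseline_m: distance between the two imaging lenses, in meters.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("point must have positive disparity")
    return focal_px * baseline_m / disparity
```

For example, a 20-pixel disparity with an 800-pixel focal length and a 10 cm baseline yields a depth of 4 m.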
  • Step C The guiding device determines the three-dimensional coordinates of each face image in the binocular coordinate system according to the depth of each face image, and the binocular coordinate system is a three-dimensional coordinate system with an imaging lens of the binocular camera as the origin.
  • the binocular coordinate system may be a three-dimensional coordinate system with the imaging lens A1 as a coordinate origin, or a three-dimensional coordinate system with the imaging lens A2 as a coordinate origin.
  • the three-dimensional coordinates in step C can be determined according to formulas in the prior art, and details are not repeated here.
  • Step D: Determine the horizontal angle and vertical angle corresponding to each face image according to the three-dimensional coordinates of each face image in the binocular coordinate system.
  • the horizontal angle is the angle α between the line PO1 and the horizontal bracket.
  • the vertical angle is the angle β between the line PO2 and the vertical bracket.
  • the horizontal angle α corresponding to the face image may be determined by the following formula 1:
  • (x, y, z) is the three-dimensional coordinate of the face image in the binocular coordinate system
  • dx is the distance between the imaging lens and the midpoint of the horizontal bracket; for example, referring to FIG. 4, dx is the distance between A1 and O1.
  • the vertical angle β corresponding to the face image may be determined by the following formula 2:
  • dy is half the length of the vertical bracket.
  • for example, referring to FIG. 4, dy may be the distance between O1 and O2.
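Formulas 1 and 2 themselves are not reproduced in this text. One plausible reading of the geometry in FIG. 4 (origin at lens A1, x along the horizontal bracket, y along the vertical bracket, z pointing into the room) is the arctangent form below; the exact sign and axis conventions are an assumption:

```python
import math

def face_angles(x, y, z, dx, dy):
    """Horizontal/vertical angles (degrees) of a face at (x, y, z) in the
    binocular coordinate system.

    dx: distance from the origin lens A1 to the bracket midpoint O1;
    dy: distance from O1 to the vertical-bracket midpoint O2.
    A guess at formulas 1 and 2, not the patent's exact expressions.
    """
    alpha = math.degrees(math.atan2(z, x - dx))   # angle of P-O1 vs. horizontal bracket
    beta = math.degrees(math.atan2(z, y - dy))    # angle of P-O2 vs. vertical bracket
    return alpha, beta
```

A face directly above O1/O2 at depth 1 m yields 90 degrees for both angles, as the geometry suggests.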
  • S302: The guiding device determines, according to the audio information corresponding to the first moment, a horizontal angle and a vertical angle corresponding to the sound source at the first moment.
  • for the same sound source, the audio information collected by different MICs differs; for example,
  • the amplitude or phase of the audio collected by different MICs differs, and the position of the sound source can be determined according to these differences.
  • S303 Determine a speech object corresponding to the first time according to a horizontal angle and a vertical angle corresponding to each personal face image, and a horizontal angle and a vertical angle corresponding to the sound source.
  • the speaking object corresponding to the first moment can be determined by the following feasible implementation:
  • the distance between each face and the sound source may be determined by the following formula 3, where
  • α is the horizontal angle corresponding to the face image,
  • β is the vertical angle corresponding to the face image,
  • α1 is the horizontal angle corresponding to the sound source, and
  • β1 is the vertical angle corresponding to the sound source;
  • the face whose distance to the sound source is the smallest may be determined as the speaking object corresponding to the first moment.
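Formula 3 is likewise not reproduced here; a natural choice is the Euclidean distance in (horizontal, vertical) angle space, with the speaking object taken as the face closest to the sound source. A hedged sketch under that assumption:

```python
import math

def angular_distance(face_h, face_v, src_h, src_v):
    """Distance in (horizontal, vertical) angle space between a face image
    and the localized sound source (a guess at formula 3)."""
    return math.hypot(face_h - src_h, face_v - src_v)

def pick_speaker(faces, src_h, src_v):
    """faces: list of (face_id, horizontal_angle, vertical_angle) tuples.
    Return the face whose angles lie closest to the sound source angles."""
    return min(faces, key=lambda f: angular_distance(f[1], f[2], src_h, src_v))
```

For instance, with a sound source localized at (12, 6) degrees, a face at (10, 5) beats a face at (40, 0).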
  • in the following, the control process of the guide camera is described in detail by taking as an example the case where the at least one guide camera includes a first guide camera and a second guide camera, and the guide video captured by the first guide camera is being sent to other sites at the current time.
  • when the current speaking object is the same as the speaking object captured by the first guide camera at the current time, the guide camera does not need to be adjusted.
  • when the current speaking object is different from the speaking object captured by the first guide camera at the current time, the guide camera needs to be adjusted. In the embodiment shown in FIG. 5, the adjustment process of the guide camera is described only for the case where the current speaking object and the speaking object captured by the first guide camera at the current time are different.
  • FIG. 5 is a schematic flowchart of a method for controlling a navigation camera provided by the present application. Referring to Figure 5, the method can include:
  • S501: The guiding device determines the number of effective speeches of each speaking object according to the priority of each speaking object in the preset time period and the duration of each speech of each speaking object in the preset time period.
  • each participant in the venue corresponds to a priority.
  • the priority of a participant can be preset by an administrator in the guiding device.
  • for example, the administrator can determine the priority of each participant according to the participant's identity or position in the meeting, and import the participant's information (such as face information or voice features) and the corresponding priority into the guiding device, so that during the conference the guiding device can determine the priority of a participant according to the collected video information or audio information.
  • optionally, the guiding device may determine the seat of each participant in the site according to the video information captured by the first camera, and determine a priority for each participant according to the participant's seat, where each seat in the site has a preset priority. For example, a participant sitting at the center of the site has the highest priority, and the farther a participant is from the center of the site, the lower the priority.
  • optionally, the guiding device can also update the priority of each participant according to the participant's speeches. For example, when a participant speaks more often or speaks for longer, the participant's priority can be increased; when a participant speaks less often or speaks more briefly, the participant's priority can be lowered. When a participant's speech includes preset keywords (for example, "comrades", "everyone", "cheering", "effort"), the participant's priority can be increased. Of course, in an actual application, the priorities of the participants may be updated according to other content, which is not specifically limited in this application.
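The priority-update rules just described can be sketched as follows; the thresholds, step size, and keyword list are illustrative assumptions, since the text leaves them open:

```python
# Example keywords only - the patent lists these as illustrations.
KEYWORDS = {"comrades", "everyone", "cheering", "effort"}

def update_priority(priority, num_speeches, total_duration, transcript,
                    speech_threshold=5, duration_threshold=60.0):
    """Rule-of-thumb priority update mirroring the rules in the text.

    Raise priority for participants who speak often or long, lower it for
    quiet participants, and raise it again when a preset keyword appears.
    """
    if num_speeches >= speech_threshold or total_duration >= duration_threshold:
        priority += 1                      # active participant: raise priority
    else:
        priority = max(0, priority - 1)    # quiet participant: lower priority
    if any(k in transcript for k in KEYWORDS):
        priority += 1                      # preset keyword spoken: raise priority
    return priority
```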
  • optionally, different priorities correspond to different effective speaking duration thresholds, and the higher the priority, the lower the effective speaking duration threshold.
  • when the duration of one speech of a speaking object is greater than the effective speaking duration threshold corresponding to the priority of the speaking object, that speech may be determined to be an effective speech.
  • for example, for a first speaking object in the preset time period, the priority of the first speaking object is first acquired, together with the effective speaking duration threshold corresponding to that priority;
  • the duration of each speech of the first speaking object in the preset time period is also acquired, and the number of speeches whose duration is greater than the effective speaking duration threshold is counted; this number is the number of effective speeches of the first speaking object.
  • the effective speaking duration threshold corresponding to the priority may be set according to actual needs, which is not specifically limited in this application.
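The counting rule just described (per-priority duration thresholds, then count speeches exceeding the threshold) might look like this; the concrete threshold values are assumptions, since the text leaves them to the implementer:

```python
# Higher priority -> lower effective-speech duration threshold.
# Values in seconds, keyed by priority level; purely illustrative.
DURATION_THRESHOLDS = {3: 2.0, 2: 4.0, 1: 6.0}

def effective_speech_count(priority, durations):
    """Count utterances longer than the speaker's priority-specific threshold."""
    threshold = DURATION_THRESHOLDS[priority]
    return sum(1 for d in durations if d > threshold)

def valid_speakers(speakers):
    """speakers: {speaker_id: (priority, [speech durations])}.
    Return the ids with at least one effective speech (valid speaking objects)."""
    return [sid for sid, (prio, durs) in speakers.items()
            if effective_speech_count(prio, durs) >= 1]
```

The length of the returned list is the "number of valid speakers" used to choose the speaking mode below.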
  • S502: The guiding device determines each speaking object whose number of effective speeches is greater than or equal to 1 as a valid speaking object.
  • in other words, as long as a speaking object has made at least one effective speech in the preset time period, the speaking object may be determined as a valid speaking object.
  • S503: The guiding device determines the number of valid speaking objects as the number of valid speakers.
  • after the guiding device determines whether each speaking object is a valid speaking object, the valid speaking objects can be counted to obtain the number of valid speakers.
  • the guiding device can determine the number of valid speakers in the preset time period through S501-S503; of course, the guiding device can also determine the number of valid speakers through other feasible implementations, which is not specifically limited in this application.
  • S504: The guiding device determines the speaking mode within the preset time period according to the number of valid speakers in the preset time period.
  • the speaking mode includes at least one of a single speaking mode, a double debate mode, and a multi-person discussion mode.
  • of course, the speaking mode may also include others, such as a three-person debate mode; the present application does not specifically limit the speaking mode.
  • when the number of valid speakers differs, the method for determining the speaking mode also differs, and may include at least the following three possible implementations:
  • first possible implementation: the number of valid speakers is 1.
  • in this case, the guiding device may determine that the speaking mode in the preset time period is the single-person speaking mode.
  • second possible implementation: the number of valid speakers is 2.
  • when the number of valid speakers is 2, there are two valid speaking objects in the preset time period. If the two valid speaking objects speak alternately and the number of alternations is large, it can be determined that the two valid speaking objects are in a debate, and the speaking mode can be determined to be the double debate mode; otherwise, the speaking mode is determined to be the single-person speaking mode.
  • third possible implementation: the number of valid speakers is greater than 2.
  • in this case, the guiding device determines that the speaking mode in the preset time period is the single-person speaking mode or the multi-person discussion mode according to the priorities of the at least two valid speaking objects in the preset time period.
  • if an important speaking object is included among the at least two valid speaking objects, the guiding device may determine that the speaking mode in the preset time period is the single-person speaking mode; if no important speaking object is included among the at least two valid speaking objects, the guiding device may determine that the speaking mode in the preset time period is the multi-person discussion mode.
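The three implementations above amount to a small decision function. In the sketch below, the alternation threshold used to detect a debate is an assumption (the text only says "the number of alternations is more"):

```python
def speaking_mode(num_valid_speakers, alternations=0, has_important=False,
                  alternation_threshold=3):
    """Classify the preset period into a speaking mode.

    num_valid_speakers: number of valid speaking objects in the period;
    alternations: how many times two speakers swapped turns (only used
    when num_valid_speakers == 2); has_important: whether an 'important'
    high-priority object is among the valid speakers (used when > 2).
    """
    if num_valid_speakers <= 1:
        return "single"
    if num_valid_speakers == 2:
        return "debate" if alternations >= alternation_threshold else "single"
    return "single" if has_important else "discussion"
```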
  • after the guiding device determines the speaking mode in the preset time period, the guiding device adjusts the imaging angle or focal length of at least one of the first guide camera and the second guide camera according to the speaking mode in the preset time period, where the adjustment process for the guide camera differs when the speaking mode differs.
  • when the speaking mode is the single-person speaking mode, the guide camera can be adjusted through S505-S507.
  • when the speaking mode is the double debate mode, the guide camera can be adjusted through S508-S512.
  • when the speaking mode is the multi-person discussion mode, the guide camera can be adjusted through S513-S517.
  • S505: The guiding device determines the target speaking object among the valid speaking objects in the preset time period.
  • it can be seen from S504 that the speaking mode in the preset time period may be determined to be the single-person speaking mode in several cases; correspondingly, when the speaking mode is determined to be the single-person speaking mode, the way of determining the target speaking object differs depending on the number of valid speakers in the preset time period,
  • and may include at least the following three possible implementations:
  • first possible implementation: the number of valid speakers in the preset time period is 1.
  • in this case, the one valid speaking object in the preset time period may be determined as the target speaking object.
  • second possible implementation: the number of valid speakers in the preset time period is 2.
  • in this case, the target speaking object can be determined between the two valid speaking objects according to their priorities; for example, the valid speaking object with the higher priority may be determined as the target speaking object.
  • third possible implementation: the number of valid speakers in the preset time period is greater than 2.
  • in this case, the important speaking object may be determined according to the priorities of the valid speaking objects in the preset time period, and the important speaking object is determined as the target speaking object.
  • the target speaking object may also be determined according to other feasible implementation manners, which is not specifically limited in this application.
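The three ways of picking the target speaking object reduce to "the only speaker, else the highest-priority one". A minimal sketch (tie-breaking behavior is an assumption):

```python
def target_speaker(valid, priorities):
    """Pick the target speaking object in single-person speaking mode.

    valid: list of valid speaker ids; priorities: {id: priority}.
    With one valid speaker, that speaker is the target; otherwise the
    highest-priority valid speaker is chosen (ties broken arbitrarily).
    """
    if len(valid) == 1:
        return valid[0]
    return max(valid, key=lambda s: priorities[s])
```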
  • S506: The guiding device adjusts the imaging angle or focal length of the second guide camera so that the face image of the target speaking object is located at the imaging target position of the second guide camera.
  • optionally, since at this time the video stream captured by the first guide camera is still being transmitted to other sites, the second guide camera may be adjusted without affecting the picture currently being sent.
  • alternatively, either guide camera can be adjusted at this time, or the guide camera whose required adjustment range is smaller is adjusted.
  • optionally, the imaging angle or focal length of the second guide camera can be adjusted through the following steps A to D:
  • Step A The guiding device acquires three-dimensional coordinates of the target speaking object in the binocular coordinate system.
  • step A can be referred to step A-step C in S301, and details are not described herein.
  • Step B: The guiding device determines the three-dimensional coordinates of the target speaking object in the guiding coordinate system according to the external parameters between the binocular camera (the first camera) and the second guide camera.
  • the guiding coordinate system is a three-dimensional coordinate system with the initial position of the second guide camera as the origin.
  • the external parameters between the binocular camera and the second guide camera include the distance between the binocular camera and the second guide camera, and the like.
  • Step C The guiding device determines the two-dimensional coordinates of the target speaking object in the imaging lens of the second guiding camera according to the three-dimensional coordinates of the target speaking object in the guiding coordinate system.
  • Step D: Calculate, according to the distance between the target speaking object and the second guide camera, the two-dimensional coordinates of the target speaking object in the imaging lens of the second guide camera, and the target position to be reached, the focal length of the second guide camera and the horizontal angle and vertical angle by which the second guide camera needs to rotate.
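Steps A to D can be pictured as a coordinate transform followed by a pan/tilt/zoom computation. The sketch below ignores rotation in the external parameters (translation only) and uses illustrative sensor constants; it is not the patent's exact formula:

```python
import math

def aim_camera(target_xyz, cam_offset, sensor_width_mm=4.8,
               desired_face_frac=0.2, face_width_m=0.16):
    """Derive pan/tilt angles and a focal length for the guide camera.

    target_xyz: target's 3D coordinates in the binocular coordinate system;
    cam_offset: translation from the binocular origin to the guide camera
    (a simplified stand-in for the 'external parameters');
    desired_face_frac: fraction of the image width the face should occupy.
    All camera constants here are assumptions for illustration.
    """
    x, y, z = (t - o for t, o in zip(target_xyz, cam_offset))
    pan = math.degrees(math.atan2(x, z))      # horizontal rotation angle
    tilt = math.degrees(math.atan2(y, z))     # vertical rotation angle
    dist = math.sqrt(x * x + y * y + z * z)
    # Pinhole model: face occupies f * face_width / dist on the sensor;
    # solve for f so that this equals desired_face_frac * sensor_width.
    focal_mm = desired_face_frac * sensor_width_mm * dist / face_width_m
    return pan, tilt, focal_mm
```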
  • FIG. 6A is a schematic diagram 1 of a video picture provided by the present application.
  • referring to FIG. 6A, the video picture includes one valid speaking object, and the valid speaking object is located at the target position of the picture; for example, the target position is the center of the picture.
  • S507: The guiding device sends the video stream captured by the second guide camera to the terminal device, so that the terminal device sends the video stream captured by the second guide camera to other sites.
  • after the guiding device sends the video stream captured by the second guide camera to the terminal device, the terminal device sends the received video stream to other sites.
  • S508: The guiding device adjusts the imaging angle or focal length of the second guide camera so that the face images corresponding to the two valid speaking objects are located at the imaging target position of the second guide camera.
  • FIG. 6B is a schematic diagram 2 of a video screen provided by the present application. Referring to FIG. 6B, two valid speaking objects are included in the video picture, and the two valid speaking objects are located at the center of the screen.
  • S509: The guiding device sends the video stream captured by the second guide camera to the terminal device, so that the terminal device sends the video stream captured by the second guide camera to other sites.
  • S510: The guiding device adjusts the imaging angle or focal length of at least one of the first guide camera and the second guide camera, so that the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first guide camera,
  • and the face image corresponding to the other valid speaking object is located at the imaging target position of the second guide camera.
  • S511: The guiding device sends the video stream captured by the first guide camera and the video stream captured by the second guide camera to the terminal device, so that the terminal device sends the video stream captured by the first guide camera and the video stream captured by the second guide camera to other sites.
  • after the terminal device receives the video stream captured by the first guide camera and the video stream captured by the second guide camera, the terminal device sends the two video streams to the other conference sites, so that the other sites can simultaneously play the video streams captured by the first guide camera and the second guide camera.
  • optionally, the guiding device may also combine the video stream captured by the first guide camera and the video stream captured by the second guide camera, and send the combined video stream to the terminal device.
  • similarly, the terminal device may combine the video streams captured by the first guide camera and the second guide camera before sending the combined video stream to other sites.
  • FIG. 6C is a schematic diagram 3 of a video picture provided by the present application.
  • referring to FIG. 6C, the video picture includes two valid speaking objects, which are captured respectively by the first guide camera and the second guide camera, and the pictures corresponding to the two valid speaking objects are combined together by the terminal device.
  • S513: The guiding device determines that the distance between the at least two valid speaking objects is less than a preset distance.
  • optionally, the guiding device may determine the distance between the two valid speaking objects that are farthest apart as the distance between the at least two valid speaking objects.
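The farthest-pair rule, together with the later fallback to a panoramic shot when the group cannot be framed together, can be sketched as follows (measuring separation in horizontal angle, and using a 40-degree threshold, are both illustrative stand-ins for the preset distance):

```python
def group_spread(angles):
    """Largest pairwise separation among the valid speakers' horizontal angles,
    i.e. the distance between the two speakers that are farthest apart."""
    return max(angles) - min(angles) if angles else 0.0

def discussion_shot(angles, max_spread=40.0):
    """Frame all valid speakers together when they fit within the preset
    spread; otherwise fall back to the panoramic shot."""
    return "group" if group_spread(angles) < max_spread else "panorama"
```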
  • S514: The guiding device adjusts the imaging angle or focal length of the second guide camera so that the face images corresponding to the at least two valid speaking objects are located at the imaging target position of the second guide camera.
  • FIG. 6D is a schematic diagram 4 of a video picture provided by the present application. Referring to FIG. 6D, assuming that the number of the at least two valid speaking objects is three, three valid speaking objects are included in the video picture, and the three valid speaking objects are located at the center of the picture.
  • S515: The guiding device sends the video stream captured by the second guide camera to the terminal device, so that the terminal device sends the video stream captured by the second guide camera to other sites.
  • S516: The guiding device adjusts the imaging angle or focal length of the second guide camera so that the second guide camera can capture a panoramic video.
  • FIG. 6E is a schematic diagram 5 of a video screen provided by the present application.
  • referring to FIG. 6E, the video picture captured by the second guide camera is a panoramic picture, and the panoramic picture includes the face images of all participants in the site.
  • S517: The guiding device sends the video stream captured by the second guide camera to the terminal device, so that the terminal device sends the video stream captured by the second guide camera to other sites.
  • in the embodiment shown in FIG. 5, the guiding device determines the speaking mode according to the historical speaking information in the preset time period, and controls the guide camera according to the speaking mode.
  • since the speaking mode in the preset time period can reflect the real scene of the meeting, the guide camera can be accurately controlled according to the real scene of the meeting, thereby avoiding frequent and unnecessary switching of the guide camera.
  • FIG. 7 is a schematic structural diagram 1 of the camera tracking device provided by the present application.
  • the camera tracking device can be disposed in the navigation device shown in the embodiment of FIG. 1.
  • the apparatus may include a first determining module 11, a second determining module 12, and a control module 13, wherein
  • the first determining module 11 is configured to determine historical speech information in a preset time period according to the first video information collected by the first camera in the preset time period and the first audio information collected by the plurality of microphones MIC in the preset time period, the first camera is used to collect local video;
  • the second determining module 12 is configured to determine a current speaking object according to the second video information collected by the first camera at the current time and the second audio information collected by the multiple MICs at the current time;
  • the control module 13 is configured to control a navigation state of the at least one navigation camera according to the historical speech information, the current speech object, and a speech object captured by the at least one navigation camera at the current time, the navigation camera Used to send a guide video stream to other sites.
  • the camera tracking device provided by the present application can perform the technical solutions shown in the foregoing method embodiments, and the implementation principle and the beneficial effects are similar, and details are not described herein.
  • the navigation state of the navigation camera includes a camera angle or a focal length
  • the at least one navigation camera includes a first navigation camera and a second navigation camera
  • the control module 13 is specifically configured to:
  • when the current speaking object is the same as the speaking object captured by the first navigation camera at the current time, the imaging angle and the focal length of the first navigation camera are kept unchanged, and the video captured by the first navigation camera is sent to other venues at the current time;
  • FIG. 8 is a schematic structural diagram 2 of the camera tracking device provided by the present application.
  • the control module 13 includes a determining unit 131 and an adjusting unit 132, where
  • the determining unit 131 is configured to determine, according to the historical speaking information, a speaking mode in the preset time period when the current speaking object is different from the speaking object captured by the first guiding camera at the current time,
  • the speaking mode includes at least one of a single-person speaking mode, a two-person debate mode, and a multi-person discussion mode;
  • the adjusting unit 132 is configured to adjust an imaging angle or a focal length of at least one of the first navigation camera and the second navigation camera according to a speaking mode in the preset time period.
  • the determining unit 131 is specifically configured to:
  • the determining unit 131 is specifically configured to:
  • the number of valid speaking objects is determined as the number of valid speakers.
  • the determining unit 131 is specifically configured to:
  • the speaking mode in the preset time period is a single-person speaking mode or a two-person debate mode
  • the determining unit 131 is specifically configured to:
  • if the at least two valid speaking objects include an important speaking object, determining that the speaking mode in the preset time period is a single-person speaking mode
  • if the important speaking object is not included in the at least two valid speaking objects, determining that the speaking mode in the preset time period is a multi-person discussion mode.
  • the speaking mode is a single-person speaking mode; the adjusting unit 132 is specifically configured to:
  • the adjusting unit 132 is specifically configured to:
  • the important speaking object speaking within the preset time period is determined as the target speaking object.
  • the speaking mode is a double debate mode; the adjusting unit 132 is specifically configured to:
  • if the distance between two valid speaking objects in the preset time period is less than a preset distance, adjusting an imaging angle or a focal length of the second navigation camera, so that the face images corresponding to the two valid speaking objects are located at an imaging target position of the second navigation camera;
  • if the distance between two valid speaking objects in the preset time period is greater than or equal to the preset distance, adjusting an imaging angle or a focal length of at least one of the first guiding camera and the second guiding camera, so that
  • the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first navigation camera, and the face image corresponding to the other valid speaking object is located at the imaging target position of the second navigation camera.
  • the speaking mode is a multi-person discussion mode; the adjusting unit 132 is specifically configured to:
  • if the distance between the at least two valid speaking objects in the preset time period is less than the preset distance, adjusting an imaging angle or a focal length of the second navigation camera so that the face
  • images corresponding to the at least two valid speaking objects are located at the imaging target position of the second navigation camera;
  • if the distance between the at least two valid speaking objects in the preset time period is greater than or equal to the preset distance, adjusting the imaging angle or focal length of the second navigation camera to enable the second navigation camera to capture the panoramic video.
  • the device further includes a sending module 14, wherein
  • the sending module 14 is configured to: after the adjusting unit 132 adjusts an imaging angle or a focal length of the second navigation camera, send a video stream captured by the second navigation camera to the terminal device, so that the terminal device The video stream captured by the second navigation camera is sent to other sites.
  • the sending module 14 is further configured to:
  • the adjusting unit 132 adjusts an imaging angle or a focal length of at least one of the first navigation camera and the second navigation camera, transmitting, by the first navigation camera, the video stream and the second guide to the terminal device. a video stream captured by the camera, so that the terminal device sends the video stream captured by the first navigation camera and the video stream captured by the second navigation camera to other sites.
  • the first determining module 11 is specifically configured to:
  • the speaking object corresponding to each time instant is counted to obtain the historical speech information;
  • the historical speech information includes at least one of the following: the number of speaking objects in the preset time period, the speaking duration of each speaking object, the number of times each speaking object speaks, the speech content of each speaking object, the duration of each speech, the time of each speech, and the priority of each speaking object.
  • the first determining module 11 is specifically configured to:
  • the first camera is a binocular camera; the first determining module 11 is specifically configured to:
  • the depth of a face image is the distance between the face and the binocular camera;
  • the horizontal angle and the vertical angle corresponding to each human face image are determined according to the three-dimensional coordinates of each human face image in the binocular coordinate system.
  • the first determining module 11 is specifically configured to:
  • the camera tracking device provided by the present application can perform the technical solutions shown in the foregoing method embodiments, and the implementation principles and beneficial effects thereof are similar, and details are not described herein.
  • FIG. 9 is a schematic structural diagram of a navigation device provided by the present application.
  • the navigation device includes a processor 21, a memory 22, and a communication bus 23, where the communication bus 23 is used to implement connections between components, and the memory 22 is used to store program instructions.
  • the processor 21 is configured to read the program instructions in the memory 22, and execute the technical solution shown in the foregoing method embodiment according to the program instructions in the memory 22.
  • the guiding device provided by the present application can perform the technical solutions shown in the foregoing method embodiments, and the implementation principles and beneficial effects thereof are similar, and details are not described herein.
  • the present application provides a computer readable storage medium having computer-executable instructions stored therein; when at least one processor of a storage device executes the computer-executable instructions, the storage device performs the camera tracking method provided by the various possible designs described above.
  • the application provides a computer program product comprising computer-executable instructions stored in a computer readable storage medium.
  • At least one processor of the storage device can read the computer-executable instructions from the computer readable storage medium, and the at least one processor executes the computer-executable instructions to cause the storage device to implement the camera tracking method provided by the various possible designs in the foregoing method embodiments.
  • the present application provides a chip system including a processor for supporting a navigation device to implement the functions involved in the above aspects, for example, processing information involved in the above method.
  • the chip system further includes a memory for storing program instructions and data necessary for the navigation device.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • the aforementioned program can be stored in a computer readable storage medium.
  • the program when executed, performs the steps including the foregoing method embodiments; and the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Studio Devices (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

This application provides a camera tracking method, apparatus and device. The method includes: a guiding device determines historical speech information within a preset time period according to first video information collected by a first camera within the preset time period and first audio information collected by a plurality of microphones (MICs) within the preset time period, where the first camera is used to collect local video; the guiding device determines a current speaking object according to second video information collected by the first camera at the current time and second audio information collected by the plurality of MICs at the current time; and the guiding device controls a guiding state of at least one guiding camera according to the historical speech information, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time, where the guiding camera is used to send a guided video stream to other conference sites. Frequent and unnecessary switching of the guiding camera is thereby avoided.

Description

Camera Tracking Method, Apparatus and Device
This application claims priority to the Chinese patent application with application number , filed with the Chinese Patent Office on August 16, 2017 and entitled "Camera Tracking Method, Apparatus and Device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of camera technologies, and in particular, to a camera tracking method, apparatus and device.
Background
At present, video conferencing is used more and more widely. In a video conferencing scenario, a guiding camera is usually installed in the conference room; the video information of the participants in the conference room is obtained in real time through the guiding camera, and the obtained video information is transmitted to the other party or parties of the conference.
In the prior art, a guiding camera can use sound source localization technology to automatically switch the guiding lens to the participant who is currently speaking. Specifically, the sound source is located by a sound source detection device, and the guiding camera is adjusted according to the source of the sound, so that the guiding camera can shoot the participant who is currently speaking. However, in the prior art, when the person speaking in the conference room changes, the lens of the guiding camera usually switches, causing frequent and unnecessary switching of the lens of the guiding camera.
Summary
This application provides a camera tracking method, apparatus and device, which avoid frequent and unnecessary switching of a guiding camera.
According to a first aspect, this application provides a camera tracking method. A guiding system includes at least a first camera, a plurality of MICs, a guiding camera and a guiding device, where the first camera is used to collect local video, the plurality of MICs are used to collect audio information, the guiding camera is used to send a guided video stream to other conference sites, and the guiding device is used to control a guiding state of the guiding camera. When the guiding device needs to control the guiding state of the guiding camera, the guiding device determines historical speech information within a preset time period according to first video information collected by the first camera within the preset time period and first audio information collected by the plurality of microphones (MICs) within the preset time period; determines a current speaking object according to second video information collected by the first camera at the current time and second audio information collected by the plurality of MICs at the current time; and controls the guiding state of at least one guiding camera according to the historical speech information, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time.
In this application, during video guiding, the guiding device controls the guiding state of the at least one guiding camera according to the determined historical speech information within the preset time period, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time. Because the historical speech information can reflect information such as the speaking pattern of the participants at the site and the importance of the participants, the guiding device can control the guiding camera more precisely according to the historical speech information, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time, thereby avoiding frequent and unnecessary switching of the guiding camera.
Optionally, the guiding state of a guiding camera may include an imaging angle or a focal length, and the at least one guiding camera includes a first guiding camera and a second guiding camera. Correspondingly, the guiding device may control the guiding state of the at least one guiding camera in the following feasible implementations:
When the current speaking object is the same as the speaking object captured by the first guiding camera at the current time, the guiding device keeps the imaging angle and focal length of the first guiding camera unchanged, and the video captured by the first guiding camera is sent to the other sites at the current time.
When the current speaking object is different from the speaking object captured by the first guiding camera at the current time, the guiding device adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera according to the historical speech information.
Optionally, when the current speaking object is different from the speaking object captured by the first guiding camera at the current time, the guiding device determines a speaking mode within the preset time period according to the historical speech information, and adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera according to the speaking mode within the preset time period, where the speaking mode includes at least one of a single-person speaking mode, a two-person debate mode and a multi-person discussion mode.
In another possible implementation, the guiding device may determine the speaking mode within the preset time period according to the historical speech information in the following feasible manner:
The guiding device determines the number of valid speakers within the preset time period according to the historical speech information. Optionally, the guiding device determines the number of valid speeches of each speaking object according to the priority of each speaking object within the preset time period and the duration of each speech of each speaking object within the preset time period, determines a speaking object whose number of valid speeches is greater than or equal to 1 as a valid speaking object, and determines the number of valid speaking objects as the number of valid speakers.
The guiding device determines the speaking mode within the preset time period according to the number of valid speakers within the preset time period. Optionally, when the number of valid speakers is 1, the guiding device determines that the speaking mode within the preset time period is the single-person speaking mode; when the number of valid speakers is 2, if the two valid speakers speak alternately, the guiding device determines that the speaking mode within the preset time period is the single-person speaking mode or the two-person debate mode; when the number of valid speakers is greater than 2, the guiding device determines, according to the priorities of the at least two valid speaking objects within the preset time period, that the speaking mode within the preset time period is the single-person speaking mode or the multi-person discussion mode.
Optionally, if the at least two valid speaking objects include an important speaking object, the guiding device determines that the speaking mode within the preset time period is the single-person speaking mode; if the at least two valid speaking objects do not include an important speaking object, the guiding device determines that the speaking mode within the preset time period is the multi-person discussion mode.
Depending on the speaking mode, the guiding device controls the first or second guiding camera differently, in at least the following three feasible implementations:
First feasible implementation: the speaking mode is the single-person speaking mode.
In this implementation, the guiding device determines a target speaking object among the valid speaking objects within the preset time period, and adjusts the imaging angle or focal length of the second guiding camera so that the face image of the target speaking object is located at the imaging target position of the second guiding camera.
Optionally, the target speaking object may be determined in the following feasible manner:
When the number of valid speakers within the preset time period is 1, the single valid speaking object within the preset time period is determined as the target speaking object; when the number of valid speakers within the preset time period is 2, the target speaking object is determined from the two valid speaking objects according to their priorities; when the number of valid speakers within the preset time period is greater than 2, the important speaking object that spoke within the preset time period is determined as the target speaking object.
Second feasible implementation: the speaking mode is the two-person debate mode.
If the distance between the two valid speaking objects within the preset time period is less than a preset distance, the guiding device adjusts the imaging angle or focal length of the second guiding camera so that the face images corresponding to the two valid speaking objects are located at the imaging target position of the second guiding camera; if the distance between the two valid speaking objects within the preset time period is greater than or equal to the preset distance, the guiding device adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera so that the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first guiding camera and the face image corresponding to the other valid speaking object is located at the imaging target position of the second guiding camera.
Third feasible implementation: the speaking mode is the multi-person discussion mode.
If the distance between the at least two valid speaking objects within the preset time period is less than a preset distance, the guiding device adjusts the imaging angle or focal length of the second guiding camera so that the face images corresponding to the at least two valid speaking objects are located at the imaging target position of the second guiding camera; if the distance between the at least two valid speaking objects within the preset time period is greater than or equal to the preset distance, the guiding device adjusts the imaging angle or focal length of the second guiding camera so that the second guiding camera captures panoramic video.
In any of the foregoing implementations, after the guiding device adjusts the imaging angle or focal length of the second guiding camera, the guiding device sends the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends the video stream captured by the second guiding camera to the other sites.
In any of the foregoing implementations, after the guiding device adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera, the guiding device sends the video stream captured by the first guiding camera and the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends both video streams to the other sites.
In another possible implementation, that the guiding device determines the historical speech information within the preset time period according to the first video information collected by the first camera within the preset time period and the first audio information collected by the plurality of MICs within the preset time period includes:
The guiding device determines, according to the video information and audio information corresponding to each time instant within the preset time period, the speaking object corresponding to each time instant.
The guiding device collects statistics on the speaking objects corresponding to the time instants to obtain the historical speech information, where the historical speech information includes at least one of the following: the number of speaking objects within the preset time period, the total speaking duration of each speaking object, the number of speeches of each speaking object, the speech content of each speaking object, the duration of each speech, the time of each speech, and the priority of each speaking object.
Optionally, for a first time instant within the preset time period, determining the speaking object corresponding to the first time instant according to the video information and audio information at the first time instant includes:
The guiding device determines, according to the video information at the first time instant, the horizontal angle and vertical angle corresponding to each face image.
The guiding device determines, according to the audio information corresponding to the first time instant, the horizontal angle and vertical angle corresponding to the sound source at the first time instant.
The guiding device determines the speaking object corresponding to the first time instant according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source.
Optionally, when the first camera is a binocular camera, that the guiding device determines the horizontal angle and vertical angle corresponding to each face image according to the video information at the first time instant includes:
The guiding device obtains, according to the video information at the first time instant, the two-dimensional coordinates of each face image in the two lenses of the binocular camera.
The guiding device determines the depth of each face image according to the distance between the two lenses of the binocular camera and the two-dimensional coordinates of each face in the two lenses, where the depth of a face image is the distance between the face and the binocular camera.
The guiding device determines, according to the depth of each face image, the three-dimensional coordinates of each face image in a binocular coordinate system, where the binocular coordinate system is a three-dimensional coordinate system whose origin is one lens of the binocular camera.
The guiding device determines, according to the three-dimensional coordinates of each face image in the binocular coordinate system, the horizontal angle and vertical angle corresponding to each face image.
Optionally, determining the speaking object corresponding to the first time instant according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source includes:
The guiding device determines the distance between the face corresponding to each face image and the sound source according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source.
The guiding device determines the speaking object corresponding to the first time instant according to the distance between the face corresponding to each face image and the sound source.
According to a second aspect, this application provides a camera tracking apparatus, including a first determining module, a second determining module and a control module, where
the first determining module is configured to determine historical speech information within a preset time period according to first video information collected by a first camera within the preset time period and first audio information collected by a plurality of microphones (MICs) within the preset time period, where the first camera is used to collect local video;
the second determining module is configured to determine a current speaking object according to second video information collected by the first camera at the current time and second audio information collected by the plurality of MICs at the current time; and
the control module is configured to control a guiding state of at least one guiding camera according to the historical speech information, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time, where the guiding camera is used to send a guided video stream to other conference sites.
In a possible implementation, the guiding state of a guiding camera includes an imaging angle or a focal length, and the at least one guiding camera includes a first guiding camera and a second guiding camera; the control module is specifically configured to:
when the current speaking object is the same as the speaking object captured by the first guiding camera at the current time, keep the imaging angle and focal length of the first guiding camera unchanged, where the video captured by the first guiding camera is sent to the other sites at the current time; and
when the current speaking object is different from the speaking object captured by the first guiding camera at the current time, adjust the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera according to the historical speech information.
In another possible implementation, the control module includes a determining unit and an adjusting unit, where
the determining unit is configured to: when the current speaking object is different from the speaking object captured by the first guiding camera at the current time, determine a speaking mode within the preset time period according to the historical speech information, where the speaking mode includes at least one of a single-person speaking mode, a two-person debate mode and a multi-person discussion mode; and
the adjusting unit is configured to adjust the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera according to the speaking mode within the preset time period.
In another possible implementation, the determining unit is specifically configured to:
determine the number of valid speakers within the preset time period according to the historical speech information; and
determine the speaking mode within the preset time period according to the number of valid speakers within the preset time period.
In another possible implementation, the determining unit is specifically configured to:
determine the number of valid speeches of each speaking object according to the priority of each speaking object within the preset time period and the duration of each speech of each speaking object within the preset time period;
determine a speaking object whose number of valid speeches is greater than or equal to 1 as a valid speaking object; and
determine the number of valid speaking objects as the number of valid speakers.
In another possible implementation, the determining unit is specifically configured to:
when the number of valid speakers is 1, determine that the speaking mode within the preset time period is the single-person speaking mode;
when the number of valid speakers is 2, if the two valid speakers speak alternately, determine that the speaking mode within the preset time period is the single-person speaking mode or the two-person debate mode; and
when the number of valid speakers is greater than 2, determine, according to the priorities of the at least two valid speaking objects within the preset time period, that the speaking mode within the preset time period is the single-person speaking mode or the multi-person discussion mode.
In another possible implementation, the determining unit is specifically configured to:
if the at least two valid speaking objects include an important speaking object, determine that the speaking mode within the preset time period is the single-person speaking mode; and
if the at least two valid speaking objects do not include an important speaking object, determine that the speaking mode within the preset time period is the multi-person discussion mode.
In another possible implementation, the speaking mode is the single-person speaking mode, and the adjusting unit is specifically configured to:
determine a target speaking object among the valid speaking objects within the preset time period; and
adjust the imaging angle or focal length of the second guiding camera so that the face image of the target speaking object is located at the imaging target position of the second guiding camera.
In another possible implementation, the adjusting unit is specifically configured to:
when the number of valid speakers within the preset time period is 1, determine the single valid speaking object within the preset time period as the target speaking object;
when the number of valid speakers within the preset time period is 2, determine the target speaking object from the two valid speaking objects according to their priorities; and
when the number of valid speakers within the preset time period is greater than 2, determine the important speaking object that spoke within the preset time period as the target speaking object.
In another possible implementation, the speaking mode is the two-person debate mode, and the adjusting unit is specifically configured to:
if the distance between the two valid speaking objects within the preset time period is less than a preset distance, adjust the imaging angle or focal length of the second guiding camera so that the face images corresponding to the two valid speaking objects are located at the imaging target position of the second guiding camera; and
if the distance between the two valid speaking objects within the preset time period is greater than or equal to the preset distance, adjust the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera so that the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first guiding camera and the face image corresponding to the other valid speaking object is located at the imaging target position of the second guiding camera.
In another possible implementation, the speaking mode is the multi-person discussion mode, and the adjusting unit is specifically configured to:
if the distance between the at least two valid speaking objects within the preset time period is less than a preset distance, adjust the imaging angle or focal length of the second guiding camera so that the face images corresponding to the at least two valid speaking objects are located at the imaging target position of the second guiding camera; and
if the distance between the at least two valid speaking objects within the preset time period is greater than or equal to the preset distance, adjust the imaging angle or focal length of the second guiding camera so that the second guiding camera captures panoramic video.
In another possible implementation, the apparatus further includes a sending module, where
the sending module is configured to: after the adjusting unit adjusts the imaging angle or focal length of the second guiding camera, send the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends the video stream captured by the second guiding camera to the other sites.
In another possible implementation, the sending module is further configured to:
after the adjusting unit adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera, send the video stream captured by the first guiding camera and the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends both video streams to the other sites.
In another possible implementation, the first determining module is specifically configured to:
determine, according to the video information and audio information corresponding to each time instant within the preset time period, the speaking object corresponding to each time instant; and
collect statistics on the speaking objects corresponding to the time instants to obtain the historical speech information, where the historical speech information includes at least one of the following: the number of speaking objects within the preset time period, the total speaking duration of each speaking object, the number of speeches of each speaking object, the speech content of each speaking object, the duration of each speech, the time of each speech, and the priority of each speaking object.
In another possible implementation, for a first time instant within the preset time period, the first determining module is specifically configured to:
determine, according to the video information at the first time instant, the horizontal angle and vertical angle corresponding to each face image;
determine, according to the audio information corresponding to the first time instant, the horizontal angle and vertical angle corresponding to the sound source at the first time instant; and
determine the speaking object corresponding to the first time instant according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source.
In another possible implementation, the first camera is a binocular camera, and the first determining module is specifically configured to:
obtain, according to the video information at the first time instant, the two-dimensional coordinates of each face image in the two lenses of the binocular camera;
determine the depth of each face image according to the distance between the two lenses of the binocular camera and the two-dimensional coordinates of each face in the two lenses, where the depth of a face image is the distance between the face and the binocular camera;
determine, according to the depth of each face image, the three-dimensional coordinates of each face image in a binocular coordinate system, where the binocular coordinate system is a three-dimensional coordinate system whose origin is one lens of the binocular camera; and
determine, according to the three-dimensional coordinates of each face image in the binocular coordinate system, the horizontal angle and vertical angle corresponding to each face image.
In another possible implementation, the first determining module is specifically configured to:
determine the distance between the face corresponding to each face image and the sound source according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source; and
determine the speaking object corresponding to the first time instant according to the distance between the face corresponding to each face image and the sound source.
According to a third aspect, this application provides a guiding device, including a processor, a memory and a communication bus, where the communication bus is used to implement connections between components, the memory is used to store program instructions, and the processor is used to read the program instructions in the memory and perform, according to the program instructions in the memory, the method according to any one of the implementations of the first aspect.
According to a fourth aspect, this application provides a computer-readable storage medium storing computer-executable instructions; when at least one processor of a storage device executes the computer-executable instructions, the storage device performs the camera tracking method provided in the foregoing possible designs.
According to a fifth aspect, this application provides a computer program product including computer-executable instructions stored in a computer-readable storage medium. At least one processor of a storage device can read the computer-executable instructions from the computer-readable storage medium, and the at least one processor executes the computer-executable instructions so that the storage device implements the camera tracking method provided in the possible designs of the foregoing method embodiments.
According to a sixth aspect, this application provides a chip system, including a processor configured to support a guiding device in implementing the functions involved in the foregoing aspects, for example, processing the information involved in the foregoing methods. In a possible design, the chip system further includes a memory used to store the program instructions and data necessary for the guiding device. The chip system may consist of chips, or may include chips and other discrete devices.
With the camera tracking method, apparatus and device provided in this application, during video guiding the guiding device controls the guiding state of at least one guiding camera according to the determined historical speech information within the preset time period, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time. Because the historical speech information can reflect information such as the speaking pattern of the participants at the site and the importance of the participants, the guiding device can control the guiding camera more precisely, thereby avoiding frequent and unnecessary switching of the guiding camera.
Brief Description of Drawings
FIG. 1 is a schematic diagram of an application scenario of the camera tracking method provided in this application;
FIG. 2 is a schematic flowchart of the camera tracking method provided in this application;
FIG. 3 is a schematic flowchart of the method for determining a speaking object provided in this application;
FIG. 4 shows a physical coordinate system provided in this application;
FIG. 5 is a schematic flowchart of the method for controlling a guiding camera provided in this application;
FIG. 6A is a schematic diagram 1 of a video picture provided in this application;
FIG. 6B is a schematic diagram 2 of a video picture provided in this application;
FIG. 6C is a schematic diagram 3 of a video picture provided in this application;
FIG. 6D is a schematic diagram 4 of a video picture provided in this application;
FIG. 6E is a schematic diagram 5 of a video picture provided in this application;
FIG. 7 is a schematic structural diagram 1 of the camera tracking apparatus provided in this application;
FIG. 8 is a schematic structural diagram 2 of the camera tracking apparatus provided in this application;
FIG. 9 is a schematic structural diagram of the guiding device provided in this application.
Description of Embodiments
FIG. 1 is a schematic diagram of an application scenario of the camera tracking method provided in this application. Referring to FIG. 1, a guiding system is deployed at the local conference site; it can track and shoot the speaking objects at the site and transmit the captured video stream to other sites in real time. The guiding system includes a camera support 101, guiding cameras 102, a first camera 103, a microphone (MIC) array 104, a guiding device 105 and a terminal device 106, where the guiding cameras 102, the first camera 103 and the MIC array 104 are mounted on the camera support.
Optionally, the number of guiding cameras 102 may also be 1, 3, and so on. The video streams captured by the guiding cameras 102 are used for transmission to other sites. It should be noted that when there are multiple guiding cameras 102, at any given moment the video streams of possibly only some of the guiding cameras 102 (one or more) are transmitted to the other sites. The first camera 103 may be a binocular camera and can capture the picture of the entire site; the video captured by the first camera 103 is used only for local processing and is not sent to other sites. The audio collected by the MICs in the MIC array 104 is likewise used only for local processing and is not sent to other sites. This application does not specifically limit the number of guiding cameras in the guiding system, nor the number of MICs in the MIC array.
In actual application, the first camera 103 can collect the video information of the site in real time and transmit the collected video information to the guiding device 105. The MICs in the MIC array 104 can collect the audio of the site in real time and transmit the collected audio information to the guiding device 105. The guiding device 105 can determine, according to the obtained video and audio information, the object that needs to be shot at the current time, and control the shooting angle or focal length of a guiding camera according to the position of that object. The guiding cameras 102 send the captured video streams to the guiding device in real time. Meanwhile, the guiding device 105 also determines which guiding camera's video stream currently needs to be sent to the other sites, and sends that guiding camera's video stream to the terminal device 106, so that the terminal device 106 sends the received video stream to the other sites. Of course, if the terminal device 106 is a video playback device, it may play the received video stream locally.
In this application, in the process in which the guiding device 105 controls the guiding cameras 102, the guiding device 105 can determine the current speaking object according to the information collected by the first camera 103 and the MIC array 104 at the current time, determines historical speech information according to the information collected by the first camera 103 and the MIC array 104 within a historical time period, and controls the guiding cameras according to the current speaking object and the historical speech information. Because the historical speech information can reflect information such as the speaking pattern of the participants at the site and the importance of the participants, the guiding device 105 can control the guiding cameras more precisely according to the current speaking object and the historical speech information, thereby avoiding frequent and unnecessary switching of the guiding cameras.
The technical solutions of this application are described in detail below with reference to specific embodiments. It should be noted that the following specific embodiments may be combined with each other, and the same or similar content is not repeated in different embodiments.
FIG. 2 is a schematic flowchart of the camera tracking method provided in this application. Referring to FIG. 2, the method may include the following steps.
S201. The guiding device determines historical speech information within a preset time period according to first video information collected by the first camera within the preset time period and first audio information collected by the plurality of MICs within the preset time period, where the first camera is used to collect local video.
It should be noted that, for the guiding device, the first camera and the plurality of MICs, refer to the guiding device 105, the first camera 103 and the MIC array 104 in the embodiment of FIG. 1, respectively; details are not repeated here.
Optionally, the preset time period may be a period before the current time, for example, the 30 seconds, 1 minute, 5 minutes or 10 minutes before the current time. This application does not specifically limit the length of the preset time period.
It should be noted that, during a video conference, different times may correspond to different preset time periods. For example, shortly after the conference starts, the preset time period may be short, and after the conference has lasted for a longer time, the preset time period may be longer. For example, within 1 to 5 minutes after the conference starts, the preset time period may be 1 minute long, and after the first 5 minutes of the conference, it may be 5 minutes long. In actual application, the length of the preset time period may be set as required.
Optionally, the video information and audio information corresponding to each time instant within the preset time period may be obtained, the speaking object corresponding to each time instant may be determined according to the video information and audio information corresponding to that time instant, and statistics may be collected on the speaking objects of the individual time instants to obtain the historical speech information. It should be noted that the method for determining the speaking object corresponding to a given time instant is described in detail in the embodiment shown in FIG. 3 and is not repeated here.
Optionally, the historical speech information may include at least one of the following: the number of speaking objects within the preset time period, the total speaking duration of each speaking object, the speech content of each speaking object, the number of speeches of each speaking object, the duration of each speech, the time of each speech, and the priority of each speaking object. Of course, in actual application, the content included in the historical speech information may be determined as required, which is not specifically limited in this application.
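As a rough illustration of this statistics step, the per-instant speaker decisions can be aggregated into historical speech information along the following lines. This is a minimal sketch; the data layout and the fixed one-second sampling interval are illustrative assumptions, not part of this application:

```python
from collections import defaultdict

def aggregate_history(samples, interval=1.0):
    """samples: list of (timestamp, speaker_id or None), one entry per
    sampling instant within the preset time period.
    Returns per-speaker statistics: total speaking duration, number of
    speeches, and the (start, duration) of each speech."""
    stats = defaultdict(lambda: {"total": 0.0, "speeches": []})
    prev = None
    for t, spk in samples:
        if spk is None:          # silence breaks the current speech
            prev = None
            continue
        s = stats[spk]
        s["total"] += interval
        if spk == prev:
            s["speeches"][-1][1] += interval      # extend the ongoing speech
        else:
            s["speeches"].append([t, interval])   # a new speech begins
        prev = spk
    return {k: {"total": v["total"],
                "count": len(v["speeches"]),
                "speeches": [tuple(x) for x in v["speeches"]]}
            for k, v in stats.items()}
```

Speech priority and speech content, also listed above, would come from separate recognition steps and are omitted here.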
S202. The guiding device determines the current speaking object according to second video information collected by the first camera at the current time and second audio information collected by the plurality of MICs at the current time.
Optionally, the current speaking object may be determined in the following feasible manner: obtain the horizontal angle and vertical angle corresponding to each face image in the video information captured by the first camera at the current time; obtain the audio information collected by the plurality of MICs at the current time, and determine from it the horizontal angle and vertical angle corresponding to the sound source at the current time; and determine the object currently speaking according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source.
It should be noted that the method for determining the current speaking object corresponding to the current time is described in detail in the embodiment shown in FIG. 3 and is not repeated here.
S203. The guiding device controls a guiding state of at least one guiding camera according to the historical speech information, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time, where the guiding camera is used to send a guided video stream to other sites.
In this application, the number of guiding cameras may be one or more.
When there is one guiding camera, this camera shoots in real time, and its video stream is sent to the other sites. Correspondingly, the speaking object captured by the at least one guiding camera at the current time is the speaking object captured by this guiding camera at the current time.
When there are multiple guiding cameras, all of them shoot in real time, but at any given moment the video streams of possibly only some of them are sent to the other sites; for example, the video streams of only one or two guiding cameras may be sent to the other sites. Correspondingly, the speaking object captured by the at least one guiding camera at the current time is the speaking object captured at the current time by the guiding camera whose video stream is being sent to the other sites.
Optionally, the guiding state of a guiding camera includes the imaging angle of the guiding camera or the focal length of the guiding camera.
Optionally, assume that the at least one guiding camera includes a first guiding camera and a second guiding camera, and that at the current time the video captured by the first guiding camera is sent to the other sites. Correspondingly, the guiding state of the at least one guiding camera may be controlled in the following feasible manner:
When the current speaking object is the same as the speaking object captured by the first guiding camera at the current time, the guiding device keeps the imaging angle and focal length of the first guiding camera unchanged. Optionally, in this case, the guiding device may also keep the imaging angle and focal length of the second guiding camera unchanged. Of course, the guiding device may also estimate the next speaking object according to the historical speaking objects and the current speaking object, and adjust the imaging angle and focal length of the second guiding camera according to the position of the next speaking object, so that the face image of the estimated next speaking object is located at the imaging target position of the second guiding camera. Optionally, the imaging target position of the second guiding camera may be the centre of the picture of its lens, or slightly above the centre, and so on; in actual application, the imaging target position may be set as required, which is not specifically limited in this application.
When the current speaking object is different from the speaking object captured by the first guiding camera at the current time, the guiding device adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera according to the historical speech information. Optionally, the guiding device may adjust only the imaging angle or focal length of the second guiding camera, may adjust the imaging angles and focal lengths of both the first and second guiding cameras, or may adjust only the imaging angle or focal length of the first guiding camera. It should be noted that the control process of the at least one guiding camera in this case is described in detail in the embodiment shown in FIG. 5 and is not repeated here.
With the camera tracking method provided in this application, during video guiding the guiding device controls the guiding state of at least one guiding camera according to the determined historical speech information within the preset time period, the current speaking object, and the speaking object captured by the at least one guiding camera at the current time. Because the historical speech information can reflect information such as the speaking pattern of the participants at the site and the importance of the participants, the guiding device can control the guiding camera more precisely, thereby avoiding frequent and unnecessary switching of the guiding camera.
On the basis of the embodiment shown in FIG. 2, optionally, in S201 the speaking object corresponding to each time instant within the preset time period needs to be determined in the course of determining the historical speech information, and in S202 the current speaking object corresponding to the current time needs to be determined. The process of determining the speaking object corresponding to each time instant within the preset time period is similar to the process of determining the current speaking object. Below, taking the process of determining the speaking object corresponding to a first time instant (any time instant within the preset time period, or the current time) as an example, the process of determining the speaking object at a given time instant is described in detail; see the embodiment shown in FIG. 3. It should be noted that in the embodiment of FIG. 3 the first camera is taken to be a binocular camera.
FIG. 3 is a schematic flowchart of the method for determining a speaking object provided in this application. Referring to FIG. 3, the method may include the following steps.
S301. The guiding device determines, according to the video information at the first time instant, the horizontal angle and vertical angle corresponding to each face image.
Below, the horizontal and vertical angles corresponding to each face image are determined with reference to the physical coordinate system shown in FIG. 4 and the following steps A to D.
FIG. 4 shows the physical coordinate system provided in this application. Referring to FIG. 4, the camera support includes a horizontal support M and a vertical support N, which are perpendicular to each other; the intersection of the horizontal support M and the vertical support N is the midpoint O1 of the horizontal support. The midpoint of the horizontal support M is O1, and the midpoint of the vertical support N is O2. The first camera, mounted on the horizontal support M, includes a lens A1 and a lens A2, arranged symmetrically about O1. In this application, the binocular coordinate system is a three-dimensional coordinate system (not shown in the figure) that may take either lens A1 or lens A2 as its origin.
Step A: Obtain the two-dimensional coordinates of each face image in the two lenses of the binocular camera.
Optionally, the two-dimensional coordinates of a face image in a lens may be the two-dimensional coordinates of the face image in the picture captured by that lens. A binocular camera usually has two lenses separated by a certain distance, so that when both lenses capture the same object, the object's two-dimensional coordinates in the two lenses differ.
For example, referring to FIG. 4, the two lenses of the binocular camera are lens A1 and lens A2. At the same instant, the image that lens A1 captures of object P may be as shown in image P1, where the face image of object P lies on the left side of image P1, and the image that lens A2 captures of object P may be as shown in image P2, where the face image of object P lies on the right side of image P2. It follows that the two-dimensional coordinates of the face image of object P in lens A1 differ from its two-dimensional coordinates in lens A2.
Step B: Determine the depth of each face image according to the distance between the two lenses of the binocular camera and the two-dimensional coordinates of each face in the two lenses.
Here, the depth of a face image is the distance between the face and the binocular camera.
Referring to FIG. 4, the distance between the two lenses is the distance between lens A1 and lens A2. The depth of the face image is the distance between object P and the horizontal support M. For example, a perpendicular may be dropped from object P to the line on which the horizontal support M lies; the distance between object P and the foot of that perpendicular is the depth s of the face image.
It should be noted that the depth of the face image in step B can be determined by formulas known in the prior art, which are not repeated in this application.
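The known formula deferred to here is, for a rectified binocular pair under the standard pinhole model, depth = focal length × baseline / disparity, where disparity is the horizontal offset of the same face between the two lens images described in step A. A minimal sketch (the pixel focal length and baseline in the example are illustrative assumptions):

```python
def stereo_depth(x_left, x_right, focal_px, baseline_m):
    """Depth of a point from its horizontal pixel coordinates in the left
    and right images of a rectified binocular camera.

    x_left, x_right: horizontal pixel coordinate of the face in each lens
    focal_px: focal length in pixels; baseline_m: lens separation in metres
    Returns depth in metres via depth = f * b / disparity."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("point must have positive disparity")
    return focal_px * baseline_m / disparity
```

For instance, a face seen 20 pixels further left in the right image, with an 800-pixel focal length and a 10 cm baseline, sits 4 m from the camera.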
Step C: The guiding device determines, according to the depth of each face image, the three-dimensional coordinates of each face image in the binocular coordinate system, where the binocular coordinate system is a three-dimensional coordinate system whose origin is one lens of the binocular camera.
Referring to FIG. 4, the binocular coordinate system may be a three-dimensional coordinate system with lens A1 as its origin, or a three-dimensional coordinate system with lens A2 as its origin.
It should be noted that the three-dimensional coordinates in step C can be determined by formulas known in the prior art, which are not repeated in this application.
Step D: Determine, according to the three-dimensional coordinates of each face image in the binocular coordinate system, the horizontal angle and vertical angle corresponding to each face image.
Referring to FIG. 4, the horizontal angle is the angle α between the line PO1 and the horizontal support, and the vertical angle is the angle β between the line PO2 and the vertical support.
Optionally, the horizontal angle α corresponding to a face image may be determined by the following Formula 1:
α = arctan( z / (x + dx) )    (Formula 1)
where (x, y, z) are the three-dimensional coordinates of the face image in the binocular coordinate system, and dx is the distance between one lens and the midpoint of the horizontal support; for example, referring to FIG. 4, dx is the distance between A1 and O1.
Optionally, the vertical angle β corresponding to a face image may be determined by the following Formula 2:
β = arctan( z / (y + dy) )    (Formula 2)
where (x, y, z) are the three-dimensional coordinates of the face image in the binocular coordinate system, and dy is half the length of the vertical support; for example, referring to FIG. 4, dy may be the distance between O1 and O2.
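The two angle computations of step D can be transcribed directly: given the face's coordinates (x, y, z) in the binocular coordinate system and the offsets dx and dy, each angle follows from one arctangent. Note the arctangent form here is an assumed reading of Formula 1 and Formula 2, since the original equation images are not reproduced in this text:

```python
import math

def face_angles(x, y, z, dx, dy):
    """Horizontal angle alpha (line P-O1 against the horizontal support) and
    vertical angle beta (line P-O2 against the vertical support), in degrees,
    for a face at (x, y, z) in the binocular coordinate system whose origin
    is one lens of the binocular camera.
    dx: distance from that lens to the midpoint O1 of the horizontal support
    dy: half the vertical support length (the distance O1-O2)"""
    alpha = math.degrees(math.atan2(z, x + dx))  # Formula 1 (assumed form)
    beta = math.degrees(math.atan2(z, y + dy))   # Formula 2 (assumed form)
    return alpha, beta
```

A face one metre in front of the support and one metre off each axis midpoint yields 45° for both angles.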
S302. The guiding device determines, according to the audio information corresponding to the first time instant, the horizontal angle and vertical angle corresponding to the sound source at the first time instant.
In this application, because the plurality of MICs are mounted at different positions on the camera support, different MICs collect different audio information for the same sound source; for example, the amplitude or phase of the audio collected by different MICs differs.
It should be noted that the horizontal and vertical angles corresponding to the sound source at the first time instant can be determined by methods known in the prior art, which are not repeated in this application.
S303. Determine the speaking object corresponding to the first time instant according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source.
Optionally, the speaking object corresponding to the first time instant may be determined in the following feasible manner:
Determine the distance between the face corresponding to each face image and the sound source according to the horizontal angle and vertical angle corresponding to each face image and the horizontal angle and vertical angle corresponding to the sound source, and determine the speaking object corresponding to the first time instant according to the distance between the face corresponding to each face image and the sound source.
Optionally, the distance between a face and the sound source may be determined by the following Formula 3:
d = √( (α − α₁)² + (β − β₁)² )    (Formula 3)
where α is the horizontal angle corresponding to the face image, β is the vertical angle corresponding to the face image, α₁ is the horizontal angle corresponding to the sound source, and β₁ is the vertical angle corresponding to the sound source.
In the embodiment shown in FIG. 3, when determining the speaking object corresponding to the first time instant, the speaking object is determined by combining the video information captured by the binocular camera at the first time instant and the audio information collected by the plurality of MICs at the first time instant. The approximate position of the speaking object can be determined preliminarily from the audio information collected by the plurality of MICs at the first time instant, and the position of each face at the site can then be determined accurately from the video information captured by the binocular camera at the first time instant; therefore, combining the video and audio information allows the speaking object to be determined precisely.
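Putting the pieces of FIG. 3 together, the per-instant speaker decision reduces to choosing the face whose (α, β) direction lies closest to the sound-source direction in the sense of Formula 3. A minimal sketch; the 10-degree acceptance threshold is an illustrative assumption, not from this application:

```python
import math

def pick_speaker(faces, src_alpha, src_beta, max_dist=10.0):
    """faces: dict face_id -> (alpha, beta), angles in degrees (S301).
    src_alpha, src_beta: sound-source angles in degrees (S302).
    Returns the face whose direction is closest to the sound source by the
    angular distance of Formula 3, or None if no face is within max_dist."""
    best, best_d = None, float("inf")
    for fid, (a, b) in faces.items():
        d = math.hypot(a - src_alpha, b - src_beta)  # Formula 3
        if d < best_d:
            best, best_d = fid, d
    return best if best_d <= max_dist else None
```

Returning None when every face is far from the sound source lets the caller treat the instant as silence or noise rather than forcing a match.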
On the basis of any of the foregoing embodiments, the control process of the guiding cameras is described in detail below with reference to the embodiment shown in FIG. 5, taking as an example the case where the guiding cameras include a first guiding camera and a second guiding camera and the video captured by the first guiding camera is being sent to the other sites at the current time.
It should be noted that when the guiding device judges that the current speaking object is the same as the speaking object captured by the first guiding camera at the current time, the guiding cameras do not need to be adjusted. The guiding cameras need adjusting only when the guiding device judges that the current speaking object is different from the speaking object captured by the first guiding camera at the current time. The embodiment shown in FIG. 5 describes the adjustment process of the guiding cameras only for this latter case.
FIG. 5 is a schematic flowchart of the method for controlling a guiding camera provided in this application. Referring to FIG. 5, the method may include the following steps.
S501. The guiding device determines the number of valid speeches of each speaking object according to the priority of each speaking object within the preset time period and the duration of each speech of each speaking object within the preset time period.
In this application, each participant at the site corresponds to a priority.
Optionally, the participants' priorities may be preconfigured in the guiding device by an administrator. For example, the administrator may determine each participant's priority according to the participant's role in the conference or the participant's post, and import the participants' information (for example, face information or voice features) together with the corresponding priorities into the guiding device. In this way, during the conference, the guiding device can determine a participant's priority from the collected video or audio information.
Optionally, at the start of the conference, the guiding device may also determine each participant's seat at the site according to the video information captured by the first camera, and assign each participant a priority according to the participant's seat, where the seats at the site have preset priorities. For example, the participant seated at the centre of the site has the highest priority, and the farther a seat is from the centre of the site, the lower the priority of its occupant.
While the conference is in progress, the guiding device may also update the participants' priorities according to each participant's speaking behaviour. For example, a participant's priority may be raised when the participant speaks often or speaks for a long time, lowered when the participant speaks rarely or briefly, and raised when the participant's speech contains preset keywords (for example, "comrades", "everyone", "keep it up", "work hard"). Of course, in actual application, the priorities may also be updated according to other content, which is not specifically limited in this application.
Optionally, different priorities correspond to different valid-speech duration thresholds: the higher the priority, the lower the threshold. When the duration of one speech of a speaking object exceeds the valid-speech duration threshold corresponding to that speaking object's priority, the speech may be determined to be a valid speech.
Optionally, for any first speaking object within the preset time period, when the number of valid speeches of the first speaking object needs to be obtained, first obtain the priority of the first speaking object within the preset time period and the valid-speech duration threshold corresponding to that priority; then obtain the duration of each speech of the first speaking object within the preset time period and count the speeches whose duration exceeds the threshold; that count is the number of valid speeches.
It should be noted that in actual application the valid-speech duration threshold corresponding to each priority may be set as required, which is not specifically limited in this application.
S502. The guiding device determines a speaking object whose number of valid speeches is greater than or equal to 1 as a valid speaking object.
Optionally, if the number of valid speeches of a speaking object is greater than or equal to 1, the speaking object may be determined as a valid speaking object.
S503. The guiding device determines the number of valid speaking objects as the number of valid speakers.
After the guiding device determines whether each speaking object is a valid speaking object, it can count the valid speaking objects to obtain the number of valid speakers.
It should be noted that the guiding device can determine the number of valid speakers within the preset time period through S501 to S503; of course, in actual application, the guiding device may also determine the number of valid speakers in other feasible manners, which is not specifically limited in this application.
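S501 to S503 in miniature: count, per speaker, the speeches whose duration clears that speaker's priority-dependent threshold, then keep the speakers with at least one valid speech. The priority levels and threshold table in the example are illustrative assumptions:

```python
def count_valid_speakers(speeches, priority, thresholds):
    """speeches: dict speaker_id -> list of speech durations (seconds)
    within the preset period; priority: dict speaker_id -> priority level;
    thresholds: dict priority level -> minimum duration for one speech to
    count as valid (a higher priority maps to a lower threshold).
    Returns (number of valid speakers, set of valid speaker ids)."""
    valid = set()
    for spk, durations in speeches.items():
        thr = thresholds[priority[spk]]
        # S501: valid speeches are those longer than the speaker's threshold
        valid_count = sum(1 for d in durations if d > thr)
        if valid_count >= 1:     # S502: at least one valid speech
            valid.add(spk)
    return len(valid), valid     # S503: count of valid speaking objects
```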
S504. The guiding device determines the speaking mode within the preset time period according to the number of valid speakers within the preset time period.
Optionally, the speaking mode includes at least one of a single-person speaking mode, a two-person debate mode and a multi-person discussion mode. Of course, other speaking modes are possible, for example a three-person debate mode; this application does not specifically limit the speaking modes.
Optionally, the method for determining the speaking mode differs with the number of valid speakers, in at least the following three possible implementations:
First possible implementation: the number of valid speakers is 1.
In this implementation, because there is only one valid speaking object within the preset time period, the guiding device may determine that the speaking mode within the preset time period is the single-person speaking mode.
Second possible implementation: the number of valid speakers is 2.
In this implementation, when the number of valid speakers is 2, there are two valid speaking objects within the preset time period. If the two valid speaking objects speak alternately and alternate many times, it may be determined that the two valid speaking objects are debating, and the speaking mode may be determined to be the two-person debate mode; otherwise, the speaking mode is determined to be the single-person speaking mode.
Third possible implementation: the number of valid speakers is greater than 2.
In this implementation, the guiding device determines, according to the priorities of the at least two valid speaking objects within the preset time period, that the speaking mode within the preset time period is the single-person speaking mode or the multi-person discussion mode.
Optionally, whether the at least two valid speaking objects include an important speaking object may be judged according to the priorities of the at least two valid speaking objects, where the priority of an important speaking object may be a preset priority, or may be higher than a preset priority.
If the at least two valid speaking objects include an important speaking object, the guiding device may determine that the speaking mode within the preset time period is the single-person speaking mode. If the at least two valid speaking objects do not include an important speaking object, the guiding device may determine that the speaking mode within the preset time period is the multi-person discussion mode.
After the guiding device determines the speaking mode within the preset time period, the guiding device adjusts the imaging angle or focal length of at least one of the first guiding camera and the second guiding camera according to the speaking mode within the preset time period, where the adjustment process of the guiding cameras differs with the speaking mode.
Optionally, when the speaking mode is determined to be the single-person speaking mode, the guiding cameras may be adjusted through S505 to S507; when the speaking mode is determined to be the two-person debate mode, through S508 to S512; and when the speaking mode is determined to be the multi-person discussion mode, through S513 to S517.
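The three branches of S504 can be condensed into one decision function. A minimal sketch; the alternation threshold of 3 is an illustrative assumption standing in for "alternate many times":

```python
def speaking_mode(valid, alternations=0, important=None):
    """valid: set of valid speaking objects within the preset period.
    alternations: number of speaker alternations observed, consulted only
    when there are exactly two valid speakers.
    important: the important speaking object, if one was identified.
    Returns 'single', 'debate' or 'discussion'."""
    if len(valid) == 1:
        return "single"                  # first implementation
    if len(valid) == 2:                  # second implementation
        return "debate" if alternations >= 3 else "single"
    # third implementation: more than two valid speakers
    return "single" if important in valid else "discussion"
```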
S505. The guiding device determines a target speaking object among the valid speaking objects within the preset time period.
In this application, regardless of the number of speakers within the preset time period, the speaking mode within the preset time period may be determined to be the single-person speaking mode. Correspondingly, when the speaking mode has been determined to be the single-person speaking mode, the way the target speaking object is determined differs with the number of valid speakers within the preset time period, specifically in at least the following three possible implementations:
First possible implementation: the number of valid speakers within the preset time period is 1.
In this implementation, the single valid speaking object within the preset time period may be determined as the target speaking object.
Second possible implementation: the number of valid speakers within the preset time period is 2.
In this implementation, the target speaking object may be determined from the two valid speaking objects according to their priorities. For example, the valid speaking object with the higher priority of the two may be determined as the target speaking object.
Third possible implementation: the number of valid speakers within the preset time period is greater than 2.
In this implementation, the important speaking object may be determined according to the priorities of the valid speaking objects within the preset time period, and the important speaking object is determined as the target speaking object.
It should be noted that in actual application the target speaking object may also be determined in other feasible manners, which is not specifically limited in this application.
S506. The guiding device adjusts the imaging angle or focal length of the second guiding camera so that the face image of the target speaking object is located at the imaging target position of the second guiding camera.
At the current time, the video stream captured by the first guiding camera is being transmitted to the other sites. When the speaking mode is determined to be the single-person speaking mode, in order to prevent zooming and panning from appearing in the picture of the video stream transmitted to the other sites, the second guiding camera is adjusted; while the second guiding camera is being adjusted, the video stream captured by the first guiding camera continues to be transmitted to the other sites.
It should be noted that if at the current time the video streams captured by both the first and the second guiding cameras are being transmitted to the other sites, either guiding camera may be adjusted, or the guiding camera requiring the smaller adjustment may be adjusted.
Optionally, the imaging angle or focal length of the second guiding camera may be adjusted through the following steps A to D:
Step A: The guiding device obtains the three-dimensional coordinates of the target speaking object in the binocular coordinate system.
It should be noted that for the execution of this step A, refer to steps A to C in S301; details are not repeated here.
Step B: The guiding device determines the three-dimensional coordinates of the target speaking object in the guiding coordinate system according to the extrinsic parameters between the binocular camera (the first camera) and the second guiding camera.
Here, the guiding coordinate system is a three-dimensional coordinate system whose origin is the initial position of the second guiding camera.
Optionally, the extrinsic parameters between the binocular camera and the second guiding camera include the distance between the binocular camera and the second guiding camera, and the like.
Step C: The guiding device determines the two-dimensional coordinates of the target speaking object in the lens of the second guiding camera according to the three-dimensional coordinates of the target speaking object in the guiding coordinate system.
Step D: Calculate the focal length of the second guiding camera and the horizontal and vertical angles through which the second guiding camera needs to rotate, according to the distance between the target speaking object and the second guiding camera, the two-dimensional coordinates of the target speaking object in the lens of the second guiding camera, and the target position to be reached.
It should be noted that after the second guiding camera has been adjusted through steps A to D above, it may further be judged whether the face captured by the second guiding camera is at the target position of the lens; if not, the second guiding camera is fine-tuned until the face image of the target speaking object is located at the imaging target position of the second guiding camera.
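Steps A to D in miniature: carry the target's 3D position from the binocular coordinate system into the guiding camera's coordinate system, then derive the pan and tilt angles that centre the face. This sketch models the extrinsic parameters as a pure translation by the inter-camera offset (a full extrinsic matrix would also rotate), and the returned distance stands in for the focal-length computation of step D; all numeric choices are illustrative assumptions:

```python
import math

def retarget(face_xyz, extrinsic_offset):
    """face_xyz: target's 3D coordinates in the binocular coordinate system.
    extrinsic_offset: position of the guiding camera's origin in that same
    system (translation-only stand-in for the extrinsic parameters).
    Returns (pan, tilt) in degrees to centre the face, plus the distance
    from the guiding camera, from which a zoom/focal length can be chosen."""
    x = face_xyz[0] - extrinsic_offset[0]
    y = face_xyz[1] - extrinsic_offset[1]
    z = face_xyz[2] - extrinsic_offset[2]
    pan = math.degrees(math.atan2(x, z))   # horizontal rotation to centre face
    tilt = math.degrees(math.atan2(y, z))  # vertical rotation to centre face
    distance = math.sqrt(x * x + y * y + z * z)
    return pan, tilt, distance
```

The fine-tuning loop described above would then re-detect the face after the move and apply small corrective pan/tilt steps until the face sits at the imaging target position.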
Next, the picture in the video captured by the second guiding camera is described with reference to FIG. 6A.
FIG. 6A is a schematic diagram 1 of a video picture provided in this application. Referring to FIG. 6A, the video picture includes one speaking object, and this valid speaking object is located at the target position of the picture; for example, the target position is the centre of the picture.
S507. The guiding device sends the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends the video stream captured by the second guiding camera to the other sites.
After the guiding device sends the video stream captured by the second guiding camera to the terminal device, the terminal device sends the received video stream to the other sites.
S508. Judge whether the distance between the two valid speaking objects within the preset time period is less than a preset distance.
If so, perform S509 and S510.
If not, perform S511 and S512.
S509. The guiding device adjusts the imaging angle or focal length of the second guiding camera so that the face images corresponding to the two valid speaking objects are located at the imaging target position of the second guiding camera.
It should be noted that for the execution of S509, refer to S506; details are not repeated here.
Next, the picture in the video captured by the second guiding camera is described with reference to FIG. 6B.
FIG. 6B is a schematic diagram 2 of a video picture provided in this application. Referring to FIG. 6B, the video picture includes two valid speaking objects, and the two valid speaking objects are located at the centre of the picture.
S510. The guiding device sends the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends the video stream captured by the second guiding camera to the other sites.
It should be noted that for the execution of S510, refer to S507; details are not repeated here.
S511. The guiding device adjusts the imaging angle or focal length of at least one of the first and second guiding cameras so that the face image corresponding to one of the two valid speaking objects is located at the imaging target position of the first guiding camera and the face image corresponding to the other valid speaking object is located at the imaging target position of the second guiding camera.
It should be noted that for the adjustment of the first or second guiding camera, refer to S506; details are not repeated here.
S512. The guiding device sends the video stream captured by the first guiding camera and the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends both video streams to the other sites.
After receiving the video streams captured by the first and second guiding cameras, the terminal device sends both video streams to the other sites, so that the other sites can play the video streams captured by the first and second guiding cameras simultaneously.
Optionally, before the guiding device sends the two video streams to the terminal device, it may also combine the video stream captured by the first guiding camera with the video stream captured by the second guiding camera and send the combined video stream to the terminal device.
Optionally, before the terminal device sends the video streams captured by the first and second guiding cameras to the other sites, the terminal device may likewise first combine the two video streams and then send the combined video stream to the other sites.
Next, the picture in the video captured by the second guiding camera is described with reference to FIG. 6C.
FIG. 6C is a schematic diagram 3 of a video picture provided in this application. Referring to FIG. 6C, the video picture includes two valid speaking objects, captured by the first guiding camera and the second guiding camera respectively, and the pictures corresponding to the two valid speaking objects are combined by the terminal device.
S513. The guiding device judges whether the distance between the at least two valid speaking objects is less than the preset distance.
If so, perform S514 and S515.
If not, perform S516 and S517.
Optionally, the guiding device may determine the distance between the two valid speaking objects that are farthest apart as the distance between the at least two valid speaking objects.
S514. The guiding device adjusts the imaging angle or focal length of the second guiding camera so that the face images corresponding to the at least two valid speaking objects are located at the imaging target position of the second guiding camera.
It should be noted that for the execution of S514, refer to S506; details are not repeated here.
Next, the picture in the video captured by the second guiding camera is described with reference to FIG. 6D.
FIG. 6D is a schematic diagram 4 of a video picture provided in this application. Referring to FIG. 6D, assuming that the number of the at least two speaking objects is three, the video picture includes three valid speaking objects, and the three valid speaking objects are located at the centre of the picture.
S515. The guiding device sends the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends the video stream captured by the second guiding camera to the other sites.
It should be noted that for the execution of S515, refer to S507; details are not repeated here.
S516. The guiding device adjusts the imaging angle or focal length of the second guiding camera so that the second guiding camera captures panoramic video.
It should be noted that for the execution of S516, refer to S506; details are not repeated here.
Next, the picture in the video captured by the second guiding camera is described with reference to FIG. 6E.
FIG. 6E is a schematic diagram 5 of a video picture provided in this application. Referring to FIG. 6E, the picture captured by the second guiding camera is a panoramic picture, and the panoramic picture includes the face images of all participants at the site.
S517. The guiding device sends the video stream captured by the second guiding camera to the terminal device, so that the terminal device sends the video stream captured by the second guiding camera to the other sites.
It should be noted that for the execution of S517, refer to S507; details are not repeated here.
In the embodiment shown in FIG. 5, the guiding device determines the speaking mode according to the historical speech information within the preset time period and controls the guiding cameras according to the speaking mode. The speaking mode within the preset time period can reflect the real scene of the conference, and the guiding cameras can be controlled accurately according to the real scene of the conference, thereby avoiding frequent and unnecessary switching of the guiding cameras.
图7为本申请提供的摄像跟踪装置的结构示意图一。该摄像跟踪装置可以设置在图1实施例所示的导播设备中。请参见图7,该装置可以包括第一确定模块11、第二确定模块12和控制模块13,其中,
所述第一确定模块11用于,根据第一摄像机在预设时段内采集的第一视频信息、以及多个麦克风MIC在所述预设时段内采集的第一音频信息,确定所述预设时段内的历史发言信息,所述第一摄像机用于采集本地视频;
所述第二确定模块12用于,根据所述第一摄像机在当前时刻采集的第二视频信息和所述多个MIC在当前时刻采集得到的第二音频信息,确定当前发言对象;
所述控制模块13用于,根据所述历史发言信息、所述当前发言对象和至少一个导播摄像机在所述当前时刻拍摄的发言对象,控制所述至少一个导播摄像机的导播状态,所述导播摄像机用于向其它会场发送导播视频流。
本申请提供的摄像跟踪装置可以执行上述方法实施例所示的技术方案,其实现原理以 及有益效果类似,此处不再进行赘述。
在一种可能的实施方式中,导播摄像机的导播状态包括摄像角度或焦距,所述至少一个导播摄像机中包括第一导播摄像机和第二导播摄像机;所述控制模块13具体用于:
当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象相同时,保持所述第一导播摄像机的摄像角度和焦距不变,所述第一导播摄像机拍摄的导频视频在当前时刻被发送至其它会场;
当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象不同时,根据所述历史发言信息,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距。
图8为本申请提供的摄像跟踪装置的结构示意图二。在图7所示实施例的基础上,请参见图8,所述控制模块13包括确定单元131和调整单元132,其中,
所述确定单元131用于,在所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象不同时,根据所述历史发言信息,确定在所述预设时段内的发言模式,所述发言模式包括单人发言模式、双人辩论模式和多人讨论模式中的至少一种;
所述调整单元132用于,根据所述预设时段内的发言模式,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距。
在另一种可能的实施方式中,所述确定单元131具体用于:
根据所述历史发言信息,确定所述预设时段内的有效发言人数;
根据所述预设时段内的有效发言人数,确定在所述预设时段内的发言模式。
在另一种可能的实施方式中,所述确定单元131具体用于:
根据所述预设时段内每一个发言对象的优先级、及每一个发言对象在所述预设时段内每次发言的发言时长,确定每一个发言对象的有效发言次数;
将有效发言次数大于或等于1的发言对象确定为有效发言对象;
将所述有效发言对象的个数确定为所述有效发言人数。
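上述由优先级与每次发言的发言时长确定有效发言次数、进而得到有效发言人数的过程,可以用如下示意代码说明。其中"优先级越高、判定有效发言的时长门限越低"为本文示例所作的假设,数据结构与函数名亦为示意:

```python
def count_valid_speakers(speech_records, base_threshold=5.0):
    """speech_records: {发言对象: {"priority": 优先级, "durations": [每次发言的发言时长]}}。
    有效发言次数大于或等于 1 的发言对象计为有效发言对象,
    有效发言对象的个数即为有效发言人数(示意实现)。"""
    valid = []
    for speaker, info in speech_records.items():
        # 假设的门限规则:优先级越高,有效发言的时长门限越低
        threshold = base_threshold / max(info["priority"], 1)
        effective = sum(1 for d in info["durations"] if d >= threshold)
        if effective >= 1:
            valid.append(speaker)
    return valid, len(valid)
```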
在另一种可能的实施方式中,所述确定单元131具体用于:
当所述有效发言人数为1时,确定所述预设时段内的发言模式为单人发言模式;
当所述有效发言人数为2时,若所述两个有效发言人交替发言,确定所述预设时段内的发言模式为单人发言模式或双人辩论模式;
当所述有效发言人数大于2时,根据所述预设时段内的至少两个有效发言对象的优先级,确定所述预设时段内的发言模式为单人发言模式或多人讨论模式。
在另一种可能的实施方式中,所述确定单元131具体用于:
若所述至少两个有效发言对象中包括重要发言对象,确定所述预设时段内的发言模式为单人发言模式;
若所述至少两个有效发言对象中不包括重要发言对象,确定所述预设时段内的发言模式为多人讨论模式。
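按有效发言人数及重要发言对象确定发言模式的分支逻辑可以示意如下。对"两个有效发言人交替发言"的情形,此处简化假设为双人辩论模式,实际取舍以正文描述为准:

```python
def decide_mode(num_valid, alternating=False, has_important=False):
    """根据有效发言人数确定预设时段内的发言模式(示意实现)。"""
    if num_valid == 1:
        return "单人发言模式"
    if num_valid == 2:
        # 简化假设:两个有效发言对象交替发言则视为双人辩论模式
        return "双人辩论模式" if alternating else "单人发言模式"
    # 人数大于 2:包括重要发言对象则为单人发言模式,否则为多人讨论模式
    return "单人发言模式" if has_important else "多人讨论模式"
```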
在另一种可能的实施方式中,所述发言模式为单人发言模式;所述调整单元132具体用于:
在所述预设时段内有效发言对象中确定目标发言对象;
调节所述第二导播摄像机的摄像角度或焦距,以使所述目标发言对象的人脸图像位于所述第二导播摄像机的摄像目标位置。
在另一种可能的实施方式中,所述调整单元132具体用于:
当所述预设时段内的有效发言人数为1时,则将所述预设时段内的一个有效发言对象确定为所述目标发言对象;
当所述预设时段内的有效发言人数为2时,则根据所述两个有效发言对象的优先级在所述两个有效发言对象中确定目标发言对象;
当所述预设时段内的有效发言人数大于2时,则将在所述预设时段内发言的重要发言对象确定为所述目标发言对象。
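上述目标发言对象的选取规则可以示意如下(数据结构为本文示例所设:每个有效发言对象以 (名称, 优先级, 是否重要发言对象) 的三元组表示):

```python
def select_target(valid_speakers):
    """valid_speakers: [(名称, 优先级, 是否重要发言对象), ...](示意结构)。
    按有效发言人数分三种情况选取目标发言对象。"""
    if len(valid_speakers) == 1:
        return valid_speakers[0][0]
    if len(valid_speakers) == 2:
        # 人数为 2:根据优先级在两个有效发言对象中确定目标发言对象
        return max(valid_speakers, key=lambda s: s[1])[0]
    # 人数大于 2:将重要发言对象确定为目标发言对象
    for name, _, important in valid_speakers:
        if important:
            return name
    return None
```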
在另一种可能的实施方式中,所述发言模式为双人辩论模式;所述调整单元132具体用于:
若所述预设时段内的两个有效发言对象之间的距离小于预设距离,调整所述第二导播摄像机的摄像角度或焦距,以使所述两个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置;
若所述预设时段内的两个有效发言对象之间的距离大于或等于预设距离,调整所述第一导播摄像机和所述第二导播摄像机中至少一个的摄像角度或焦距,以使所述两个有效发言对象中的一个有效发言对象对应的人脸图像位于所述第一导播摄像机的摄像目标位置、另一个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置。
在另一种可能的实施方式中,所述发言模式为多人讨论模式;所述调整单元132具体用于:
若所述预设时段内的至少两个有效发言对象之间的距离小于预设距离,调整所述第二导播摄像机的摄像角度或焦距,以使所述至少两个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置;
若所述预设时段内的至少两个有效发言对象之间的距离大于或等于预设距离,调整所述第二导播摄像机的摄像角度或焦距,以使所述第二导播摄像机拍摄全景视频。
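与 S513-S517 的处理一致,多人讨论模式下第二导播摄像机的取景策略可以示意如下(返回值仅为说明用途的字符串):

```python
def plan_discussion_cameras(max_distance, preset_distance):
    """多人讨论模式(示意):至少两个有效发言对象之间的距离小于预设距离时,
    第二导播摄像机给出多人特写;否则改拍全景视频。"""
    if max_distance < preset_distance:
        return "特写:至少两个有效发言对象"
    return "全景视频"
```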
在另一种可能的实施方式中,所述装置还包括发送模块14,其中,
所述发送模块14用于,在所述调整单元132调整所述第二导播摄像机的摄像角度或焦距之后,向终端设备发送所述第二导播摄像机拍摄的视频流,以使所述终端设备将所述第二导播摄像机拍摄的视频流发送至其它会场。
在另一种可能的实施方式中,所述发送模块14还用于:
在所述调整单元132调整所述第一导播摄像机和所述第二导播摄像机中至少一个的摄像角度或焦距之后,向终端设备发送所述第一导播摄像机拍摄的视频流和所述第二导播摄像机拍摄的视频流,以使所述终端设备将所述第一导播摄像机拍摄的视频流和所述第二导播摄像机拍摄的视频流发送至其它会场。
在另一种可能的实施方式中,所述第一确定模块11具体用于:
根据所述预设时段内各时刻对应的视频信息和音频信息,确定各时刻对应的发言对象;
对各时刻对应的发言对象进行统计,得到所述历史发言信息,所述历史发言信息包括如下信息中的至少一种:所述预设时段内的发言对象个数、每一个发言对象的发言时长、每一个发言对象的发言次数、每一个发言对象的发言内容、每一次发言的发言时长、每一次发言的发言时刻、和每一个发言对象的优先级。
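对各时刻的发言对象进行统计、得到部分历史发言信息的过程可以示意如下(假设相邻时刻间隔为 1 个单位,且发言对象发生切换即记为一次新的发言,均为本文示例所作的简化):

```python
from collections import defaultdict

def build_history(timeline):
    """timeline: [(时刻, 发言对象), ...],按时刻升序排列。
    统计得到发言对象个数、每一个发言对象的发言次数与发言时长(示意实现)。"""
    counts = defaultdict(int)     # 每一个发言对象的发言次数
    durations = defaultdict(int)  # 每一个发言对象的发言时长
    previous = None
    for _, speaker in timeline:
        durations[speaker] += 1
        if speaker != previous:   # 发言对象切换时,记一次新的发言
            counts[speaker] += 1
        previous = speaker
    return {"发言对象个数": len(counts),
            "发言次数": dict(counts),
            "发言时长": dict(durations)}
```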
在另一种可能的实施方式中,针对所述预设时段内的第一时刻,所述第一确定模块11具体用于:
根据所述第一时刻的视频信息,确定每一个人脸图像对应的水平角度和垂直角度;
根据所述第一时刻对应的音频信息,确定在所述第一时刻时的声源对应的水平角度和垂直角度;
根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定所述第一时刻对应的发言对象。
在另一种可能的实施方式中,所述第一摄像机为双目摄像机;所述第一确定模块11具体用于:
根据所述第一时刻的视频信息,获取每一个人脸图像在所述双目摄像机中两个摄像镜头中的二维坐标;
根据所述双目摄像机的两个摄像镜头之间的距离、及每一个人脸图像在所述两个摄像镜头中的二维坐标,确定每一个人脸图像的深度,人脸图像的深度为人脸与所述双目摄像机之间的距离;
根据每一个人脸图像的深度,确定每一个人脸图像在双目坐标系中的三维坐标,所述双目坐标系为以所述双目摄像机的一个摄像镜头为原点的三维坐标系;
根据每一个人脸图像在所述双目坐标系中的三维坐标,确定每一个人脸图像对应的水平角度和垂直角度。
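上述由双目视差确定人脸图像的深度、三维坐标及水平角度和垂直角度的计算可以示意如下(假设两个摄像镜头已标定且行对齐,focal、cx、cy 等内参均为示例参数;深度按 z = focal × baseline / 视差 计算,双目坐标系以左摄像镜头为原点):

```python
import math

def face_angles(uv_left, uv_right, baseline, focal, cx, cy):
    """uv_left / uv_right: 人脸图像在左右两个摄像镜头中的二维坐标 (u, v)。
    返回人脸图像对应的水平角度、垂直角度(单位:度)及深度(示意实现)。"""
    disparity = uv_left[0] - uv_right[0]
    z = focal * baseline / disparity            # 人脸图像的深度
    x = (uv_left[0] - cx) * z / focal           # 双目坐标系中的三维坐标
    y = (uv_left[1] - cy) * z / focal
    horizontal = math.degrees(math.atan2(x, z)) # 水平角度
    vertical = math.degrees(math.atan2(y, z))   # 垂直角度
    return horizontal, vertical, z
```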
在另一种可能的实施方式中,所述第一确定模块11具体用于:
根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定每一个人脸图像对应的人脸与所述声源之间的距离;
根据每一个人脸图像对应的人脸与所述声源之间的距离,确定所述第一时刻对应的发言对象。
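上述按"人脸与声源之间的距离"确定发言对象的匹配过程可以示意如下(以水平角度和垂直角度的欧氏距离作为示意度量,取与声源角度距离最小的人脸对应的对象为该时刻的发言对象):

```python
import math

def match_speaker(face_angles, source_angle):
    """face_angles: {发言对象: (水平角度, 垂直角度)};
    source_angle: 声源对应的 (水平角度, 垂直角度)。(示意实现)"""
    def angle_distance(angles):
        return math.hypot(angles[0] - source_angle[0],
                          angles[1] - source_angle[1])
    # 与声源角度距离最小的人脸对应的对象即为发言对象
    return min(face_angles, key=lambda f: angle_distance(face_angles[f]))
```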
本申请提供的摄像跟踪装置可以执行上述方法实施例所示的技术方案,其实现原理以及有益效果类似,此处不再进行赘述。
图9为本申请提供的导播设备的结构示意图。请参见图9,该导播设备包括:处理器21、存储器22及通信总线23,所述通信总线23用于实现各元器件之间的连接,所述存储器22用于存储程序指令,所述处理器21用于读取所述存储器22中的程序指令,并根据所述存储器22中的程序指令执行上述方法实施例所示的技术方案。
本申请提供的导播设备可以执行上述方法实施例所示的技术方案,其实现原理以及有益效果类似,此处不再进行赘述。
本申请提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,当存储设备的至少一个处理器执行该计算机执行指令时,存储设备执行上述各种可能设计提供的摄像跟踪方法。
本申请提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中。存储设备的至少一个处理器可以从计算机可读存储介质读取该计算机执行指令,至少一个处理器执行该计算机执行指令使得存储设备实施前述方法实施例中的各种可能设计提供的摄像跟踪方法。
本申请提供一种芯片系统,该芯片系统包括处理器,用于支持导播设备实现上述方面中所涉及的功能,例如,处理上述方法中所涉及的信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存导播设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其它分立器件。
本领域普通技术人员可以理解:实现上述各方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成。前述的程序可以存储于一计算机可读取存储介质中。该程序在执行时,执行包括上述各方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (35)

  1. 一种摄像跟踪方法,其特征在于,包括:
    导播设备根据第一摄像机在预设时段内采集的第一视频信息、以及多个麦克风MIC在所述预设时段内采集的第一音频信息,确定所述预设时段内的历史发言信息,所述第一摄像机用于采集本地视频;
    所述导播设备根据所述第一摄像机在当前时刻采集的第二视频信息和所述多个MIC在当前时刻采集得到的第二音频信息,确定当前发言对象;
    所述导播设备根据所述历史发言信息、所述当前发言对象和至少一个导播摄像机在所述当前时刻拍摄的发言对象,控制所述至少一个导播摄像机的导播状态,所述导播摄像机用于向其它会场发送导播视频流。
  2. 根据权利要求1所述的方法,其特征在于,导播摄像机的导播状态包括摄像角度或焦距,所述至少一个导播摄像机中包括第一导播摄像机和第二导播摄像机;所述导播设备根据所述历史发言信息、所述当前发言对象和至少一个导播摄像机在所述当前时刻拍摄的发言对象,控制所述至少一个导播摄像机的导播状态,包括:
    当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象相同时,所述导播设备保持所述第一导播摄像机的摄像角度和焦距不变,所述第一导播摄像机拍摄的导播视频在当前时刻被发送至其它会场;
    当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象不同时,所述导播设备根据所述历史发言信息,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距。
  3. 根据权利要求1或2所述的方法,其特征在于,
    当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象不同时,所述导播设备根据所述历史发言信息,确定在所述预设时段内的发言模式,所述发言模式包括单人发言模式、双人辩论模式和多人讨论模式中的至少一种;
    所述导播设备根据所述预设时段内的发言模式,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距。
  4. 根据权利要求3所述的方法,其特征在于,所述导播设备根据所述历史发言信息,确定在所述预设时段内的发言模式,包括:
    所述导播设备根据所述历史发言信息,确定所述预设时段内的有效发言人数;
    所述导播设备根据所述预设时段内的有效发言人数,确定在所述预设时段内的发言模式。
  5. 根据权利要求4所述的方法,其特征在于,所述导播设备获取所述预设时段内的有效发言人数,包括:
    所述导播设备根据所述预设时段内每一个发言对象的优先级、及每一个发言对象在所述预设时段内每次发言的发言时长,确定每一个发言对象的有效发言次数;
    所述导播设备将有效发言次数大于或等于1的发言对象确定为有效发言对象;
    所述导播设备将所述有效发言对象的个数确定为所述有效发言人数。
  6. 根据权利要求4或5所述的方法,其特征在于,所述导播设备根据所述预设时段内的有效发言人数,确定在所述预设时段内的发言模式,包括:
    当所述有效发言人数为1时,所述导播设备确定所述预设时段内的发言模式为单人发言模式;
    当所述有效发言人数为2时,若所述两个有效发言人交替发言,所述导播设备确定所述预设时段内的发言模式为单人发言模式或双人辩论模式;
    当所述有效发言人数大于2时,所述导播设备根据所述预设时段内的至少两个有效发言对象的优先级,确定所述预设时段内的发言模式为单人发言模式或多人讨论模式。
  7. 根据权利要求6所述的方法,其特征在于,所述导播设备根据所述预设时段内的至少两个有效发言对象的优先级,确定所述预设时段内的发言模式为单人发言模式或多人讨论模式,包括:
    若所述至少两个有效发言对象中包括重要发言对象,所述导播设备确定所述预设时段内的发言模式为单人发言模式;
    若所述至少两个有效发言对象中不包括重要发言对象,所述导播设备确定所述预设时段内的发言模式为多人讨论模式。
  8. 根据权利要求3-6任一项所述的方法,其特征在于,所述发言模式为单人发言模式;所述导播设备根据所述预设时段内的发言模式,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距,包括:
    所述导播设备在所述预设时段内有效发言对象中确定目标发言对象;
    所述导播设备调节所述第二导播摄像机的摄像角度或焦距,以使所述目标发言对象的人脸图像位于所述第二导播摄像机的摄像目标位置。
  9. 根据权利要求8所述的方法,其特征在于,所述导播设备在所述预设时段内有效发言对象中确定目标发言对象,包括:
    当所述预设时段内的有效发言人数为1时,则将所述预设时段内的一个有效发言对象确定为所述目标发言对象;
    当所述预设时段内的有效发言人数为2时,则根据所述两个有效发言对象的优先级在所述两个有效发言对象中确定目标发言对象;
    当所述预设时段内的有效发言人数大于2时,则将在所述预设时段内发言的重要发言对象确定为所述目标发言对象。
  10. 根据权利要求3-6任一项所述的方法,其特征在于,所述发言模式为双人辩论模式;所述导播设备根据所述预设时段内的发言模式,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距,包括:
    若所述预设时段内的两个有效发言对象之间的距离小于预设距离,所述导播设备调整所述第二导播摄像机的摄像角度或焦距,以使所述两个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置;
    若所述预设时段内的两个有效发言对象之间的距离大于或等于预设距离,所述导播设备调整所述第一导播摄像机和所述第二导播摄像机中至少一个的摄像角度或焦距,以使所述两个有效发言对象中的一个有效发言对象对应的人脸图像位于所述第一导播摄像机的摄像目标位置、另一个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置。
  11. 根据权利要求3-6任一项所述的方法,其特征在于,所述发言模式为多人讨论模式;所述导播设备根据所述预设时段内的发言模式,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距,包括:
    若所述预设时段内的至少两个有效发言对象之间的距离小于预设距离,所述导播设备调整所述第二导播摄像机的摄像角度或焦距,以使所述至少两个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置;
    若所述预设时段内的至少两个有效发言对象之间的距离大于或等于预设距离,所述导播设备调整所述第二导播摄像机的摄像角度或焦距,以使所述第二导播摄像机拍摄全景视频。
  12. 根据权利要求8-11任一项所述的方法,其特征在于,所述导播设备调整所述第二导播摄像机的摄像角度或焦距之后,还包括:
    所述导播设备向终端设备发送所述第二导播摄像机拍摄的视频流,以使所述终端设备将所述第二导播摄像机拍摄的视频流发送至其它会场。
  13. 根据权利要求10所述的方法,其特征在于,所述导播设备调整所述第一导播摄像机和所述第二导播摄像机中至少一个的摄像角度或焦距之后,还包括:
    所述导播设备向终端设备发送所述第一导播摄像机拍摄的视频流和所述第二导播摄像机拍摄的视频流,以使所述终端设备将所述第一导播摄像机拍摄的视频流和所述第二导播摄像机拍摄的视频流发送至其它会场。
  14. 根据权利要求1-13任一项所述的方法,其特征在于,所述导播设备根据第一摄像机在预设时段内采集的第一视频信息、以及多个麦克风MIC在所述预设时段内采集的第一音频信息,确定所述预设时段内的历史发言信息,包括:
    所述导播设备根据所述预设时段内各时刻对应的视频信息和音频信息,确定各时刻对应的发言对象;
    所述导播设备对各时刻对应的发言对象进行统计,得到所述历史发言信息,所述历史发言信息包括如下信息中的至少一种:所述预设时段内的发言对象个数、每一个发言对象的发言时长、每一个发言对象的发言次数、每一个发言对象的发言内容、每一次发言的发言时长、每一次发言的发言时刻、和每一个发言对象的优先级。
  15. 根据权利要求14所述的方法,其特征在于,针对所述预设时段内的第一时刻,根据所述第一时刻的视频信息和音频信息,确定所述第一时刻对应的发言对象,包括:
    所述导播设备根据所述第一时刻的视频信息,确定每一个人脸图像对应的水平角度和垂直角度;
    所述导播设备根据所述第一时刻对应的音频信息,确定在所述第一时刻时的声源对应的水平角度和垂直角度;
    所述导播设备根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定所述第一时刻对应的发言对象。
  16. 根据权利要求15所述的方法,其特征在于,所述第一摄像机为双目摄像机;所述导播设备根据所述第一时刻的视频信息,确定每一个人脸图像对应的水平角度和垂直角度,包括:
    所述导播设备根据所述第一时刻的视频信息,获取每一个人脸图像在所述双目摄像机中两个摄像镜头中的二维坐标;
    所述导播设备根据所述双目摄像机的两个摄像镜头之间的距离、及每一个人脸图像在所述两个摄像镜头中的二维坐标,确定每一个人脸图像的深度,人脸图像的深度为人脸与所述双目摄像机之间的距离;
    所述导播设备根据每一个人脸图像的深度,确定每一个人脸图像在双目坐标系中的三维坐标,所述双目坐标系为以所述双目摄像机的一个摄像镜头为原点的三维坐标系;
    所述导播设备根据每一个人脸图像在所述双目坐标系中的三维坐标,确定每一个人脸图像对应的水平角度和垂直角度。
  17. 根据权利要求15或16所述的方法,其特征在于,所述根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定所述第一时刻对应的发言对象,包括:
    所述导播设备根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定每一个人脸图像对应的人脸与所述声源之间的距离;
    所述导播设备根据每一个人脸图像对应的人脸与所述声源之间的距离,确定所述第一时刻对应的发言对象。
  18. 一种摄像跟踪装置,其特征在于,包括第一确定模块、第二确定模块和控制模块,其中,
    所述第一确定模块用于,根据第一摄像机在预设时段内采集的第一视频信息、以及多个麦克风MIC在所述预设时段内采集的第一音频信息,确定所述预设时段内的历史发言信息,所述第一摄像机用于采集本地视频;
    所述第二确定模块用于,根据所述第一摄像机在当前时刻采集的第二视频信息和所述多个MIC在当前时刻采集得到的第二音频信息,确定当前发言对象;
    所述控制模块用于,根据所述历史发言信息、所述当前发言对象和至少一个导播摄像机在所述当前时刻拍摄的发言对象,控制所述至少一个导播摄像机的导播状态,所述导播摄像机用于向其它会场发送导播视频流。
  19. 根据权利要求18所述的装置,其特征在于,导播摄像机的导播状态包括摄像角度或焦距,所述至少一个导播摄像机中包括第一导播摄像机和第二导播摄像机;所述控制模块具体用于:
    当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象相同时,保持所述第一导播摄像机的摄像角度和焦距不变,所述第一导播摄像机拍摄的导播视频在当前时刻被发送至其它会场;
    当所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象不同时,根据所述历史发言信息,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距。
  20. 根据权利要求18或19所述的装置,其特征在于,所述控制模块包括确定单元和调整单元,其中,
    所述确定单元用于,在所述当前发言对象和第一导播摄像机在所述当前时刻拍摄的发言对象不同时,根据所述历史发言信息,确定在所述预设时段内的发言模式,所述发言模式包括单人发言模式、双人辩论模式和多人讨论模式中的至少一种;
    所述调整单元用于,根据所述预设时段内的发言模式,调整所述第一导播摄像机和所述第二导播摄像机中的至少一个的摄像角度或焦距。
  21. 根据权利要求20所述的装置,其特征在于,所述确定单元具体用于:
    根据所述历史发言信息,确定所述预设时段内的有效发言人数;
    根据所述预设时段内的有效发言人数,确定在所述预设时段内的发言模式。
  22. 根据权利要求21所述的装置,其特征在于,所述确定单元具体用于:
    根据所述预设时段内每一个发言对象的优先级、及每一个发言对象在所述预设时段内每次发言的发言时长,确定每一个发言对象的有效发言次数;
    将有效发言次数大于或等于1的发言对象确定为有效发言对象;
    将所述有效发言对象的个数确定为所述有效发言人数。
  23. 根据权利要求21或22所述的装置,其特征在于,所述确定单元具体用于:
    当所述有效发言人数为1时,确定所述预设时段内的发言模式为单人发言模式;
    当所述有效发言人数为2时,若所述两个有效发言人交替发言,确定所述预设时段内的发言模式为单人发言模式或双人辩论模式;
    当所述有效发言人数大于2时,根据所述预设时段内的至少两个有效发言对象的优先级,确定所述预设时段内的发言模式为单人发言模式或多人讨论模式。
  24. 根据权利要求23所述的装置,其特征在于,所述确定单元具体用于:
    若所述至少两个有效发言对象中包括重要发言对象,确定所述预设时段内的发言模式为单人发言模式;
    若所述至少两个有效发言对象中不包括重要发言对象,确定所述预设时段内的发言模式为多人讨论模式。
  25. 根据权利要求20-23任一项所述的装置,其特征在于,所述发言模式为单人发言模式;所述调整单元具体用于:
    在所述预设时段内有效发言对象中确定目标发言对象;
    调节所述第二导播摄像机的摄像角度或焦距,以使所述目标发言对象的人脸图像位于所述第二导播摄像机的摄像目标位置。
  26. 根据权利要求25所述的装置,其特征在于,所述调整单元具体用于:
    当所述预设时段内的有效发言人数为1时,则将所述预设时段内的一个有效发言对象确定为所述目标发言对象;
    当所述预设时段内的有效发言人数为2时,则根据所述两个有效发言对象的优先级在所述两个有效发言对象中确定目标发言对象;
    当所述预设时段内的有效发言人数大于2时,则将在所述预设时段内发言的重要发言对象确定为所述目标发言对象。
  27. 根据权利要求20-23任一项所述的装置,其特征在于,所述发言模式为双人辩论模式;所述调整单元具体用于:
    若所述预设时段内的两个有效发言对象之间的距离小于预设距离,调整所述第二导播摄像机的摄像角度或焦距,以使所述两个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置;
    若所述预设时段内的两个有效发言对象之间的距离大于或等于预设距离,调整所述第一导播摄像机和所述第二导播摄像机中至少一个的摄像角度或焦距,以使所述两个有效发言对象中的一个有效发言对象对应的人脸图像位于所述第一导播摄像机的摄像目标位置、另一个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置。
  28. 根据权利要求20-23任一项所述的装置,其特征在于,所述发言模式为多人讨论模式;所述调整单元具体用于:
    若所述预设时段内的至少两个有效发言对象之间的距离小于预设距离,调整所述第二导播摄像机的摄像角度或焦距,以使所述至少两个有效发言对象对应的人脸图像位于所述第二导播摄像机的摄像目标位置;
    若所述预设时段内的至少两个有效发言对象之间的距离大于或等于预设距离,调整所述第二导播摄像机的摄像角度或焦距,以使所述第二导播摄像机拍摄全景视频。
  29. 根据权利要求25-28任一项所述的装置,其特征在于,所述装置还包括发送模块,其中,
    所述发送模块用于,在所述调整单元调整所述第二导播摄像机的摄像角度或焦距之后,向终端设备发送所述第二导播摄像机拍摄的视频流,以使所述终端设备将所述第二导播摄像机拍摄的视频流发送至其它会场。
  30. 根据权利要求27所述的装置,其特征在于,所述发送模块还用于:
    在所述调整单元调整所述第一导播摄像机和所述第二导播摄像机中至少一个的摄像角度或焦距之后,向终端设备发送所述第一导播摄像机拍摄的视频流和所述第二导播摄像机拍摄的视频流,以使所述终端设备将所述第一导播摄像机拍摄的视频流和所述第二导播摄像机拍摄的视频流发送至其它会场。
  31. 根据权利要求18-30任一项所述的装置,其特征在于,所述第一确定模块具体用于:
    根据所述预设时段内各时刻对应的视频信息和音频信息,确定各时刻对应的发言对象;
    对各时刻对应的发言对象进行统计,得到所述历史发言信息,所述历史发言信息包括如下信息中的至少一种:所述预设时段内的发言对象个数、每一个发言对象的发言时长、每一个发言对象的发言次数、每一个发言对象的发言内容、每一次发言的发言时长、每一次发言的发言时刻、和每一个发言对象的优先级。
  32. 根据权利要求31所述的装置,其特征在于,针对所述预设时段内的第一时刻,所述第一确定模块具体用于:
    根据所述第一时刻的视频信息,确定每一个人脸图像对应的水平角度和垂直角度;
    根据所述第一时刻对应的音频信息,确定在所述第一时刻时的声源对应的水平角度和垂直角度;
    根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定所述第一时刻对应的发言对象。
  33. 根据权利要求32所述的装置,其特征在于,所述第一摄像机为双目摄像机;所述第一确定模块具体用于:
    根据所述第一时刻的视频信息,获取每一个人脸图像在所述双目摄像机中两个摄像镜头中的二维坐标;
    根据所述双目摄像机的两个摄像镜头之间的距离、及每一个人脸图像在所述两个摄像镜头中的二维坐标,确定每一个人脸图像的深度,人脸图像的深度为人脸与所述双目摄像机之间的距离;
    根据每一个人脸图像的深度,确定每一个人脸图像在双目坐标系中的三维坐标,所述双目坐标系为以所述双目摄像机的一个摄像镜头为原点的三维坐标系;
    根据每一个人脸图像在所述双目坐标系中的三维坐标,确定每一个人脸图像对应的水平角度和垂直角度。
  34. 根据权利要求32或33所述的装置,其特征在于,所述第一确定模块具体用于:
    根据每一个人脸图像对应的水平角度和垂直角度、及所述声源对应的水平角度和垂直角度,确定每一个人脸图像对应的人脸与所述声源之间的距离;
    根据每一个人脸图像对应的人脸与所述声源之间的距离,确定所述第一时刻对应的发言对象。
  35. 一种导播设备,其特征在于,包括:处理器、存储器及通信总线,所述通信总线用于实现各元器件之间的连接,所述存储器用于存储程序指令,所述处理器用于读取所述存储器中的程序指令,并根据所述存储器中的程序指令执行权利要求1-17任一项所述的方法。
PCT/CN2018/099340 2017-08-16 2018-08-08 摄像跟踪方法、装置及设备 WO2019033968A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18846441.6A EP3657781B1 (en) 2017-08-16 2018-08-08 Camera tracking method and apparatus, and device
US16/791,268 US10873666B2 (en) 2017-08-16 2020-02-14 Camera tracking method and director device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710702192.4A CN109413359B (zh) 2017-08-16 2017-08-16 摄像跟踪方法、装置及设备
CN201710702192.4 2017-08-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/791,268 Continuation US10873666B2 (en) 2017-08-16 2020-02-14 Camera tracking method and director device

Publications (1)

Publication Number Publication Date
WO2019033968A1 true WO2019033968A1 (zh) 2019-02-21

Also Published As

Publication number Publication date
EP3657781B1 (en) 2022-07-27
US20200186649A1 (en) 2020-06-11
EP3657781A4 (en) 2020-07-15
US10873666B2 (en) 2020-12-22
CN109413359A (zh) 2019-03-01
EP3657781A1 (en) 2020-05-27
CN109413359B (zh) 2020-07-28

