WO2019011189A1

WO2019011189A1 - Audio and video acquisition method and apparatus for conference television, and terminal device

Info

Publication number: WO2019011189A1
Application number: PCT/CN2018/094807
Authority: WO
Inventors: 张泽良
Original assignee: 中兴通讯股份有限公司
Priority date: 2017-07-12
Filing date: 2018-07-06
Publication date: 2019-01-17
Also published as: CN109257558A

Abstract

An audio and video acquisition method for a conference television comprises: acquiring audio data obtained by an audio and video acquisition device by means of sound acquisition, and determining an audio and video source position of a speech in a conference site according to the audio data; moving, according to the audio and video source position, the position of the audio and video acquisition device to a sound acquisition position that satisfies a preset sound acquisition condition; and moving, according to the audio and video source position and the sound acquisition position, the position of the audio and video acquisition device to an image acquisition position that satisfies a preset image acquisition condition.

Description

Audio and video collection method, device and terminal device for conference television

Technical field

The present disclosure relates to, but is not limited to, the field of conference television systems, and in particular, to an audio and video collection method, apparatus and terminal device for conference television.

Background

During a conference, the speaker is often outside the camera's capture range, or the sound picked up by the microphone is unclear. The camera angle or the microphone position can then be adjusted manually so that the camera or microphone captures the best video or audio.

For example, in a small conference scene, the video is often only collected by one or two cameras. The audio is generally only collected by one microphone. The acquisition angle and position of the camera and the microphone are also preset.

Summary of the invention

The following is an overview of the topics detailed in this document. This Summary is not intended to limit the scope of the claims.

The audio and video collection approach in the small conference scene described above can only ensure that people at specific locations are within the preset image acquisition and sound collection range. If other attendees want to speak, the image acquisition may fail to capture the speaker, or the collected sound may be unclear. In such cases, the angle and position of the camera and microphone can only be adjusted manually so that the speaker is within the preset image acquisition and sound collection range.

The present disclosure provides an audio and video collection method, apparatus, and terminal device for conference television, which can locate a speaker based on the collected sound and automatically move to a preset position for collecting the speaker's audio and video.

An embodiment of the present disclosure provides an audio and video collection method for a conference television, where the method includes:

obtaining audio data obtained by an audio and video collection device through sound collection, and locating, according to the audio data, the audio and video source position of the person speaking in the conference site;

moving, according to the audio and video source position, the audio and video collection device to a sound collection position that satisfies a preset sound collection condition; and

moving, according to the audio and video source position and the sound collection position, the audio and video collection device to an image acquisition position that satisfies a preset image acquisition condition.

An embodiment of the present disclosure further provides an audio and video collection apparatus for conference television, where the apparatus includes:

a positioning module configured to obtain audio data obtained by the audio and video collection device through sound collection, and to locate, according to the audio data, the audio and video source position of the person speaking in the conference site;

a sound collection position moving module configured to move, according to the audio and video source position, the audio and video collection device to a sound collection position that satisfies a preset sound collection condition; and

an image acquisition position moving module configured to move, according to the audio and video source position and the sound collection position, the audio and video collection device to an image acquisition position that satisfies a preset image acquisition condition.

Embodiments of the present disclosure also provide a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:

obtaining audio data obtained through sound collection, and locating, according to the audio data, the audio and video source position of the person speaking in the conference site; and

moving, according to the audio and video source position, the audio and video collection device to a sound collection position that satisfies a preset sound collection condition.

An embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the audio and video collection method for conference television described above.

According to the audio and video collection method, apparatus, and terminal device for conference television of the embodiments of the present disclosure, audio data obtained through sound collection is acquired, and the audio and video source position of the person speaking in the conference site is located according to the audio data; the audio and video collection device is moved, according to the audio and video source position, to a sound collection position that satisfies a preset sound collection condition; and the audio and video collection device is moved, according to the audio and video source position and the sound collection position, to an image acquisition position that satisfies a preset image acquisition condition. Compared with audio and video collection methods for conference television in the prior art, the method provided by the embodiments of the present disclosure combines sound recognition, sound collection, and image acquisition with unmanned technology. The movement of the audio and video collection device is configured once, when the conference television system is deployed. Afterwards, the speaker is located from the collected voice, and the audio and video collection device automatically approaches the speaker without excessive manual intervention by the participants, adjusting itself to a sound collection position that satisfies the preset sound collection condition and an image acquisition position that satisfies the preset image acquisition condition. The audio and video collection position is thus adjusted automatically to achieve the preset collection effect, which reduces the labor cost of conference television and improves the efficiency of video conferencing.

Other aspects can be understood upon reading and understanding the drawings and the detailed description.

Brief description of the drawings

FIG. 1 is a flowchart of a method for collecting audio and video of a conference television according to an embodiment of the present disclosure;

FIGS. 2(a) to 2(d) are schematic diagrams of audio and video collection using the audio and video collection method for conference television in a small conference, according to an alternative embodiment of the present disclosure;

FIGS. 3(a) to 3(f) are schematic diagrams of audio and video collection using the audio and video collection method for conference television in a large conference, according to an alternative embodiment of the present disclosure;

FIG. 4 is a block diagram of a program module of an audio and video collection device for a conference television according to an embodiment of the present disclosure.

Preferred embodiment of the present disclosure

Embodiments of the present disclosure will be described below with reference to the accompanying drawings.

An audio and video collection method for conference television provided by an embodiment of the present disclosure may be applied to a conference television system, and the conference television system may include:

(1) Display device. The display device, such as a display screen, is configured to display the audio and video of a conference site at the local end or the remote end of the video conference;

(2) Video terminal equipment. The video terminal device, such as a computer host, is configured to receive and output the conference television audio and video collected by the audio and video collection device, and to control and manage the movement of the audio and video collection device;

(3) Mobile audio and video collection equipment. The audio and video collection device carries video and audio capture components and can be implemented using unmanned mobile technology, for example as a mobile device whose principle is the same as or similar to that of a drone, so that the audio and video collection device can hover in midair and move through the air.

The mobile audio and video collection device can be connected to the video terminal device using a wireless technology such as Wireless Fidelity (WIFI) or Bluetooth, and the audio and video image data it collects can be transmitted to the video terminal device via technologies such as WIFI or Bluetooth.

The audio and video collection device can collect the sound data and video image data of the conference site using its video and audio capture components and send the data to the video terminal device. The video terminal device can receive the conference site audio and video data collected by the audio and video collection device, process the data, and output the processed video and audio through the display device.

Referring to FIG. 1, which is a flowchart of a method for collecting audio and video of a conference television according to an embodiment of the present disclosure, the method may include:

S100: Acquire audio data obtained by the audio and video collection device through sound collection, and locate, according to the audio data, the audio and video source position of the person speaking in the conference site.

Optionally, before step S100, the method further includes:

At the beginning of the conference, the position of the audio and video capture device is moved to a preset initial position of the conference site.

Optionally, when the conference television system is deployed, the preset initial position of the audio and video collection device in the conference site is configured in advance.

For a small conference site that requires only one audio and video collection device, the device, which carries the video and audio capture components, can be configured at the central location of the site, that is, in the middle of the conference site. When the conference site has a conference table, the table is generally placed in the middle of the site, so the conference table can be used as the reference coordinate and the audio and video collection device can be positioned at the middle of the conference table.

For a large conference site, a single audio and video collection device can only cover a certain range; for example, each device can cover an area 6 meters in diameter, that is, a coverage radius of 3 meters. Multiple audio and video collection devices may therefore be required, with an appropriate number arranged according to the size of the site so that the devices together cover the entire site. Starting from a certain position, for example a position close to the podium, the audio and video collection devices are arranged one by one at intervals of the preset coverage diameter, for example every 6 meters, until the entire site is covered.
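As an illustration only, the arrangement described above can be sketched in Python. The rectangular site, the grid layout, the origin at the site corner nearest the podium, and the name `layout_devices` are assumptions introduced for this example; the disclosure only states that devices are placed one by one at intervals of the coverage diameter.

```python
import math


def layout_devices(site_length_m, site_width_m, coverage_diameter_m=6.0):
    """Return (x, y) hover positions spaced one coverage diameter apart.

    Assumption: the site is a rectangle and (0, 0) is the corner nearest
    the podium, so the first device sits closest to the podium.
    """
    cols = math.ceil(site_length_m / coverage_diameter_m)
    rows = math.ceil(site_width_m / coverage_diameter_m)
    radius = coverage_diameter_m / 2
    return [
        (radius + c * coverage_diameter_m, radius + r * coverage_diameter_m)
        for r in range(rows)
        for c in range(cols)
    ]


# An 18 m x 12 m site needs a 3 x 2 grid of devices at a 6 m pitch.
positions = layout_devices(18.0, 12.0)
print(len(positions), positions[0])
```

Circles of radius 3 meters placed on a 6 meter grid leave small uncovered areas at the corners of each grid cell, so a practical deployment might choose a tighter pitch than the coverage diameter.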

When the video conference is not performed, the audio and video collection device can be placed in an appropriate position in the conference site according to the site situation. In particular, when a large conference site has a plurality of the audio and video collection devices, when the video conference is not performed, the plurality of audio and video collection devices may be placed in a proper position in the conference site.

When the video conference starts, the conference television system can be activated to enable the terminal device, and the audio and video collection device can automatically move to its preset initial position in the conference site. For example, in a small conference site with only one audio and video collection device, after receiving the instruction of the conference television system, the device may trigger an initial instruction to move to the middle of the conference table, so that it starts up and hovers above the middle of the conference table. In a large conference site with a plurality of audio and video collection devices, after the conference television terminal of the site is turned on and each device is activated, each device can move to the corresponding preset position that was configured when the conference television system was deployed, so that the entire conference site is within the coverage of the audio and video collection devices.

The audio and video collection device can then start collecting sound to obtain audio data, and can locate the audio and video source position of the speech according to the audio data, that is, perform sound localization.

S200: Move a position of the audio and video collection device to a sound collection position that satisfies a sound collection preset condition according to the audio and video source position.

Optionally, after the audio and video collection device detects the position of the speaker in the conference site through sound localization, the device may automatically move to a preset sound collection position that satisfies the preset sound collection condition, approaching the speaking position so that it is better placed for sound collection. The preset sound collection position may generally be an arc-shaped area 1 to 1.5 meters from the speaking position; that is, after locating the speaking position, the audio and video collection device can automatically move into an arc 1 to 1.5 meters from the speaking position.
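The move into the 1 to 1.5 meter band can be illustrated with a minimal Python sketch. It assumes two-dimensional coordinates in meters and a straight-line approach toward the speaker; the function name `sound_collection_target` and the fallback used when the device is exactly at the speaker are hypothetical and not part of the disclosure.

```python
import math


def sound_collection_target(device_xy, speaker_xy, min_dist=1.0, max_dist=1.5):
    """Return the point the device should move to so that it ends up
    between min_dist and max_dist meters from the speaker."""
    dx = device_xy[0] - speaker_xy[0]
    dy = device_xy[1] - speaker_xy[1]
    dist = math.hypot(dx, dy)
    if min_dist <= dist <= max_dist:
        return device_xy  # already inside the band, no move needed
    if dist == 0.0:
        # Assumption: directly at the speaker, back off along the x axis.
        return (speaker_xy[0] + min_dist, speaker_xy[1])
    # Too far: stop at the outer edge. Too close: retreat to the inner edge.
    target = max_dist if dist > max_dist else min_dist
    scale = target / dist
    return (speaker_xy[0] + dx * scale, speaker_xy[1] + dy * scale)


# A device 4 m away stops 1.5 m from the speaker on the same bearing.
print(sound_collection_target((4.0, 0.0), (0.0, 0.0)))
```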

For example, in a small conference site, because the audio and video collection device can move, generally only one device is required. The conference table of a small conference site is generally less than 5 meters long, and one audio and video collection device can cover a diameter of 6 meters. Although the conference table is within the coverage of the device, the collected audio and video are not necessarily the best, so the device can be dynamically repositioned according to the speaking position. The device is thereby placed at a suitable position relative to the speaker, that is, at the preset position for audio and video collection, so that the collected audio and video reach the preset clarity. This clarity is generally achieved by moving the collection position.

After the audio and video collection device collects the audio data of the conference site, the audio data may be analyzed to determine whether it is below a preset audio data threshold, and to determine the distance and direction of the speaking position. When the audio data is below the preset audio threshold, it may be determined that the audio and video collection device is too far from the speaker, so that the collected audio is too weak, and the device may be moved in the direction of the speech to approach the speaker.

When it is determined that the position of the audio and video collection device needs to be adjusted, the audio collected by the device can be brought to the audio preset condition by moving the device.

For example, suppose the voice of the current speaker obtained by the analysis is below a preset sound threshold and the site has two sound sources, A and B. If the audio and video collection device detects that the current sound source A is below 600 Hz (hertz) while the other sound source B is above 700 Hz and at 25 decibels or more, the self-adjusting system can lock onto the position of sound source B, and the audio and video collection device can automatically move to the position determined for sound source B, so that the audio collected by the device reaches the audio preset condition. The audio preset condition may mean that the audio level in decibels is higher than a certain preset threshold, so that the collected conference site sound is sufficiently clear.
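A hedged Python sketch of the source selection in this example follows. The thresholds of 700 Hz and 25 decibels come from the text above; the `SoundSource` record, the function name `pick_source_to_lock`, and the rule of preferring the loudest qualifying source are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SoundSource:
    name: str
    position: tuple      # (x, y) in meters, hypothetical coordinate frame
    frequency_hz: float
    level_db: float


def pick_source_to_lock(sources, min_freq_hz=700.0, min_level_db=25.0):
    """Return the source to lock onto, or None if no source qualifies."""
    candidates = [
        s for s in sources
        if s.frequency_hz > min_freq_hz and s.level_db >= min_level_db
    ]
    if not candidates:
        return None
    # Assumption: when several sources qualify, prefer the loudest one.
    return max(candidates, key=lambda s: s.level_db)


source_a = SoundSource("A", (1.0, 2.0), frequency_hz=550.0, level_db=30.0)
source_b = SoundSource("B", (4.0, 1.0), frequency_hz=750.0, level_db=28.0)
locked = pick_source_to_lock([source_a, source_b])
print(locked.name, locked.position)
```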

When the audio and video collection device determines that its position does not need to be adjusted, the current position may be taken by default as the preset position for sound collection.

S300: Move the position of the audio and video collection device to an image collection location that satisfies an image acquisition preset condition according to the audio and video source location and the sound collection location.

After the audio and video collection device has moved to the preset sound collection position and the sound collection effect is satisfied, the device can acquire video image data of the site, the video image data can be processed, and the processed video image data can be displayed through the display device.

The audio and video collection device can acquire the video image data of the site at the current preset sound collection position and determine whether the video image captured there satisfies the preset video image condition, that is, whether the captured video image is clear enough. When the captured video image does not satisfy the preset video image condition, the device may determine, according to the speaking position and the sound collection position, the image acquisition position that satisfies the preset image acquisition condition, and move to that position so that the captured video image reaches the preset condition. When the captured video image satisfies the preset video image condition, the device may take the current position by default as its preset video image capture position.

Optionally, when the audio and video collection device determines that the video image capture position needs to be adjusted, for example because the definition or resolution of the obtained video image of the speaker is below a certain preset video image threshold, the video self-adjusting system may determine the preset position for video image capture. The device then moves, within the preset sound collection range, to the preset video capture point relative to the speaking position, close to the speaker, so that the captured video reaches the preset video image condition. The preset video image condition may be that the video definition or the video resolution is higher than a certain preset threshold, or, for example, that the captured face image is in the middle of the video frame, so that the captured conference site image is sufficiently clear.

In the video self-adjusting system, the audio and video collection device may determine the current arc-shaped position range according to the speaker's sound (for example, above 700 Hz and at 25 decibels or more), and determine the best point on the current arc according to the best-framing principle for the face image detected by the video capture sensor (face within the middle 2/5 to 3/5 of the image, nose on the central axis). The moving direction and moving distance of the audio and video collection device are then obtained from the confirmed best point and the current video image of the speaker received from the conference site.
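The framing principle above can be illustrated with a short Python check. This sketch interprets "middle 2/5 to 3/5" as a band of the image width in which the face center must fall, and allows a small pixel tolerance for the nose; both the interpretation and the names `face_framing_offset` and `nose_tol_px` are assumptions, and face or nose detection itself is outside the sketch.

```python
def face_framing_offset(frame_width_px, face_center_x_px, nose_x_px, nose_tol_px=10):
    """Return (framed_ok, offset_px).

    offset_px is the horizontal distance of the nose from the central axis;
    a positive value means the nose is to the right of the axis.
    """
    lower = frame_width_px * 2 / 5
    upper = frame_width_px * 3 / 5
    axis = frame_width_px / 2
    face_ok = lower <= face_center_x_px <= upper
    nose_ok = abs(nose_x_px - axis) <= nose_tol_px
    return face_ok and nose_ok, nose_x_px - axis


# Well framed: face centered, nose 5 px right of the axis.
print(face_framing_offset(1000, 500, 505))
# Badly framed: the device would need to move along the arc to re-center.
print(face_framing_offset(1000, 800, 800))
```

The sign and size of the returned offset could then be used to choose the direction and distance of the move along the arc, which is how the text describes deriving the movement from the current video image.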

When the audio and video collection device determines that its image acquisition position does not need to be adjusted, the current position is taken by default as the preset optimal position for capturing video images of the speaking position.

Optionally, after the step of moving the position of the audio and video collection device to the preset initial position of the site, the method further includes:

When the conference collects audio and video in a preset speaking order, the audio and video source position of the speech is received, and the audio and video collection device is moved, according to the audio and video source position, to the sound collection position that satisfies the preset sound collection condition and the image acquisition position that satisfies the preset image acquisition condition.

An executive meeting is generally conducted according to a fixed speaking order and fixed speaking times; that is, the speaking order and the speaking time of each speaking position are preset. During such a meeting, the audio and video of the video conference may therefore also be collected in the speaking order. According to the preset order of speaking positions and the preset speaking time of each position, the video conference system may be configured to automatically select the positions for sound and image collection following the speaker sequence. The video terminal device may deliver the position of the current speaker in advance, and the audio and video collection device may move according to the coordinates sent by the video terminal device; that is, the video terminal device may directly instruct the audio and video collection device to go to the current speaking position, so that the device moves to the corresponding speaking position at the preset time.
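A minimal Python sketch of the scheduled mode described above is given below. The schedule format (a list of speaking positions with durations in seconds) and the function name `scheduled_position` are assumptions; the disclosure only states that the speaking order and times are preset and that the video terminal device sends the coordinates.

```python
def scheduled_position(schedule, elapsed_s):
    """Return the speaking position that is active at elapsed_s seconds.

    schedule is a list of (speaker_position, duration_s) in speaking order.
    Returns None once the schedule has run out.
    """
    start = 0.0
    for position, duration_s in schedule:
        if start <= elapsed_s < start + duration_s:
            return position
        start += duration_s
    return None


# Two speakers: the first talks for 60 s, the second for 120 s.
schedule = [((1.0, 2.0), 60.0), ((3.0, 4.0), 120.0)]
print(scheduled_position(schedule, 30.0))
print(scheduled_position(schedule, 60.0))
```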

Optionally, after the step of moving the position of the audio and video collection device to the image collection location that meets the image acquisition preset condition, according to the audio and video source location and the sound collection location, the method further includes:

When the audio data is not received within the preset time threshold, the position of the audio and video capture device is moved back to the preset initial position.

Optionally, when no audio data is received from the audio and video collection device within a preset time threshold, for example within 30 seconds, it may be judged that the current speaker has finished, and the audio and video collection device may be moved back to the preset initial position. For example, in a small conference site with only one audio and video collection device, when the speaker finishes, the device can automatically move back to the middle of the conference site, that is, the middle of the conference table, so that when another person speaks, the device can move to the vicinity of that speaking position to collect audio and video.
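The timeout rule can be sketched in a few lines of Python. The 30 second threshold is the example value from the text; the function name `next_target` and the use of plain timestamps in seconds are assumptions for illustration.

```python
def next_target(current_target, initial_position, last_audio_time_s, now_s, timeout_s=30.0):
    """Return the preset initial position once no audio has been received
    for timeout_s seconds; otherwise keep the current target."""
    if now_s - last_audio_time_s >= timeout_s:
        return initial_position
    return current_target


# 31 s of silence sends the device back to the middle of the table.
print(next_target((5.0, 5.0), (0.0, 0.0), last_audio_time_s=100.0, now_s=131.0))
```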

Since the embodiment of the present disclosure adopts a movable audio and video collection device, the audio and video collection device can be suspended in midair, so the audio and video collection device can be moved to an upper middle position of the venue.

Optionally, when the video conference ends, the audio and video capture device may receive a video conference system shutdown command, and the audio and video capture device may automatically move to a pre-designated location of the conference site and place it at a location in the conference site. In particular, when a large conference site has a plurality of the audio and video collection devices, when the video conference is not performed, the plurality of audio and video collection devices can be placed in a free place in the conference site.

Optionally, the step of moving to a sound collection location that meets a sound collection preset condition according to the audio and video source location further includes:

When it is detected that at least two people are speaking at the same time, a geometric shape is formed with the audio and video source positions of the simultaneous speakers as its vertices, a position within a preset threshold range of the center of the geometric shape is taken as the target position, and the audio and video collection device is moved to the target position.

Optionally, when it is detected that at least two people are speaking at the same time, a target position that takes into account every speaking position may be obtained from the positions of the simultaneous speakers, and the audio and video collection device may move to the target position.

Specifically, when it is detected that at least two people speak, the position of the speaker may be located according to the sound, and a target position may be determined according to the position of the speaker, and the target position may simultaneously take into account each of the speaking positions of the speaking. And moving to the target location, so that the audio and video device can take care of each of the speakers when collecting audio and video.

For example, if the speaker is interrupted by another person while speaking, or at least two people need to speak interactively, the audio and video collection device can receive audio data from at least two people. The device may locate the speakers from the sound and determine an intermediate position from their locations, and may automatically move to a preset collection position among the speakers. This preset intermediate collection position is optionally within a preset threshold range of the center of the polygon whose vertices are the speakers; that is, a geometric shape is formed with the audio and video source positions of the simultaneous speakers as vertices, a position within the preset threshold range of the center of that shape is taken as the target position, and the device is moved to the target position. For example, when two people speak at the same time, the target is within the preset threshold range of the midpoint between the two people; when three people speak at the same time, it is within the preset threshold range of the center of the triangle. Because the center is a single point whereas the audio and video collection position is a range, any position within the preset threshold range of the center of the geometric shape serves as the audio and video collection position.
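The target position for simultaneous speakers can be illustrated with the following Python sketch. It takes the center to be the average of the speaker positions, which gives the midpoint for two speakers and the centroid for a triangle; for polygons with more vertices this average may differ from the area centroid, and the disclosure does not say which is intended. The names `speakers_center` and `within_threshold` and the 0.5 meter default are assumptions.

```python
import math


def speakers_center(speaker_positions):
    """Return the average of the speaker positions as the shape's center."""
    if not speaker_positions:
        raise ValueError("at least one speaker position is required")
    n = len(speaker_positions)
    cx = sum(x for x, _ in speaker_positions) / n
    cy = sum(y for _, y in speaker_positions) / n
    return (cx, cy)


def within_threshold(device_xy, center_xy, threshold_m=0.5):
    """True if the device is inside the preset range around the center."""
    return math.hypot(device_xy[0] - center_xy[0],
                      device_xy[1] - center_xy[1]) <= threshold_m


two = speakers_center([(0.0, 0.0), (2.0, 0.0)])                 # midpoint
three = speakers_center([(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)])   # triangle center
print(two, three, within_threshold((1.2, 1.0), three))
```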

Optionally, the method further includes: receiving an instruction to adjust the audio and video collection position, and moving the audio and video collection device according to the instruction.

When the audio and video collection device receives an instruction to adjust an audio and video collection position, the audio and video collection device may move the position according to the instruction, so that the audio and video collected by the audio and video collection device reaches an audio preset condition and Video image preset condition.

For example, when the video conference is displayed at the local end or the remote end, if the local or remote end finds that the video or audio is not clear enough, the local end or the remote end can control the audio and video collection device, for example by using an infrared remote control, and send it an adjustment instruction. When the audio and video collection device receives the instruction to adjust its position, it may move according to the instruction, so that the collected audio reaches the audio preset condition or the video image reaches the video image preset condition.

Optionally, the method further includes: when at least two of the audio and video collection devices are included, one of the audio and video collection devices is selected as the main audio and video collection device, and the other audio and video collection devices in the conference are controlled by the main audio and video collection device.

When at least two audio and video collection devices are included, one of them may be selected as the main audio and video collection device and the others serve as slave audio and video collection devices, which the main device coordinates. The main audio and video collection device can collect the audio and video of the site and coordinate the slave devices, while the slave devices are responsible for moving to good or optimal video and audio collection positions in the site and can serve as the devices that collect the audio and video.

In general, the audio and video collection device near the rostrum can be the main audio and video collection device, and the other devices can be slave audio and video collection devices. The main device can control the slave devices so that they reach their collection ranges, and can switch which device's output is transmitted to the conference television terminal. After a slave audio and video collection device has adjusted its position, the audio and video it collects can be transmitted through the main audio and video collection device to the conference television terminal.

Optionally, in a large conference site with multiple audio and video collection devices, one of them may be set as the main audio and video collection device and the others as slave audio and video collection devices. Data communication between the slave audio and video collection devices and the video terminal device is implemented through the main audio and video collection device. Optionally, the device placed close to the podium is set as the main audio and video collection device, and the others can be slave audio and video collection devices.

Optionally, after the video terminal device in a large conference site is turned on, the preset initial positions may be delivered by the main audio and video collection device, and each audio and video collection device can move to its preset initial position in the conference site according to the position delivered by the main audio and video collection device.

Optionally, in a large conference site, the main audio and video collection device may coordinate a plurality of slave audio and video collection devices. Each slave audio and video collection device may be responsible for improving or even optimizing its own video and audio collection position, while obeying the final position assigned by the main audio and video collection device. The main audio and video collection device may select the audio and video collected by a single slave device, or synthesize the audio and video collected by several slave devices, as one audio and video source for input, and may select only the current speaker as the sound source.

For example, the video terminal device receives and processes the collected audio and video image data. If the local or remote end finds that the video or audio is not clear enough, it may feed back a request to adjust the audio and video collection device. When the video terminal device in a large conference site receives the instruction for adjusting the position of an audio and video collection device, it may send the instruction to the main audio and video collection device, which then adjusts the position of the slave audio and video collection device.

Optionally, when the audio data is not received within a preset time threshold, the position of the audio and video collection device can be moved back to the preset initial position. If it is in a large conference site, the position of the audio and video collection device can be uniformly scheduled by the main audio and video collection device.

Optionally, when the local end or the remote end sends an instruction to control a slave audio and video collection device, the video terminal device likewise sends the instruction to the main audio and video collection device, and the main audio and video collection device then controls the slave audio and video collection device.
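The relaying of adjustment instructions through the main audio and video collection device can be illustrated with a minimal Python sketch. The class names `Device` and `MainDevice`, the method `handle_adjust`, and the use of two-dimensional floor coordinates are assumptions made for illustration only; the disclosure does not specify an interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Position = Tuple[float, float]  # assumed 2-D floor coordinates


@dataclass
class Device:
    """Hypothetical movable audio and video collection device."""
    device_id: int
    position: Position

    def move_to(self, target: Position) -> None:
        # A real device would drive its motors; here only the result is recorded.
        self.position = target


@dataclass
class MainDevice(Device):
    """Hypothetical main device that relays instructions to slave devices."""
    slaves: Dict[int, Device] = field(default_factory=dict)

    def handle_adjust(self, target_id: int, target: Position) -> bool:
        """Apply an adjustment instruction received from the video terminal device."""
        if target_id == self.device_id:
            self.move_to(target)
            return True
        slave = self.slaves.get(target_id)
        if slave is None:
            return False  # unknown device: nothing is moved
        slave.move_to(target)
        return True
```

In this sketch the video terminal device never addresses a slave device directly; every instruction passes through `MainDevice.handle_adjust`, which mirrors the unified scheduling described above.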

In a large conference site, if the speaker's speech ends, the audio and video collection devices may move back to the preset initial positions of the conference site, with the slave audio and video collection devices uniformly scheduled by the main audio and video collection device.

In a large conference site, if the speech is interrupted by another person, the slave audio and video collection devices may send the position coordinates and sound information they have collected to the main audio and video collection device. After the main audio and video collection device processes the information collected by the slave devices, it schedules each slave audio and video collection device separately according to the processing result.

In a large conference site, when the conference is configured to locate the sound and image in the order of the speakers, the video terminal device may deliver the position of the current speaker to the main audio and video collection device in advance. According to the coordinates sent by the video terminal device, the main audio and video collection device may move itself or direct one or more of the slave audio and video collection devices to move, and may select the audio and video collected by a single device, or synthesize the audio and video collected by several devices, as one audio and video source for input.
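When collection follows the speaking order, the delivery of speaker coordinates by the video terminal device could be sketched as follows. This is only an illustrative sketch: the class `SpeakerSchedule`, the seat names, and the coordinate values are hypothetical, and the disclosure does not state how the speaking order is stored.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional, Tuple

Position = Tuple[float, float]  # assumed 2-D floor coordinates


class SpeakerSchedule:
    """Hypothetical speaking order held by the video terminal device."""

    def __init__(self, seats: Dict[str, Position], order: Iterable[str]) -> None:
        self._seats = seats
        self._queue: Deque[str] = deque(order)

    def next_target(self) -> Optional[Position]:
        """Return the coordinates of the next speaker, or None when the list is exhausted."""
        if not self._queue:
            return None
        return self._seats[self._queue.popleft()]
```

Each returned position would be delivered to the main audio and video collection device, which then moves itself or a slave device as described above.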

The above process is described in detail below with an optional application embodiment.

Before the conference starts, one or more automatic audio and video capture devices can be pre-configured according to the size of the venue. After the conference television system terminal is powered on, a collection position may be preset according to the number of pre-configured automatic audio and video acquisition devices, as shown in FIG. 2(a) and FIG. 3(a), respectively.

(1) Small venue scene

Please refer to FIG. 2(a) to FIG. 2(d), which are schematic diagrams of collection in a small conference site with a single audio and video collection device.

1) Please refer to FIG. 2(a). At the beginning of the conference, the audio and video collection device can be preset in the middle of the conference site. When local speaker 1 speaks at the conference site, the position of speaker 1 can be located by sound, and the audio and video collection device moves close to speaker 1 while adjusting until the image is improved or even optimal, and then sends the image to the video terminal device.

If the video terminal device requests adjustment of the audio and video collection device, the audio and video collection device may adjust its position according to the coordinates sent by the video terminal device. If the video terminal device does not request adjustment, the audio and video collection device may treat its current position as optimal by default, as shown in FIG. 2(b).

2) If two people (speaker 1 and speaker 2) speak at the same time, the audio and video collection device can automatically move, according to the sound collection coordinates, to a preset position for the two speakers. In general, it can face toward the two speakers and lie in an arc-shaped area around the center of the line connecting them.

If the video terminal device requests adjustment of the audio and video collection device, the audio and video collection device may adjust its position according to the coordinates sent by the video terminal device. If the video terminal device does not request adjustment, the audio and video collection device may treat its current position as optimal by default, as shown in FIG. 2(d).

3) If the conference is configured to collect according to the speaker priority principle, the video terminal device may send the position coordinates of the current speaker. The audio and video collection device may move according to the coordinates, automatically fine-tune its position until its own audio and video collection is improved or even optimal, and then send the collected data to the video terminal device. As shown in FIG. 2(b), speaker 1 is currently speaking: the audio and video collection device can move to the position of speaker 1 according to the received coordinates and fine-tune its position. When the next speaker 2 speaks, the video terminal device can send the position coordinates of speaker 2, and the audio and video collection device can move to the collection position for speaker 2 and fine-tune its position, as shown in FIG. 2(c).

4) In a discussion session, the position can be fine-tuned based on the strength of the sound. As shown in FIG. 2(b), when it is detected that speaker 1 currently has the strongest sound, the audio and video collection device can move to an improved or even optimal collection position for speaker 1. If it is detected that the voices of speaker 1 and speaker 2 are equally strong, the device can return to the collection position shown in FIG. 2(d), or the position delivered by the video terminal device can be used as the optimal collection position.

At the end of the discussion, the device returns by default to the position shown in FIG. 2(a).
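The target selection in items 2) and 4) above can be sketched in Python under stated assumptions. Here the geometric center is approximated by the mean of the speaker coordinates (for two speakers this is the midpoint of the line connecting them), sound strength is assumed to be available as a level in decibels, and the function names and the 3 dB tolerance are hypothetical choices that do not come from the disclosure.

```python
from typing import List, Sequence, Tuple

Position = Tuple[float, float]  # assumed 2-D floor coordinates


def geometric_center(points: Sequence[Position]) -> Position:
    """Mean of the given positions (midpoint when there are two speakers)."""
    if not points:
        raise ValueError("at least one speaker position is required")
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def discussion_target(
    speakers: Sequence[Tuple[Position, float]], tolerance_db: float = 3.0
) -> Position:
    """Reference point for the device during a discussion.

    Speakers whose level is within `tolerance_db` of the loudest are treated
    as equally strong; the device aims at the center of those speakers.
    """
    if not speakers:
        raise ValueError("at least one speaker is required")
    loudest = max(level for _, level in speakers)
    active: List[Position] = [
        pos for pos, level in speakers if loudest - level <= tolerance_db
    ]
    return geometric_center(active)
```

The returned point is only a reference. An actual device would stop at some distance from it so that the sound and image preset conditions are still met.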

(2) Large-scale venue scene

Please refer to FIG. 3(a) to FIG. 3(f), which are schematic diagrams of collection in a large conference site with a plurality of audio and video collection devices. The labels in the drawings are abbreviations for the audio and video collection devices.

1) The terminal device is turned on, and the audio and video collection device can be preset at a specific location of the site, as shown in FIG. 3(a).

2) At the beginning of the conference, the audio and video capture device can be preset to the default initial position of the conference site, as shown in Figure 3(b).

When a local speaker speaks, the main audio and video collection device 1 can locate the speaker by analyzing the sound reported by each of the slave audio and video collection devices 2, 3, and 4, and can send an instruction to the nearest one or more slave audio and video collection devices to move close to the speaker. The main audio and video collection device 1 can then adjust one or more audio and video collection devices until the (synthesized or non-synthesized) image and sound are improved or even optimal. The process can include:

(1) As shown in FIG. 3(c), speaker 2 starts to speak. If the slave audio and video collection device 3 collects alone, it can send its data directly to the main audio and video collection device 1, which forwards the data to the video terminal device.

If the video terminal device of the conference television requests adjustment of the slave audio and video collection device 3, the video terminal device may send the coordinates to the main audio and video collection device 1, which forwards them to the slave audio and video collection device 3, and the slave audio and video collection device 3 can adjust its position according to the coordinates sent by the video terminal device. If the video terminal device does not request adjustment of the slave audio and video collection device 3, the slave audio and video collection device 3 can treat its current position as optimal by default.

(2) As shown in FIG. 3(d), speaker 2 starts to speak. If several devices collect cooperatively (the slave audio and video collection device 2 and the slave audio and video collection device 3), the data they collect can be sent to the main audio and video collection device 1 for image and sound synthesis, and the main audio and video collection device 1 can then transmit the synthesized data to the video terminal device of the conference television.

If the video terminal device of the conference television requests adjustment of the slave audio and video collection device 2 or the slave audio and video collection device 3, the video terminal device may send the coordinates to the main audio and video collection device 1, which forwards them to the slave device concerned, and that slave device can adjust its position according to the coordinates sent by the video terminal device. If the video terminal device does not request adjustment of the slave audio and video collection device 2 or the slave audio and video collection device 3, the device concerned can treat its current position as optimal by default.

(3) If several people speak at the same time (speaker 2 and speaker 3), as shown in FIG. 3(c), the main audio and video collection device 1 can instruct the nearest one or more slave audio and video collection devices (for example, the slave audio and video collection device 2, 3, or 4) to move close to the speakers. The main audio and video collection device 1 can adjust these slave devices until the synthesized image and sound are improved or even optimal. The slave devices send their images to the main audio and video collection device 1, which synthesizes the image data and sends the synthesized image data to the video terminal device.

As shown in FIG. 3(d), if the video terminal device requests adjustment of the audio and video collection devices, it may send the coordinates to the main audio and video collection device 1, which uniformly assigns position coordinates to the slave audio and video collection device 2, 3, or 4. If the video terminal device does not request adjustment, the main and slave audio and video collection devices may treat their current positions as optimal by default.

(4) If the conference is configured to collect according to the speaker priority principle, the video terminal device can deliver the position coordinates of the current speaker 1, as shown in FIG. 3(e). The main audio and video collection device 1 can move to the position of the current speaker 1 according to the coordinates. If the main audio and video collection device 1 needs the slave audio and video collection devices 2 and 3 to assist the collection, it can send commands to them. The slave audio and video collection devices 2 and 3 can then automatically adjust their positions until their own audio and video collection is improved or even optimal, and send the collected image and audio data to the main audio and video collection device 1. The main audio and video collection device 1 can synthesize the images, process the audio, and transmit the synthesized image data and the processed audio data to the video terminal device.

If the current speaker 2 then speaks, as shown in FIG. 3(f), the main audio and video collection device 1 can move to the position of the current speaker 2 according to the coordinates. If the main audio and video collection device 1 needs the slave audio and video collection devices 2 and 3 to assist the collection, it can send commands to them. The slave audio and video collection devices 2 and 3 can then automatically adjust their positions until their own audio and video collection is improved or even optimal, and send the collected image and audio data to the main audio and video collection device 1. The main audio and video collection device 1 synthesizes the images, processes the audio, and transmits the synthesized image data and the processed audio data to the video terminal device.

(5) At the end of the video conference, the audio and video capture device can be moved to the predetermined location by default, as shown in Figure 3(a).
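The selection of the nearest one or more slave audio and video collection devices in the large venue scene can be sketched as follows. The function name `nearest_slaves`, the device identifiers, and the use of straight-line distance on two-dimensional coordinates are assumptions for illustration; the disclosure does not define how "nearest" is measured.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float]  # assumed 2-D floor coordinates


def nearest_slaves(
    speaker: Position, slave_positions: Dict[int, Position], count: int = 1
) -> List[int]:
    """Return the ids of the `count` slave devices closest to the speaker."""
    ranked = sorted(
        slave_positions,
        key=lambda device_id: math.dist(slave_positions[device_id], speaker),
    )
    return ranked[:count]
```

With `count=1` this corresponds to collection by a single slave device, and with a larger `count` to cooperative collection whose data the main device synthesizes.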

The audio and video collection method for conference television provided by the embodiments of the present disclosure is based on a movable audio and video collection device implemented with unmanned mobile technology. The speaker is located from the collected sound of the conference speaker, and the audio and video collection device automatically approaches the speaker and automatically adjusts to a sound collection position that meets the sound collection preset condition and an image collection position that meets the image collection preset condition, thereby automatically adjusting the audio and video collection position of the conference television. Compared with audio and video collection methods for conference television in the prior art, the method is based on voice recognition: the sound collection, the image collection, and the movement of the movable audio and video collection device are configured once, after which the conference television system collects automatically and adjusts the collection effect automatically. The audio and video collection can thus reach the preset effect without excessive manual intervention by the participants, which reduces the labor cost of the conference television and improves the efficiency of the video conference.

Referring to FIG. 4, an embodiment of the present disclosure further provides an audio and video collection device for a conference television, where the device may include:

The positioning module 10 is configured to: obtain audio data obtained by the audio and video collection device for sound collection, and locate the audio and video source position of the speaking in the conference according to the audio data.

Optionally, when the video conference starts, the conference television system is started and the terminal device is turned on, and the audio and video collection device can automatically move to a preset initial position in the conference site. Because the audio and video collection device is equipped with video and audio collection functions, it may start to collect audio data and video image data. When the audio and video collection device obtains audio data, it may locate the speaking position according to the audio data, that is, perform sound localization.

The sound collection position moving module 20 is configured to: according to the audio and video source position, move the position of the audio and video collection device to a sound collection position that satisfies a sound collection preset condition.

Optionally, after the audio and video collection device determines the position of the speaker in the conference site from the sound data, it can automatically move toward the speaking position to a sound collection position that satisfies the sound collection preset condition, so that the device is close to the speaking position and in a better position for sound collection.

The image capturing position moving module 30 is configured to: move the position of the audio and video capturing device to an image capturing position that satisfies an image capturing preset condition according to the audio and video source position and the sound collecting position.

Optionally, the audio and video collection device may obtain video image data of the site at the current sound collection position and determine whether the video image captured there satisfies the video image preset condition, that is, whether the captured image is clear enough. When the captured image does not satisfy the video image preset condition, the audio and video collection device may determine, according to the speaking position and the sound collection position, an image collection position that satisfies the image collection preset condition, and move to that position so that the collected video image reaches the preset condition for video image collection.
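The two position-moving modules can be illustrated with a simplified Python sketch. The disclosure does not define the preset conditions, so the sketch assumes, purely for illustration, that the sound collection preset condition is a maximum distance to the source and that the image collection preset condition is a minimum distance needed to frame the speaker. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple

Position = Tuple[float, float]  # assumed 2-D floor coordinates


def point_at_distance(source: Position, toward: Position, distance: float) -> Position:
    """Point on the ray from `source` through `toward`, `distance` away from `source`."""
    span = math.dist(source, toward)
    if span == 0:
        raise ValueError("direction is undefined when both points coincide")
    scale = distance / span
    return (
        source[0] + (toward[0] - source[0]) * scale,
        source[1] + (toward[1] - source[1]) * scale,
    )


def sound_collection_position(
    device: Position, source: Position, max_sound_distance: float
) -> Position:
    """Move toward the source until the assumed sound condition holds."""
    if math.dist(device, source) <= max_sound_distance:
        return device  # already close enough
    return point_at_distance(source, device, max_sound_distance)


def image_collection_position(
    sound_position: Position, source: Position, min_image_distance: float
) -> Position:
    """Back away from the source if the assumed image condition is not met."""
    if math.dist(sound_position, source) >= min_image_distance:
        return sound_position  # framing is already acceptable
    return point_at_distance(source, sound_position, min_image_distance)
```

The image step takes both the source position and the sound collection position as inputs, as in the module description above. The sketch is only consistent when `min_image_distance` does not exceed `max_sound_distance`.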

Optionally, the device further includes:

The location initial module is configured to: when the conference starts, move the position of the audio and video capture device to a preset initial position of the conference site.

Optionally, when the video conference starts, the conference television system is started, and the terminal device is turned on, and the audio and video collection device can be automatically moved to a preset initial position in the conference site.

The position moving back module is configured to: when the audio data is not received within the preset time threshold, move the position of the audio and video collection device to return to the preset initial position.

Optionally, when the audio and video collection device does not receive audio data within a preset time threshold, for example within 30 seconds, the current speaker may be judged to have finished, and the audio and video collection device may move back to the preset initial position.
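The return-on-silence behavior can be sketched in Python as a small timer. The class name `IdleMonitor` and its methods are hypothetical; only the 30-second example threshold comes from the text above. Timestamps are passed in explicitly so that the logic does not depend on a particular clock.

```python
class IdleMonitor:
    """Hypothetical helper that tracks when audio data was last received."""

    def __init__(self, timeout_s: float = 30.0, started_at: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_audio_at = started_at

    def on_audio(self, now: float) -> None:
        """Record that audio data was received at time `now` (in seconds)."""
        self.last_audio_at = now

    def should_return(self, now: float) -> bool:
        """True when no audio has been received for at least the threshold."""
        return now - self.last_audio_at >= self.timeout_s
```

In use, the timestamps could be taken from `time.monotonic()`, and a `True` result from `should_return` would trigger the move back to the preset initial position.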

Embodiments of the present disclosure also provide one or more non-transitory computer readable storage media storing computer-executable instructions, wherein the computer-executable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the method.

Embodiments of the present disclosure also provide a terminal device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.

Embodiments of the present disclosure also provide a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.

One of ordinary skill in the art can understand that all or part of the processes of the above embodiments can be completed by a computer program instructing related hardware, and the program can be stored in a computer readable storage medium. For example, in an embodiment of the present disclosure, the program can be stored in a storage medium of a computer system and executed by at least one processor in the computer system to implement a process comprising the embodiments of the method described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and functional modules/units of the methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components working together. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on a computer readable medium, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Moreover, it is well known to those skilled in the art that communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and can include any information delivery media.

A person skilled in the art can understand that modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the present disclosure, and all such modifications and substitutions should be included in the scope of the claims of the present disclosure.

Industrial applicability

Claims

1. An audio and video collection method for a conference television, the method comprising:

obtaining audio data obtained by an audio and video collection device through sound collection, and locating, according to the audio data, the audio and video source position of speech in the conference site;

moving, according to the audio and video source position, the position of the audio and video collection device to a sound collection position that satisfies a sound collection preset condition; and

moving, according to the audio and video source position and the sound collection position, the position of the audio and video collection device to an image collection position that satisfies an image collection preset condition.

2. The method according to claim 1, wherein before the step of obtaining the audio data obtained through sound collection and locating the audio and video source position of speech in the conference site according to the audio data, the method further comprises:

at the beginning of the conference, moving the position of the audio and video collection device to a preset initial position of the conference site.

3. The method according to claim 2, wherein after the step of moving the position of the audio and video collection device to the preset initial position of the conference site, the method further comprises:

when the conference collects audio and video in the order of speaking, receiving the audio and video source position of the speech, and moving, according to the audio and video source position, the position of the audio and video collection device to a sound collection position that satisfies the sound collection preset condition and an image collection position that satisfies the image collection preset condition.

4. The method according to claim 2, wherein after the step of moving the position of the audio and video collection device to the image collection position that satisfies the image collection preset condition according to the audio and video source position and the sound collection position, the method further comprises:

when no audio data is received within a preset time threshold, moving the position of the audio and video collection device back to the preset initial position.

5. The method according to claim 1, wherein the step of moving the position of the audio and video collection device to the sound collection position that satisfies the sound collection preset condition according to the audio and video source position comprises:

when it is detected that at least two people are speaking at the same time, forming a geometric figure with the audio and video source positions of the simultaneous speeches as vertices, setting a position within a threshold range relative to the geometric center of the figure as a target position, and moving the position of the audio and video collection device to the target position.

6. The method according to claim 1, wherein after the step of moving the position of the audio and video collection device to the image collection position that satisfies the image collection preset condition according to the audio and video source position and the sound collection position, the method further comprises:

receiving an instruction to adjust the audio and video collection position, and moving the audio and video collection device according to the instruction.

7. The method according to any one of claims 1 to 6, further comprising:

when at least two audio and video collection devices are included, selecting one of the audio and video collection devices as the main audio and video collection device, and controlling the other audio and video collection devices in the conference by means of the main audio and video collection device.

8. An audio and video collection device for a conference television, the device comprising:

a positioning module, configured to obtain audio data obtained by an audio and video collection device through sound collection, and to locate, according to the audio data, the audio and video source position of speech in the conference site;

a sound collection position moving module, configured to move, according to the audio and video source position, the position of the audio and video collection device to a sound collection position that satisfies a sound collection preset condition; and

an image capturing position moving module, configured to move, according to the audio and video source position and the sound collection position, the position of the audio and video collection device to an image collection position that satisfies an image collection preset condition.

9. The device according to claim 8, further comprising:

a location initial module, configured to move, when the conference starts, the position of the audio and video collection device to a preset initial position of the conference site; and

a position moving back module, configured to move, when no audio data is received within a preset time threshold, the position of the audio and video collection device back to the preset initial position.

10. A terminal device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.