CN111163281A - Panoramic video recording method and device based on voice tracking

Info

Publication number: CN111163281A
Application number: CN202010021698.0A
Authority: CN
Inventors: 蒋灏; 李虎; 赵成斌; 沈宏泰; 田晟浩; 张小博; 穆永鹏; 戴玉成; 孙洁
Original assignee: Beijing Zhongdian Huisheng Technology Co ltd
Current assignee: Beijing Zhongdian Huisheng Technology Co ltd
Priority date: 2020-01-09
Filing date: 2020-01-09
Publication date: 2020-05-15

Abstract

The invention relates to a panoramic video recording method and device based on voice tracking. Multiple audio channels and multiple video channels are captured, and the video channels are fused and stitched into a panoramic video image. The sound source direction of a live speaker is estimated in real time from the audio signals. A close-up image of the live speaker is cropped from the corresponding position in the panoramic video image according to the sound source direction and combined with the panoramic video image to form a panoramic video output image. The audio signal and the panoramic video output image are uploaded to a host computer over a network or output directly through a monitoring device. The method has a simple flow, automatically generates both the panoramic image and the close-up image, and operates in real time.

Description

Panoramic video recording method and device based on voice tracking

Technical Field

The invention relates to a panoramic video recording method and device based on voice tracking.

Background

In the prior art, most video conferencing equipment for panoramic video has a complex structure, recording and broadcasting of speakers requires manual switching, and panoramic images and close-up images cannot be generated automatically. The prior art closest to the invention is the invention patent entitled Conference transcription system based on a panoramic camera and a microphone array (patent publication No. CN 109474797A). That technical scheme has a complex structure, a complicated flow for generating the panoramic image and the automatic close-up image, and poor real-time performance.

Disclosure of Invention

The invention aims to provide a panoramic video recording method and device based on voice tracking, which can effectively realize automatic generation of panoramic images and close-up images.

Based on the same inventive concept, the invention has two independent technical schemes:

1. A panoramic video recording method based on voice tracking, characterized by comprising the following steps:

Step 1: capturing multi-channel audio signals and multi-channel video signals, and fusing and stitching the multi-channel video signals into a panoramic video image;

Step 2: estimating the sound source direction of a live speaker in real time from the audio signals; cropping a close-up image of the live speaker from the corresponding position in the panoramic video image according to the sound source direction, and combining the close-up image with the panoramic video image to form a panoramic video output image;

Step 3: uploading the audio signal and the panoramic video output image to a host computer over a network, or outputting them directly through a monitoring device.

Further, step 3 further comprises: performing face recognition on the close-up image of the live speaker to identify the speaker; and recognizing the audio signal, converting the speech into text, storing the data, and labeling the data with the speaker's identity.

Further, the multi-channel audio signals are captured by a microphone array, and the multi-channel video signals are captured by multiple video sensors.

Further, the microphone array consists of a plurality of microphones, of which one is located at the center of a circle and the rest are evenly distributed around the circumference;

the multiple video sensors are evenly distributed around the circumference;

the number and positions of the microphones and the video sensors are matched to each other.

Further, step 2 further comprises: enhancing the audio signal in the sound source direction by adaptive beamforming, and suppressing interfering sound from other directions.

Further, in step 2, the sound source direction of the live speaker is estimated in real time by super-resolution spectrum estimation.

Further, step 2 further comprises: determining whether there is a live speaker; and when there is no live speaker, using the panoramic video image obtained in step 1 as the panoramic video output image.

Further, in step 3, the audio signal and the video signal are compressed and then uploaded to the host computer over a network.

2. A panoramic video recording device based on voice tracking, characterized by comprising:

a housing;

a microphone array arranged on the housing and used for capturing multi-channel audio signals;

multiple video sensors arranged on the housing and used for capturing multi-channel video signals; and

an audio/video processing device arranged in the housing, comprising a video processing module, an audio processing module, a video recombination module and an output module, wherein:

the video processing module acquires the video signals captured by the multiple video sensors and performs panoramic fusion and stitching to obtain a panoramic video image;

the audio processing module acquires the multi-channel audio signals captured by the microphone array, calculates the sound source direction of the speaker in real time, enhances the voice signal in the sound source direction and suppresses interfering sound from other directions;

the video recombination module crops a local image at the corresponding position from the panoramic video image according to the sound source direction output by the audio processing module, and combines the panoramic video image and the cropped local image to generate a new image;

and the output module outputs the audio data processed by the audio processing module and the image generated by the video recombination module.

The invention has the following beneficial effects:

The invention captures multi-channel audio signals through a microphone array and multi-channel video signals through multiple video sensors, and fuses and stitches the video signals into a panoramic video image. The sound source direction of a live speaker is estimated in real time from the audio signals. A close-up image of the live speaker is cropped from the corresponding position in the panoramic video image according to the sound source direction and combined with the panoramic video image to form a panoramic video output image. The audio signal and the video signal of the panoramic video output image are transmitted to a host computer over a network or output directly through a monitoring device. The method has a simple flow, automatically generates both the panoramic image and the close-up image, and operates in real time.

The host computer performs face recognition on the close-up image of the live speaker to identify the speaker. It also recognizes the audio signal, converts the speech into text, stores the data, and labels the data with the speaker's identity, which makes on-site record keeping more convenient.

The microphone array consists of a plurality of microphones, of which one is located at the center of a circle and the rest are evenly distributed around the circumference; the multiple video sensors are evenly distributed around the circumference; the number and positions of the microphones and the video sensors are matched to each other. This arrangement of the microphone array and the video sensors supports accurate localization of the sound source direction, and therefore accurate cropping of the close-up image.

The invention uses adaptive beamforming to enhance the audio signal in the sound source direction and suppress interfering sound from other directions, and uses super-resolution spectrum estimation to estimate the sound source direction of the live speaker in real time. Together these safeguard both the accuracy of the estimated sound source direction and the quality of the captured audio of the live speaker.
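The text does not name a specific adaptive beamforming algorithm. As one illustrative possibility only, the sketch below computes narrowband MVDR (minimum variance distortionless response) weights for a 7-microphone array with one centre microphone. The 0.04 m ring radius, the 2 kHz analysis frequency, the source and interferer azimuths, and all function names are assumptions made for the example, not values from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_vector(positions, azimuth_deg, freq_hz):
    """Far-field narrowband steering vector for a planar array (positions in metres)."""
    az = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(az), np.sin(az)])
    delays = positions @ direction / SPEED_OF_SOUND
    return np.exp(2j * np.pi * freq_hz * delays)


def mvdr_weights(cov, steering, diagonal_loading=1e-6):
    """MVDR weights w = R^-1 a / (a^H R^-1 a), with light diagonal loading."""
    n = cov.shape[0]
    loaded = cov + diagonal_loading * np.trace(cov).real / n * np.eye(n)
    r_inv_a = np.linalg.solve(loaded, steering)
    return r_inv_a / (steering.conj() @ r_inv_a)


# Hypothetical geometry: 1 centre microphone plus 6 on a 0.04 m circle.
ring = 2 * np.pi * np.arange(6) / 6
positions = np.vstack([[0.0, 0.0],
                       0.04 * np.stack([np.cos(ring), np.sin(ring)], axis=1)])

freq = 2000.0
a_speaker = steering_vector(positions, 100.0, freq)   # assumed speaker direction
a_interf = steering_vector(positions, 220.0, freq)    # assumed interferer direction

# Interference-plus-noise covariance: one strong interferer plus sensor noise.
cov = 10.0 * np.outer(a_interf, a_interf.conj()) + 0.01 * np.eye(7)

w = mvdr_weights(cov, a_speaker)
speaker_gain = abs(np.vdot(w, a_speaker))  # response toward the speaker
interf_gain = abs(np.vdot(w, a_interf))    # response toward the interferer
```

The distortionless constraint keeps the response toward the speaker at unity while the interferer direction is attenuated. In a real system the covariance would be estimated from the captured audio frames rather than constructed analytically.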

Drawings

FIG. 1 is a flow chart of a panoramic video recording method based on voice tracking according to the present invention;

FIG. 2 is a schematic block diagram of the circuit of the panoramic video recording apparatus based on voice tracking according to the present invention;

FIG. 3 is a functional block diagram of a panoramic video recording apparatus based on voice tracking according to the present invention;

FIG. 4 is a schematic distribution diagram of a microphone array of the present invention;

FIG. 5 is a schematic view of a panoramic video image output of the present invention;

FIG. 6 is an output schematic diagram of a panoramic video image and 1 close-up image of the present invention;

FIG. 7 is an output schematic diagram of a panoramic video image and 2 close-up images of the present invention;

FIG. 8 is a schematic overall appearance diagram of a panoramic video recording device based on voice tracking according to the present invention;

FIG. 9 is a schematic overall appearance diagram of a panoramic video recording device based on voice tracking according to the present invention.

Detailed Description

The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.

Embodiment 1:

Panoramic video recording method based on voice tracking

As shown in FIG. 1, a panoramic video recording method based on voice tracking comprises the following steps:

Step 1: capturing multi-channel audio signals and multi-channel video signals, and fusing and stitching the multi-channel video signals into a panoramic video image.

As shown in FIGS. 4, 8 and 9, the microphone array 1 consists of a plurality of microphones, of which one is located at the center of a circle and the rest are evenly distributed around the circumference. As shown in FIGS. 8 and 9, the multiple video sensors 3 are evenly distributed around the circumference. The number and positions of the microphones and the video sensors are matched to each other. In this embodiment there are 6 video sensors, and the microphone array consists of 7 microphones.
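The layout described above can be sketched numerically. The following minimal example places one microphone at the centre and six on a circle, and gives each of the six video sensors an azimuth. The 0.04 m radius, the choice to align each sensor with a ring microphone, and the function names are assumptions for illustration; the text only states that the numbers and positions are matched.

```python
import math


def circular_array_positions(num_mics=7, radius_m=0.04):
    """Return (x, y) positions: one microphone at the centre, the rest evenly on a circle."""
    positions = [(0.0, 0.0)]
    ring = num_mics - 1
    for k in range(ring):
        angle = 2.0 * math.pi * k / ring
        positions.append((radius_m * math.cos(angle), radius_m * math.sin(angle)))
    return positions


def sensor_azimuths_deg(num_sensors=6):
    """Return the optical-axis azimuth of each video sensor, evenly spaced over 360 degrees."""
    return [360.0 * k / num_sensors for k in range(num_sensors)]


mics = circular_array_positions()   # 7 microphones, as in this embodiment
cams = sensor_azimuths_deg()        # 6 video sensors, as in this embodiment
```

With six ring microphones and six sensors, each sensor azimuth (0, 60, ..., 300 degrees) coincides with one ring microphone in this sketch, which is one way to read the statement that the two distributions are matched.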

Step 2: estimating the sound source direction of a live speaker in real time from the audio signals; cropping a close-up image of the live speaker from the corresponding position in the panoramic video image according to the sound source direction, and combining the close-up image with the panoramic video image to form a panoramic video output image.

The sound source direction of the live speaker is estimated in real time by super-resolution spectrum estimation. The audio signal in the sound source direction is enhanced by adaptive beamforming, and interfering sound from other directions is suppressed.
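The patent does not say which super-resolution spectrum estimator is used. MUSIC is one widely used member of that family, and the sketch below applies a narrowband MUSIC scan to simulated snapshots from a 7-microphone array. The array radius, analysis frequency, noise level, simulated source azimuth and all names are assumptions chosen for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def array_positions(num_mics=7, radius_m=0.04):
    """One microphone at the centre, the rest evenly spaced on a circle."""
    ring = num_mics - 1
    angles = 2 * np.pi * np.arange(ring) / ring
    ring_xy = radius_m * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.vstack([np.zeros((1, 2)), ring_xy])


def steering_vector(positions, azimuth_deg, freq_hz):
    """Far-field narrowband steering vector for a planar array."""
    az = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(az), np.sin(az)])
    delays = positions @ direction / SPEED_OF_SOUND
    return np.exp(2j * np.pi * freq_hz * delays)


def music_azimuth(snapshots, positions, freq_hz, num_sources=1, step_deg=1.0):
    """Scan 0..360 degrees and return the azimuth that maximises the MUSIC pseudo-spectrum."""
    num_mics, num_snaps = snapshots.shape
    cov = snapshots @ snapshots.conj().T / num_snaps
    _, eigvecs = np.linalg.eigh(cov)                 # eigenvalues in ascending order
    noise_subspace = eigvecs[:, : num_mics - num_sources]
    grid = np.arange(0.0, 360.0, step_deg)
    spectrum = np.empty(grid.size)
    for i, az in enumerate(grid):
        a = steering_vector(positions, az, freq_hz)
        proj = noise_subspace.conj().T @ a           # projection onto the noise subspace
        spectrum[i] = 1.0 / max(np.vdot(proj, proj).real, 1e-12)
    return grid[np.argmax(spectrum)], spectrum


# Simulated test signal: one source at 100 degrees plus weak sensor noise.
rng = np.random.default_rng(0)
positions = array_positions()
true_az, freq, n = 100.0, 2000.0, 200
source = rng.standard_normal(n) + 1j * rng.standard_normal(n)
sensor_noise = 0.05 * (rng.standard_normal((7, n)) + 1j * rng.standard_normal((7, n)))
snapshots = np.outer(steering_vector(positions, true_az, freq), source) + sensor_noise

estimated_az, _ = music_azimuth(snapshots, positions, freq)
```

A practical implementation would apply this per frequency bin of a short-time Fourier transform of the microphone signals and combine the results over the speech band; the single-frequency version is kept here for brevity.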

As shown in FIGS. 5 to 7, it is determined whether there is a live speaker. When there is no live speaker (that is, no live sound source is detected), the panoramic video image obtained in step 1 is used as the panoramic video output image. When a live speaker is detected, the live speaker close-up image and the panoramic video image are combined to form the panoramic video output image. There may be 1, 2 or more live speaker close-up images. In the figures, 101A, 101B and 101C denote close-up images of live speakers.
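The cropping and combining logic above can be sketched as follows. The example assumes that panorama columns map linearly to azimuth over 360 degrees and that close-ups are placed in a row above the panorama; both are assumptions, since the actual mapping and the layouts of FIGS. 5 to 7 are not specified in the text. Function names are hypothetical.

```python
import numpy as np


def crop_close_up(panorama, azimuth_deg, crop_width):
    """Crop a window centred on the column for the given azimuth, wrapping at the seam."""
    width = panorama.shape[1]
    center = int(round((azimuth_deg % 360.0) / 360.0 * width)) % width
    cols = (np.arange(crop_width) - crop_width // 2 + center) % width
    return panorama[:, cols]


def compose_output(panorama, speaker_azimuths, crop_width):
    """Return the panorama alone when nobody speaks, else close-ups stacked above it."""
    if not speaker_azimuths:
        return panorama                       # no live speaker: output the panorama as-is
    crops = [crop_close_up(panorama, az, crop_width) for az in speaker_azimuths]
    close_up_row = np.hstack(crops)
    pad = panorama.shape[1] - close_up_row.shape[1]
    if pad < 0:
        raise ValueError("close-ups are wider than the panorama")
    # Pad the close-up row with zeros (black) so both rows have equal width.
    pad_spec = ((0, 0), (0, pad)) + ((0, 0),) * (panorama.ndim - 2)
    close_up_row = np.pad(close_up_row, pad_spec)
    return np.vstack([close_up_row, panorama])


# Toy panorama whose pixel value equals its column index (1 column per degree).
panorama = np.tile(np.arange(360), (4, 1))
no_speaker = compose_output(panorama, [], 10)
two_speakers = compose_output(panorama, [90.0, 270.0], 10)
```

The modulo indexing lets a close-up span the seam where the panorama wraps from 360 degrees back to 0, which matters when a speaker sits near the stitching boundary.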

Step 3: transmitting the audio signal and the video signal of the panoramic video output image to a host computer over a network, or outputting them directly through a monitoring device.

The host computer performs face recognition on the close-up image of the live speaker to identify the speaker; it also recognizes the audio signal, converts the speech into text, stores the data, and labels the data with the speaker's identity. The audio signal and the video signal are compressed and then uploaded to the host computer over the network; alternatively, the video signal is scaled to the resolution of the monitoring device and output. In this embodiment, the audio signal and video signal data are encoded in the H.265 format and sent to the host computer through the network interface using the UDP protocol. The monitoring device is an HDMI device; if the video signal is not scaled, the HDMI device plays the original panoramic video output image directly.
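The text states that encoded data is sent over UDP but gives no packet format. Because a single encoded frame can exceed a safe datagram size, a sender typically splits it into chunks. The sketch below shows one hypothetical framing (frame id, chunk index, chunk count) together with the matching reassembly; the header layout, the 1400-byte payload limit and the function names are all assumptions, and the encoder itself is out of scope.

```python
import socket
import struct

# Hypothetical header: frame id (uint32), chunk index (uint16), chunk count (uint16).
HEADER = struct.Struct("!IHH")
MAX_PAYLOAD = 1400  # assumed limit, chosen to stay under a typical Ethernet MTU


def packetize(frame_id, encoded, max_payload=MAX_PAYLOAD):
    """Split one encoded frame into datagrams, each prefixed with the header."""
    chunks = [encoded[i:i + max_payload]
              for i in range(0, len(encoded), max_payload)] or [b""]
    if len(chunks) > 0xFFFF:
        raise ValueError("frame too large for a 16-bit chunk count")
    return [HEADER.pack(frame_id, idx, len(chunks)) + chunk
            for idx, chunk in enumerate(chunks)]


def reassemble(datagrams):
    """Rebuild a frame from its datagrams in any order; return None if any chunk is missing."""
    parts, expected = {}, None
    for datagram in datagrams:
        _, idx, count = HEADER.unpack_from(datagram)
        parts[idx] = datagram[HEADER.size:]
        expected = count
    if expected is None or len(parts) != expected:
        return None
    return b"".join(parts[i] for i in range(expected))


def send_frame(sock, address, frame_id, encoded):
    """Send every datagram of one frame to the host computer over a UDP socket."""
    for datagram in packetize(frame_id, encoded):
        sock.sendto(datagram, address)


payload = bytes(range(256)) * 20          # stand-in for one encoded frame
packets = packetize(7, payload)
restored = reassemble(list(reversed(packets)))
```

A caller would create the socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and pass the host computer's address to `send_frame`. UDP does not guarantee delivery or ordering, so the receiver in this sketch simply drops incomplete frames.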

Embodiment 2:

Panoramic video recording device based on voice tracking

As shown in FIGS. 8 and 9, the microphone array 1 is arranged on the top of the housing 4 and captures multi-channel audio signals; the multiple video sensors 3 are arranged on the side of the housing 4 and capture multi-channel video signals. The microphone array 1 consists of a plurality of microphones, of which one is located at the center of a circle and the rest are evenly distributed around the circumference; the number and positions of the microphones and the video sensors are matched to each other. In this embodiment there are 6 video sensors, and the microphone array consists of 7 microphones. A light strip 2 is provided on the housing 4 and runs around the circumference of the housing. A tripod 5 or a base 6 is provided below the housing 4, and the wiring is concealed inside the leg tubes, which gives a tidy appearance and good protection.

The audio/video processing device (main board) is arranged in the housing and implements the method described in Embodiment 1. As shown in FIGS. 2 and 3, the audio/video processing device (main board) can be divided into four functional modules: a video processing module, an audio processing module, a video recombination module and an output module. The video processing module acquires the signals captured by the 6 video sensors and performs six-lens panoramic fusion and stitching to obtain a panoramic video image. The audio processing module acquires the 7 audio channels captured by the microphone array, calculates the sound source direction of the speaker in real time by super-resolution spectrum estimation, enhances the voice signal in the speaker direction by adaptive beamforming, and suppresses interfering sound from other directions. The video recombination module crops a local image (the live speaker close-up image) at the corresponding position from the panoramic video according to the sound source direction information output by the audio processing module, and combines, in memory, the panoramic video image and the cropped image into a new image according to the resolution of the external display device. The output module outputs the audio data processed by the audio processing module and the image produced by the video recombination module. It provides two output paths: one compresses and encodes the audio and video signals, with H.265 as the output format, and uploads them to the host computer via a network protocol; the other outputs the audio and video signals directly to the monitoring device through the HDMI interface.

The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

2. The panoramic video recording method based on voice tracking according to claim 1, wherein step 3 further comprises: performing face recognition on the close-up image of the live speaker to identify the speaker; and recognizing the audio signal, converting the speech into text, storing the data, and labeling the data with the speaker's identity.

3. The panoramic video recording method based on voice tracking according to claim 1, wherein the multi-channel audio signals are captured by a microphone array, and the multi-channel video signals are captured by multiple video sensors.

4. The panoramic video recording method based on voice tracking according to claim 3, wherein the microphone array consists of a plurality of microphones, of which one is located at the center of a circle and the rest are evenly distributed around the circumference;

5. The panoramic video recording method based on voice tracking according to claim 1, wherein step 2 further comprises: enhancing the audio signal in the sound source direction by adaptive beamforming, and suppressing interfering sound from other directions.

6. The panoramic video recording method based on voice tracking according to claim 1, wherein in step 2, the sound source direction of the live speaker is estimated in real time by super-resolution spectrum estimation.

7. The panoramic video recording method based on voice tracking according to claim 1, wherein step 2 further comprises: determining whether there is a live speaker; and when there is no live speaker, using the panoramic video image obtained in step 1 as the panoramic video output image.

8. The panoramic video recording method based on voice tracking according to claim 1, wherein in step 3, the audio signal and the video signal are compressed and then uploaded to a host computer over a network.

9. A panoramic video recording device based on voice tracking is characterized by comprising:

a housing;

and the output module outputs the audio data processed by the audio processing module and the image generated by the video recombination module.

10. The panoramic video recording device based on voice tracking according to claim 9, wherein a light strip is further provided on the housing and is arranged around the circumference of the housing.