CN113740803A - Speaker positioning and tracking method and device based on audio and video characteristics - Google Patents

Speaker positioning and tracking method and device based on audio and video characteristics

Info

Publication number
CN113740803A
Authority
CN
China
Prior art keywords
sound source
speaker
microphones
data
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110955505.3A
Other languages
Chinese (zh)
Inventor
戴李
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Chuangbian Information Technology Co ltd
Original Assignee
Anhui Chuangbian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Chuangbian Information Technology Co ltd filed Critical Anhui Chuangbian Information Technology Co ltd
Priority to CN202110955505.3A
Publication of CN113740803A
Legal status: Withdrawn (current)


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a speaker positioning and tracking method and device based on audio and video characteristics. The method comprises: a coarse positioning calculation step, in which position data of a plurality of microphones are received together with the times at which each microphone picks up the same speaker sound, and the position of the sound source is determined from the different microphone positions and the different time delays with which they receive the same sound source; and a quasi-localization tracking step, in which a camera is controlled to capture images of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of the speaker object in the video stream so that the camera image center stays synchronized with the speaker object position. The invention optimizes the distribution structure of the plurality of microphones to achieve a better pickup effect on sound source data, thereby improving the accuracy of the initial audio-based positioning of the speaker, and then exploits the complementarity between the audio information and the video information to obtain an accurate position of the speaker.

Description

Speaker positioning and tracking method and device based on audio and video characteristics
Technical Field
The invention relates to the field of indoor positioning and tracking, in particular to a speaker positioning and tracking method and device based on audio and video characteristics.
Background
In the multi-speaker tracking problem based on audio and video feature fusion in an intelligent environment, the speech signal and the video signal of a speaker are strongly complementary and correlated. The complementarity lies mainly in the fact that audio information is omnidirectional but yields poor positioning accuracy, whereas video information, although limited to the camera's field of view, can provide accurate positioning. In addition, video information is unaffected by acoustic conditions such as background noise and room reverberation, while audio information is independent of the complexity of the visual scene. The correlation is reflected in the relationship between the speaker's voice and lip movement, and between the time delays observed at microphones placed at multiple locations and the location of the speaker's face in the image. Fusing the audio and video signals of the speaker by exploiting their spatio-temporal correlation and complementarity overcomes the shortcomings of either modality alone and effectively improves the accuracy and robustness of a speaker tracking system.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides a speaker positioning and tracking method based on audio and video features, which includes:
a coarse localization calculation step configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones collects the same speaker sound, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the speaker sound and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-localization tracking step configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
As a further optimization of the above solution, the preset microphone array distribution structure is configured such that the plurality of microphones are placed at preset positions on a plurality of spiral lines: a plurality of position points are arranged, uniformly or non-uniformly, on one spiral line; that spiral line, carrying the position points, is duplicated to form the plurality of spiral lines and the position points on them; and the duplicated spiral lines are distributed circumferentially around a fixed point at equal angular intervals.
As a further optimization of the above solution, the plurality of microphones distributed at preset positions on the plurality of spiral lines are configured such that the preset positions are obtained from a parameter matching combination formed by the number n of duplicated spiral lines, the spiral-line equation parameters a_i (i being the number of parameters in the polar-coordinate equation of the spiral line), and the arc-length growth proportion parameter c between two adjacent position points of the position points arranged, uniformly or non-uniformly, on one spiral line.
As a further optimization of the above solution, the parameter matching combination is configured such that an optimal parameter matching combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has an optimal audio effect; the optimal audio effect is configured such that the mixed audio received by the plurality of microphones contains audio at a plurality of frequencies with corresponding amplitudes, wherein the difference between the maximum and minimum frequencies is as large as possible, only one frequency corresponds to the maximum amplitude, and the amplitude at the maximum and minimum frequencies has fallen from the maximum amplitude to a preset value.
As a further optimization of the above solution, the intelligent optimization algorithm is configured to:
(1) sampling preset parameter matching combinations from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(2) establishing a Kriging agent model between the sampled parameter matching combination and the audio effect of the mixed audio received by the plurality of microphones acquired based on the sampled parameter matching combination;
(3) updating the sampled parameter matching combination by adopting an intelligent optimization algorithm, and determining a new audio effect through a Kriging agent model based on the updated parameter matching combination;
(4) sampling new preset parameter matching combinations, according to a preset sampling method, from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(5) repeating steps (2) to (4) until the new audio effect determined by the Kriging agent model based on the updated parameter matching combination in step (3) reaches the preset optimal audio effect.
As a further optimization of the above solution, before the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays of the same sound source, the audio of the same sound source received by the microphones is filtered by a neural network model to remove interference terms, so as to obtain the audio data actually originating from the sound source as received by the microphones.
As a further optimization of the above solution, for the time data at which each microphone receives the same sound source, the associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which the reference microphone receives it is obtained by the following steps:
(1) acquiring the amplitude variation curves of the sound source data received by the reference microphone and of the sound source data received by the non-reference microphone;
(2) sampling discrete points over the whole of each amplitude variation curve, and comparing the sampled discrete points to obtain coarse time delay data of the sound source data received by the non-reference microphone relative to the reference microphone;
(3) intercepting and sampling a local segment of each amplitude variation curve, and comparing the locally intercepted and sampled curve segments to obtain fine time delay data relative to the reference microphone;
(4) obtaining complete and accurate time delay data of the time at which the non-reference microphone receives the sound source relative to the reference microphone, based on the coarse time delay data and the fine time delay data.
The invention also provides a speaker positioning and tracking device based on the audio and video characteristics, which comprises:
a coarse localization module configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones receives the same sound source, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-location tracking module configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source location, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
The present invention also provides an electronic device, including:
a memory for storing executable instructions;
and a processor for implementing the above speaker positioning and tracking method when executing the executable instructions stored in the memory.
The invention also provides a computer readable storage medium storing executable instructions which, when executed by a processor, implement the speaker location tracking method described above.
The speaker positioning and tracking method and device based on the audio and video characteristics have the following beneficial effects:
1. The speaker is initially positioned based on the audio information, and this initial position is then refined into an accurate position through video image acquisition. Fully exploiting the complementarity between the audio and video information of the target further strengthens the robustness of the target tracking system in complex environments and overcomes the limitation of tracking a target with the local information obtained by a single sensor.
2. The distribution structure of the plurality of microphones is optimized, achieving a better pickup effect on sound source data.
3. The specific positions of the microphones within the distribution structure are optimally designed: an optimal parameter matching combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has the optimal audio effect.
drawings
FIG. 1 is a block flow diagram of a speaker location tracking method based on audio-video features of the present invention;
fig. 2 is a flowchart of the method in fig. 1 for acquiring the specific distribution positions of the microphones through the intelligent optimization algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a speaker positioning and tracking method based on audio and video characteristics, which comprises the following steps:
a coarse localization calculation step configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones collects the same speaker sound, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the speaker sound and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source (a worked sketch of this calculation is given below, after the tracking step);
a quasi-localization tracking step configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
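For illustration, a minimal sketch of the coarse localization calculation is given below. It assumes the classical linearized least-squares TDOA formulation with a constant speed of sound; the function name tdoa_localize, the example microphone coordinates and the least-squares form are illustrative assumptions rather than the exact computation of this embodiment.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant

def tdoa_localize(mic_positions, delays, c=SPEED_OF_SOUND):
    """Estimate a sound-source position from microphone positions and the
    time delays of each microphone relative to the reference microphone
    (index 0).  Classical linearized least-squares TDOA solution, used here
    only as an illustrative stand-in for the coarse positioning step.

    mic_positions : (M, 3) array of microphone coordinates in metres
    delays        : (M,) array, delays[i] = t_i - t_0 in seconds (delays[0] == 0)
    """
    mics = np.asarray(mic_positions, dtype=float)
    d = c * np.asarray(delays, dtype=float)            # range differences
    m0 = mics[0]

    # One linear equation per non-reference microphone; unknowns = (x, y, z, r0)
    A, b = [], []
    for mi, di in zip(mics[1:], d[1:]):
        A.append(np.concatenate([2.0 * (mi - m0), [2.0 * di]]))
        b.append(np.dot(mi, mi) - np.dot(m0, m0) - di * di)
    sol, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return sol[:3]                                      # discard the auxiliary r0

if __name__ == "__main__":
    mics = np.array([[0, 0, 0], [0.5, 0, 0], [0, 0.5, 0], [0, 0, 0.5], [0.5, 0.5, 0]])
    source = np.array([2.0, 1.5, 1.0])
    delays = (np.linalg.norm(mics - source, axis=1)
              - np.linalg.norm(mics[0] - source)) / SPEED_OF_SOUND
    print(tdoa_localize(mics, delays))                  # close to [2.0, 1.5, 1.0]
```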
Following the principle by which the human brain fuses information obtained through the visual and auditory organs to reach a conclusion, an initial position of the speaker is obtained from the audio information and is then refined into an accurate position through video image acquisition. Fully exploiting the complementarity between the audio and video information of the target further strengthens the robustness of the target tracking system in complex environments and overcomes the limitation of tracking a target with the local information obtained by a single sensor.
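The quasi-localization tracking step can likewise be illustrated with a small sketch. It assumes a pan-tilt camera with a known horizontal and vertical field of view and a face detector (supplied elsewhere) that returns the speaker's face box in each frame; the function centering_offsets and the small-angle pixel-to-degree conversion are illustrative assumptions, not the control scheme of this embodiment.

```python
def centering_offsets(face_box, frame_w, frame_h, fov_h_deg, fov_v_deg):
    """Return (pan, tilt) corrections in degrees that bring the detected
    speaker face to the image centre.

    face_box : (x, y, w, h) of the detected face in pixels, assumed to be
               produced by any face detector run on each video frame.
    """
    face_cx = face_box[0] + face_box[2] / 2.0
    face_cy = face_box[1] + face_box[3] / 2.0
    dx = face_cx - frame_w / 2.0          # pixel offset from the image centre
    dy = face_cy - frame_h / 2.0
    pan = dx / frame_w * fov_h_deg        # small-angle pixels-to-degrees mapping
    tilt = -dy / frame_h * fov_v_deg      # image y axis grows downwards
    return pan, tilt

# Example: 1920x1080 stream, 60 x 34 degree field of view, face detected off-centre
print(centering_offsets((1200, 400, 160, 160), 1920, 1080, 60.0, 34.0))
```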
The method in which the speaker is preliminarily located by audio data in the present embodiment is described in detail as follows.
Owing to the characteristics of sound wave propagation, microphones at different positions respond differently to the same sound source, and different distribution structures of the microphones therefore give the array as a whole a different response to the sound source. To achieve a better pickup effect on sound source data, the preset microphone array distribution structure is configured as follows: a plurality of microphones are placed at preset positions on a plurality of spiral lines; a plurality of position points are arranged, uniformly or non-uniformly, on one spiral line; that spiral line, carrying the position points, is duplicated to form the plurality of spiral lines and the position points on them; and the duplicated spiral lines are distributed circumferentially around a fixed point at equal angular intervals. This circumferential arrangement of spiral lines makes it convenient to fix the specific microphone positions, provides a noise-reduction effect on the audio data, and improves the clarity of the audio data.
Further, the specific positions of the microphones within the distribution structure are designed so that the preset positions follow a preset rule and are convenient to determine. That is, the microphones distributed at the preset positions on the plurality of spiral lines are configured such that the preset positions are obtained from a parameter matching combination formed by the number n of duplicated spiral lines, the spiral-line equation parameters a_i (i being the number of parameters in the polar-coordinate equation of the spiral line), and the arc-length growth proportion parameter c between two adjacent position points of the position points arranged, uniformly or non-uniformly, on one spiral line. The polar-coordinate equation of the spiral line is of the form
[formula image in the original: polar-coordinate equation of the spiral line, ρ = ρ(θ; a1, a2)]
where ρ represents the distance from a point on the curve to the pole, θ is the angle between the line connecting that point to the pole and the polar axis, and a1 and a2 are constant parameters. Correspondingly, the position of the i-th microphone on one spiral line (with L microphones in total on one spiral line) can be expressed as:
[formula image in the original: position of the i-th microphone in terms of a1, a2, c and i]
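As one possible reading of the above formulation, the preset microphone positions can be generated as in the following sketch. It assumes an Archimedean spiral rho = a1 + a2*theta and interprets c as the ratio by which the arc length between consecutive microphones grows along the spiral; both the spiral form and the arc-length stepping, as well as the function name and default values, are illustrative assumptions.

```python
import numpy as np

def spiral_microphone_positions(n, a1, a2, c, L, s0=0.05, dtheta=1e-3):
    """Generate microphone positions on n copies of an Archimedean spiral
    rho = a1 + a2*theta, rotated about the origin at equal angular spacing.

    L  : number of microphones per spiral line
    s0 : arc length between the first two microphones (metres, assumed)
    c  : growth ratio of the arc length between consecutive microphones
    """
    # Cumulative arc-length targets: 0, s0, s0*(1+c), s0*(1+c+c^2), ...
    steps = s0 * c ** np.arange(L - 1)
    targets = np.concatenate([[0.0], np.cumsum(steps)])

    # Walk along the spiral, accumulating arc length numerically
    thetas, theta, s = [], 0.0, 0.0
    for target in targets:
        while s < target:
            rho = a1 + a2 * theta
            s += np.hypot(rho, a2) * dtheta   # ds = sqrt(rho^2 + (drho/dtheta)^2) dtheta
            theta += dtheta
        thetas.append(theta)
    thetas = np.array(thetas)
    rhos = a1 + a2 * thetas

    base = np.stack([rhos * np.cos(thetas), rhos * np.sin(thetas)], axis=1)
    positions = []
    for k in range(n):                        # replicate at equal angular offsets
        ang = 2.0 * np.pi * k / n
        rot = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
        positions.append(base @ rot.T)
    return np.vstack(positions)               # (n*L, 2) planar coordinates

# Example: 4 spiral lines with 6 microphones each
print(spiral_microphone_positions(n=4, a1=0.02, a2=0.01, c=1.2, L=6).shape)
```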
Furthermore, the specific positions of the microphones within the distribution structure are optimally designed. The parameter matching combination is configured such that an optimal combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has an optimal audio effect. The optimal audio effect is configured such that the mixed audio received by the plurality of microphones contains audio at a plurality of frequencies with corresponding amplitudes, wherein the difference between the maximum and minimum frequencies is as large as possible, only one frequency corresponds to the maximum amplitude, and the amplitude at the maximum and minimum frequencies has fallen from the maximum amplitude to a preset value. Since the audio component at the frequency of maximum amplitude is related to directional positioning, operating distance and anti-interference capability, the preferred design of the microphone distribution structure highlights the audio component at the frequency of maximum amplitude while weakening the components at the remaining frequencies.
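One way to read the "optimal audio effect" criterion above is as a spectral figure of merit: the captured mix should span as wide a frequency range as possible while concentrating its energy in a single dominant peak. The following sketch computes such a score from one frame of mixed audio with an FFT; the thresholds, the weighting and the function name audio_effect_score are assumptions made for illustration, not values from this disclosure.

```python
import numpy as np

def audio_effect_score(mix, fs, amp_floor_ratio=0.1):
    """Heuristic score of a mixed-audio frame: wider spread between the lowest
    and highest significant frequencies, plus a single clearly dominant
    spectral peak, give a higher score (illustrative assumption)."""
    spectrum = np.abs(np.fft.rfft(mix * np.hanning(len(mix))))
    freqs = np.fft.rfftfreq(len(mix), d=1.0 / fs)

    peak = spectrum.max()
    significant = freqs[spectrum >= amp_floor_ratio * peak]
    spread = significant.max() - significant.min() if significant.size else 0.0

    near_peak = np.sum(spectrum >= 0.95 * peak)   # penalise multiple near-maximum bins
    dominance = peak / (spectrum.sum() + 1e-12)
    return spread * dominance / near_peak

# Example: a two-tone test frame sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
frame = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)
print(audio_effect_score(frame, fs))
```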
In this embodiment, the intelligent optimization algorithm is configured as the following steps:
(1) sampling preset parameter matching combinations from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(2) establishing a Kriging agent model between the sampled parameter matching combination and the audio effect of the mixed audio received by the plurality of microphones acquired based on the sampled parameter matching combination;
(3) updating the sampled parameter matching combination by adopting an intelligent optimization algorithm, and determining a new audio effect through a Kriging agent model based on the updated parameter matching combination, wherein the intelligent optimization algorithm is a particle swarm optimization algorithm;
(4) sampling new preset parameter matching combinations, according to a preset sampling method, from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c, wherein this step includes removing, among the newly sampled combinations, those similar to previously sampled parameter matching combinations and keeping only one of any group of similar combinations;
(5) repeating steps (2) to (4) until the new audio effect determined by the Kriging agent model based on the updated parameter matching combination in step (3) reaches the preset optimal audio effect.
In this intelligent optimization search, on the one hand the optimal microphone placement rule is sought through parameter-combination optimization, and on the other hand the Kriging agent (surrogate) model improves the search efficiency and shortens the search time.
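A compact sketch of the surrogate-assisted search in steps (1) to (5) follows. A Gaussian-process regressor stands in for the Kriging agent model (the two coincide under common assumptions), and a very simple particle-swarm update explores the surrogate; the objective audio_effect that scores a candidate parameter combination is assumed to be supplied by the caller (for example by simulating or measuring the microphone mix), and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def surrogate_assisted_search(audio_effect, bounds, n_init=20, n_iter=15,
                              swarm=30, seed=0):
    """Search the parameter space (e.g. [n, a1, a2, c]) for the combination
    with the best measured audio effect, using a Kriging-style surrogate.

    audio_effect : callable mapping a parameter vector to a scalar score
                   (higher is better); assumed to be provided by the caller.
    bounds       : (D, 2) array of lower/upper bounds per parameter.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, hi = bounds[:, 0], bounds[:, 1]

    # (1) initial sampling of parameter matching combinations
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))
    y = np.array([audio_effect(x) for x in X])

    for _ in range(n_iter):
        # (2) fit the surrogate between sampled combinations and audio effect
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)

        # (3) particle-swarm style update evaluated on the cheap surrogate
        particles = rng.uniform(lo, hi, size=(swarm, len(lo)))
        velocity = np.zeros_like(particles)
        best = X[np.argmax(y)]
        for _ in range(25):
            leader = particles[np.argmax(gp.predict(particles))]
            velocity = (0.5 * velocity
                        + 0.8 * rng.random(particles.shape) * (leader - particles)
                        + 0.8 * rng.random(particles.shape) * (best - particles))
            particles = np.clip(particles + velocity, lo, hi)

        # (4) evaluate the most promising new combination on the real objective
        candidate = particles[np.argmax(gp.predict(particles))]
        X = np.vstack([X, candidate])
        y = np.append(y, audio_effect(candidate))

    # (5) best combination found when the loop budget is exhausted
    return X[np.argmax(y)], y.max()
```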
To better locate the speaker from the responses of the multiple microphones to the speaker's voice, before the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays of the same sound source, the audio of the same sound source received by the microphones is filtered by a neural network model to remove interference terms and obtain the audio data actually originating from the sound source. The interference terms are removed by training a neural network model, and the trained network model is used as an audio-data filter for the interference terms. In this neural network model, the network parameters are corrected during back propagation as follows:
The error function between the actual output and the predicted output is calculated as

E(x) = (1/2) Σ_i e_i^2(x),

where e_i(x) is the error value of the i-th output and x denotes the network weight parameters. The candidate weight parameters for the next iteration are obtained from a damped update of the form

x^(k+1) = x^(k) - [J^T(x^(k)) J(x^(k)) + μI]^(-1) J^T(x^(k)) e(x^(k)),

where J is the Jacobian of the error vector e(x) with respect to the weights and μ is the damping factor. For the k-th iteration, the error function E(x^(k)) is obtained by forward propagation with the network model parameters of the k-th iteration. If E(x^(k)) is smaller than the preset error threshold, model training is finished. Otherwise, the weight parameters of the network model are corrected with x^(k+1), the (k+1)-th forward propagation is performed to obtain the error function E(x^(k+1)), and it is determined whether E(x^(k+1)) is smaller than E(x^(k)). If so, the correction of the weight parameters is effective: x^(k+1) is taken as the network model parameters after the (k+1)-th iteration, μ is decreased (μ = μ/β), x^(k+2) is calculated and used to correct the weight parameters, and the (k+2)-th forward propagation is performed. Otherwise, the correction of the weight parameters is ineffective: x^(k) is still taken as the network model parameters x^(k+1) after the (k+1)-th iteration, μ is increased (μ = μβ), x^(k+2) is calculated and used to correct the weight parameters, and the (k+2)-th forward propagation is performed.
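The accept/reject logic of the weight correction above (decrease μ by β when the new error is smaller, otherwise restore the old weights and increase μ) is a Levenberg-Marquardt-style damping schedule. The sketch below shows only that control flow; the function computing a candidate weight update from the current damping factor is assumed to be supplied elsewhere and is left abstract.

```python
def train_with_damping(weights, error_of, candidate_update, mu=1e-2, beta=10.0,
                       tol=1e-6, max_iter=200):
    """Iterate a Levenberg-Marquardt-style damping schedule.

    error_of(w)             -> scalar error E(w) obtained from a forward pass
    candidate_update(w, mu) -> proposed new weights (assumed supplied, e.g.
                               the damped Gauss-Newton step)
    """
    current_error = error_of(weights)
    for _ in range(max_iter):
        if current_error < tol:              # error below threshold: training done
            break
        proposal = candidate_update(weights, mu)
        new_error = error_of(proposal)
        if new_error < current_error:        # effective correction: keep it
            weights, current_error = proposal, new_error
            mu /= beta                       # mu decrease
        else:                                # ineffective: keep the old weights
            mu *= beta                       # mu increase
    return weights

# Toy usage: minimise (w - 3)^2 with a damped step
w = train_with_damping(0.0, lambda w: (w - 3.0) ** 2,
                       lambda w, mu: w - (w - 3.0) / (1.0 + mu))
print(w)   # approaches 3.0
```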
Of course, considering that the interference received together with the speaker audio at a microphone includes both linear and non-linear terms, removal of both linear and non-linear interference terms is taken as the output target when training the neural network; the target data of the training samples may therefore be taken from the output of a filter capable of removing linear and non-linear interference terms.
The neural network filter obtained from this training process avoids mistakenly filtering out the characteristic features of the original audio data when removing interference terms, reduces the loss of original-data features, improves the filtering effect, and effectively suppresses various kinds of noise.
To achieve a high-precision estimate of the delay between the audio data received by each of the multiple microphones and by the reference microphone, this embodiment first performs a low-precision coarse estimate based on the global data and then a high-precision fine estimate based on local data. Specifically, for the time data at which each microphone receives the same sound source, the associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which the reference microphone receives it is obtained by the following steps:
(1) acquiring the amplitude variation curves of the sound source data received by the reference microphone and of the sound source data received by the non-reference microphone;
(2) sampling discrete points over the whole of each amplitude variation curve, and comparing the sampled discrete points to obtain coarse time delay data of the sound source data received by the non-reference microphone relative to the reference microphone;
(3) intercepting and sampling a local segment of each amplitude variation curve, and comparing the locally intercepted and sampled curve segments to obtain fine time delay data relative to the reference microphone;
(4) obtaining complete and accurate time delay data of the time at which the non-reference microphone receives the sound source relative to the reference microphone, based on the coarse time delay data and the fine time delay data.
In both the coarse and the fine stage, the delay between the audio to be analysed and the audio of the reference microphone can be calculated with the generalized cross-correlation, taking the peak of the generalized cross-correlation as the delay estimate. For the locally intercepted and sampled audio data of the microphone under analysis and of the reference microphone, high-precision delay data are obtained by interpolating and magnifying the generalized cross-correlation curve to raise its resolution and locating the precise peak on the resolution-enhanced curve as the delay value. Combining the coarse and fine delay data reduces the number of data-point pairs involved in the convolution of the cross-correlation computation and shortens the time required for delay estimation.
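The coarse-to-fine delay estimation described above can be sketched as follows. A generalized cross-correlation with PHAT weighting (one common weighting, assumed here) gives a coarse integer-sample delay, and parabolic interpolation around the correlation peak, standing in for the resolution-raising interpolation of the correlation curve mentioned above, refines it to sub-sample precision.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Coarse + fine time delay (seconds) of `sig` relative to `ref`.

    Coarse step: peak of the PHAT-weighted generalized cross-correlation.
    Fine step  : parabolic interpolation around that peak (a stand-in for the
                 interpolation/magnification of the correlation curve).
    """
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.abs(S) + 1e-12                        # PHAT weighting
    cc = np.fft.irfft(S, n)
    cc = np.concatenate([cc[-(len(ref) - 1):], cc[:len(sig)]])   # centre zero lag

    peak = np.argmax(cc)                          # coarse, integer-sample delay
    if 0 < peak < len(cc) - 1:                    # fine, sub-sample refinement
        y0, y1, y2 = cc[peak - 1], cc[peak], cc[peak + 1]
        peak = peak + 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2 + 1e-12)
    return (peak - (len(ref) - 1)) / fs

# Example: the same chirp delayed by 23 samples at 16 kHz
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
ref = np.sin(2 * np.pi * (200 + 4000 * t) * t)
sig = np.concatenate([np.zeros(23), ref])[:len(ref)]
print(gcc_phat_delay(sig, ref, fs) * fs)          # close to 23 samples
```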
This embodiment also provides a speaker location tracking device based on audio and video characteristics, including:
a coarse localization module configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones receives the same sound source, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-location tracking module configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source location, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
Specific definitions of the speaker positioning and tracking device can be found in the definitions of the speaker positioning and tracking method above and will not be repeated here. The modules in the above speaker positioning and tracking device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device so that the processor can call and execute the operations corresponding to each module.
The present embodiment also provides an electronic device, including:
a memory for storing executable instructions;
and a processor for implementing the above speaker positioning and tracking method when executing the executable instructions stored in the memory.
The present embodiment also provides a computer-readable storage medium storing executable instructions that when executed by a processor implement the speaker location tracking method described above.
The electronic device provided by this embodiment comprises a processor, a memory and a network interface connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device holds an operating system, a computer program and a database, and provides an environment for the operating system and the computer program to run. The database is used for storing the received audio and video data and the like. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the speaker positioning and tracking method based on audio and video features.
It will be appreciated that the memory can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. Non-volatile Memory may include Read Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The present invention is not limited to the above-described embodiments, and those skilled in the art will be able to make various modifications without creative efforts from the above-described conception, and fall within the scope of the present invention.

Claims (10)

1. A speaker positioning and tracking method based on audio and video features is characterized by comprising the following steps:
a coarse localization calculation step configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones collects the same speaker sound, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the speaker sound and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-localization tracking step configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
2. The method as claimed in claim 1, wherein the preset microphone array distribution structure is configured such that a plurality of microphones are distributed at preset positions on a plurality of spiral lines, the preset positions being configured such that a plurality of position points are arranged uniformly or non-uniformly on one spiral line, the spiral line carrying the position points is duplicated to form the plurality of spiral lines and the position points on them, and the duplicated spiral lines are distributed circumferentially around a fixed point at equal angular intervals.
3. The method as claimed in claim 2, wherein the microphones distributed at preset positions on the plurality of spiral lines are configured such that the preset positions of the plurality of microphones are obtained from a parameter matching combination formed by the number n of duplicated spiral lines, the spiral-line equation parameters a_i (i being the number of parameters in the polar-coordinate equation of the spiral line), and the arc-length growth proportion parameter c between two adjacent position points of the position points arranged uniformly or non-uniformly on one spiral line.
4. The method as claimed in claim 3, wherein the parameter matching combination is configured such that an optimal parameter matching combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has an optimal audio effect, the optimal audio effect being configured such that the mixed audio received by the plurality of microphones contains audio at a plurality of frequencies with corresponding amplitudes, wherein the difference between the maximum and minimum frequencies is as large as possible, only one frequency corresponds to the maximum amplitude, and the amplitude at the maximum and minimum frequencies has fallen from the maximum amplitude to a preset value.
5. The method for locating and tracking a speaker based on audio-video features of claim 3, wherein the intelligent optimization algorithm is configured to:
(1) sampling preset parameter matching combinations from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(2) establishing a Kriging agent model between the sampled parameter matching combination and the audio effect of the mixed audio received by the plurality of microphones acquired based on the sampled parameter matching combination;
(3) updating the sampled parameter matching combination by adopting an intelligent optimization algorithm, and determining a new audio effect through a Kriging agent model based on the updated parameter matching combination;
(4) sampling new preset parameter matching combinations, according to a preset sampling method, from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(5) repeating steps (2) to (4) until the new audio effect determined by the Kriging agent model based on the updated parameter matching combination in step (3) reaches the preset optimal audio effect.
6. The method as claimed in claim 1, wherein, before the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays of the same sound source, the audio of the same sound source received by the microphones is filtered by a neural network model to remove interference terms, so as to obtain the audio data actually originating from the sound source as received by the microphones.
7. A speaker localization tracking method based on audio-video features according to claim 1, wherein, for the time data at which each microphone receives the same sound source, the associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which the reference microphone receives it is obtained by the following steps:
(1) acquiring the amplitude variation curves of the sound source data received by the reference microphone and of the sound source data received by the non-reference microphone;
(2) sampling discrete points over the whole of each amplitude variation curve, and comparing the sampled discrete points to obtain coarse time delay data of the sound source data received by the non-reference microphone relative to the reference microphone;
(3) intercepting and sampling a local segment of each amplitude variation curve, and comparing the locally intercepted and sampled curve segments to obtain fine time delay data relative to the reference microphone;
(4) obtaining complete and accurate time delay data of the time at which the non-reference microphone receives the sound source relative to the reference microphone, based on the coarse time delay data and the fine time delay data.
8. A speaker location tracking device based on audio-video features, comprising:
a coarse localization module configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones receives the same sound source, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-location tracking module configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source location, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
9. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the speaker location tracking method of any one of claims 1 to 7 when executing the executable instructions stored by the memory.
10. A computer readable storage medium storing executable instructions, wherein the executable instructions when executed by a processor implement the speaker location tracking method of any one of claims 1 to 7.
CN202110955505.3A 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics Withdrawn CN113740803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110955505.3A CN113740803A (en) 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110955505.3A CN113740803A (en) 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics

Publications (1)

Publication Number Publication Date
CN113740803A true CN113740803A (en) 2021-12-03

Family

ID=78731813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110955505.3A Withdrawn CN113740803A (en) 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics

Country Status (1)

Country Link
CN (1) CN113740803A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114205731A (en) * 2021-12-08 2022-03-18 随锐科技集团股份有限公司 Speaker area detection method, device, electronic equipment and storage medium
CN114205731B (en) * 2021-12-08 2023-12-26 随锐科技集团股份有限公司 Speaker area detection method, speaker area detection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7158806B2 (en) Audio recognition methods, methods of locating target audio, their apparatus, and devices and computer programs
CN107534725B (en) Voice signal processing method and device
CN109506568B (en) Sound source positioning method and device based on image recognition and voice recognition
CN106782584B (en) Audio signal processing device, method and electronic device
JP3962063B2 (en) System and method for improving accuracy of localization estimation
CN110610718B (en) Method and device for extracting expected sound source voice signal
WO2021128670A1 (en) Noise reduction method, device, electronic apparatus and readable storage medium
CN108109617A (en) A kind of remote pickup method
KR20110102466A (en) Estimating a sound source location using particle filtering
CN110495185B (en) Voice signal processing method and device
CN111445920A (en) Multi-sound-source voice signal real-time separation method and device and sound pick-up
CN115331692A (en) Noise reduction method, electronic device and storage medium
CN112614508A (en) Audio and video combined positioning method and device, electronic equipment and storage medium
CN113740803A (en) Speaker positioning and tracking method and device based on audio and video characteristics
CN111627456B (en) Noise elimination method, device, equipment and readable storage medium
Hosseini et al. Time difference of arrival estimation of sound source using cross correlation and modified maximum likelihood weighting function
CN111540365B (en) Voice signal determination method, device, server and storage medium
CN110927668A (en) Sound source positioning optimization method of cube microphone array based on particle swarm
CN113409800A (en) Processing method and device for monitoring audio, storage medium and electronic equipment
CN113223552B (en) Speech enhancement method, device, apparatus, storage medium, and program
JP6908142B1 (en) Sound collecting device, sound collecting program, and sound collecting method
CN111239691B (en) Multi-sound-source tracking method for restraining main sound source
WO2020186434A1 (en) Flexible differential microphone arrays with fractional order
WO2023088156A1 (en) Sound velocity correction method and apparatus
Wang et al. A robust generalized sidelobe canceller controlled by a priori sir estimate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20211203