CN113740803A - Speaker positioning and tracking method and device based on audio and video characteristics - Google Patents

Speaker positioning and tracking method and device based on audio and video characteristics

Info

Publication number
CN113740803A
Authority
CN
China
Prior art keywords
sound source
speaker
microphones
data
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110955505.3A
Other languages
Chinese (zh)
Inventor
戴李
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Chuangbian Information Technology Co ltd
Original Assignee
Anhui Chuangbian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Chuangbian Information Technology Co ltd filed Critical Anhui Chuangbian Information Technology Co ltd
Priority to CN202110955505.3A
Publication of CN113740803A
Legal status: Withdrawn (current)


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a speaker positioning and tracking method and device based on audio and video characteristics. The method comprises: a coarse positioning calculation step, in which position data of a plurality of microphones are received together with the times at which each microphone picks up the same speaker sound, and the position of the sound source is determined from the different microphone positions and the different time delays with which they receive the same sound source; and a quasi-localization tracking step, in which a camera is controlled to capture images of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of the speaker object in the video stream so that the camera image center stays synchronized with the speaker object position. The invention optimizes the distribution structure of the plurality of microphones to achieve a better pickup effect on sound source data, thereby improving the accuracy of the initial audio-based positioning of the speaker, and then exploits the complementarity between the audio information and the video information to obtain an accurate position of the speaker.

Description

Speaker positioning and tracking method and device based on audio and video characteristics
Technical Field
The invention relates to the field of indoor positioning and tracking, in particular to a speaker positioning and tracking method and device based on audio and video characteristics.
Background
In the multi-speaker tracking problem based on audio and video feature fusion in an intelligent environment, the speech signal and the video signal of a speaker are strongly complementary and correlated. The complementarity lies mainly in the fact that audio information is omnidirectional but yields poor positioning accuracy, whereas video information, although limited to the camera's field of view, can provide accurate positioning. In addition, video information is unaffected by acoustic conditions such as background noise and room reverberation, while audio information is independent of the complexity of the visual scene. The correlation is reflected in the relationship between the speaker's voice and lip movement, and between the time delays observed at microphones placed at multiple locations and the location of the speaker's face in the image. Fusing the audio and video signals of the speaker by exploiting their spatio-temporal correlation and complementarity overcomes the shortcomings of either modality alone and effectively improves the accuracy and robustness of a speaker tracking system.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides a speaker positioning and tracking method based on audio and video features, which includes:
a coarse localization calculation step configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones collects the same speaker sound, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the speaker sound and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-localization tracking step configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
As a further optimization of the above solution, the preset microphone array distribution structure is configured such that the plurality of microphones are placed at preset positions on a plurality of spiral lines: a plurality of position points are arranged, uniformly or non-uniformly, on one spiral line; that spiral line, carrying the position points, is duplicated to form the plurality of spiral lines and the position points on them; and the duplicated spiral lines are distributed circumferentially around a fixed point at equal angular intervals.
As a further optimization of the above solution, the plurality of microphones distributed at preset positions on the plurality of spiral lines are configured such that the preset positions are obtained from a parameter matching combination formed by the number n of duplicated spiral lines, the spiral-line equation parameters a_i (i being the number of parameters in the polar-coordinate equation of the spiral line), and the arc-length growth proportion parameter c between two adjacent position points of the position points arranged, uniformly or non-uniformly, on one spiral line.
As a further optimization of the above solution, the parameter matching combination is configured such that an optimal parameter matching combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has an optimal audio effect; the optimal audio effect is configured such that the mixed audio received by the plurality of microphones contains audio at a plurality of frequencies with corresponding amplitudes, wherein the difference between the maximum and minimum frequencies is as large as possible, only one frequency corresponds to the maximum amplitude, and the amplitude at the maximum and minimum frequencies has fallen from the maximum amplitude to a preset value.
As a further optimization of the above solution, the intelligent optimization algorithm is configured to:
(1) sampling preset parameter matching combinations from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(2) establishing a Kriging agent model between the sampled parameter matching combination and the audio effect of the mixed audio received by the plurality of microphones acquired based on the sampled parameter matching combination;
(3) updating the sampled parameter matching combination by adopting an intelligent optimization algorithm, and determining a new audio effect through a Kriging agent model based on the updated parameter matching combination;
(4) sampling new preset parameter matching combinations, according to a preset sampling method, from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(5) repeating steps (2) to (4) until the new audio effect determined by the Kriging agent model based on the updated parameter matching combination in step (3) reaches the preset optimal audio effect.
As a further optimization of the above solution, before the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays of the same sound source, the audio of the same sound source received by the microphones is filtered by a neural network model to remove interference terms, so as to obtain the audio data actually originating from the sound source as received by the microphones.
As a further optimization of the above solution, for the time data at which each microphone receives the same sound source, the associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which the reference microphone receives it is obtained by the following steps:
(1) acquiring the amplitude variation curves of the sound source data received by the reference microphone and of the sound source data received by the non-reference microphone;
(2) sampling discrete points over the whole of each amplitude variation curve, and comparing the sampled discrete points to obtain coarse time delay data of the sound source data received by the non-reference microphone relative to the reference microphone;
(3) intercepting and sampling a local segment of each amplitude variation curve, and comparing the locally intercepted and sampled curve segments to obtain fine time delay data relative to the reference microphone;
(4) obtaining complete and accurate time delay data of the time at which the non-reference microphone receives the sound source relative to the reference microphone, based on the coarse time delay data and the fine time delay data.
The invention also provides a speaker positioning and tracking device based on the audio and video characteristics, which comprises:
a coarse localization module configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones receives the same sound source, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-location tracking module configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source location, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
The present invention also provides an electronic device, including:
a memory for storing executable instructions;
and a processor for implementing the above speaker positioning and tracking method when executing the executable instructions stored in the memory.
The invention also provides a computer readable storage medium storing executable instructions which, when executed by a processor, implement the speaker location tracking method described above.
The speaker positioning and tracking method and device based on the audio and video characteristics have the following beneficial effects:
1. The speaker is initially positioned based on the audio information, and this initial position is then refined into an accurate position through video image acquisition. Fully exploiting the complementarity between the audio and video information of the target further strengthens the robustness of the target tracking system in complex environments and overcomes the limitation of tracking a target with the local information obtained by a single sensor.
2. The distribution structure of the plurality of microphones is optimized, achieving a better pickup effect on sound source data.
3. The specific positions of the microphones within the distribution structure are optimally designed: an optimal parameter matching combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has the optimal audio effect.
drawings
FIG. 1 is a block flow diagram of a speaker location tracking method based on audio-video features of the present invention;
fig. 2 is a flowchart of the method in fig. 1 for acquiring the specific distribution positions of the microphones through the intelligent optimization algorithm.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a speaker positioning and tracking method based on audio and video characteristics, which comprises the following steps:
a coarse localization calculation step configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones collects the same speaker sound, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the speaker sound and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source (a worked sketch of this calculation is given below, after the tracking step);
a quasi-localization tracking step configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
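For illustration, a minimal sketch of the coarse localization calculation is given below. It assumes the classical linearized least-squares TDOA formulation with a constant speed of sound; the function name tdoa_localize, the example microphone coordinates and the least-squares form are illustrative assumptions rather than the exact computation of this embodiment.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant

def tdoa_localize(mic_positions, delays, c=SPEED_OF_SOUND):
    """Estimate a sound-source position from microphone positions and the
    time delays of each microphone relative to the reference microphone
    (index 0).  Classical linearized least-squares TDOA solution, used here
    only as an illustrative stand-in for the coarse positioning step.

    mic_positions : (M, 3) array of microphone coordinates in metres
    delays        : (M,) array, delays[i] = t_i - t_0 in seconds (delays[0] == 0)
    """
    mics = np.asarray(mic_positions, dtype=float)
    d = c * np.asarray(delays, dtype=float)            # range differences
    m0 = mics[0]

    # One linear equation per non-reference microphone; unknowns = (x, y, z, r0)
    A, b = [], []
    for mi, di in zip(mics[1:], d[1:]):
        A.append(np.concatenate([2.0 * (mi - m0), [2.0 * di]]))
        b.append(np.dot(mi, mi) - np.dot(m0, m0) - di * di)
    sol, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return sol[:3]                                      # discard the auxiliary r0

if __name__ == "__main__":
    mics = np.array([[0, 0, 0], [0.5, 0, 0], [0, 0.5, 0], [0, 0, 0.5], [0.5, 0.5, 0]])
    source = np.array([2.0, 1.5, 1.0])
    delays = (np.linalg.norm(mics - source, axis=1)
              - np.linalg.norm(mics[0] - source)) / SPEED_OF_SOUND
    print(tdoa_localize(mics, delays))                  # close to [2.0, 1.5, 1.0]
```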
Following the principle by which the human brain fuses information obtained through the visual and auditory organs to reach a conclusion, an initial position of the speaker is obtained from the audio information and is then refined into an accurate position through video image acquisition. Fully exploiting the complementarity between the audio and video information of the target further strengthens the robustness of the target tracking system in complex environments and overcomes the limitation of tracking a target with the local information obtained by a single sensor.
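The quasi-localization tracking step can likewise be illustrated with a small sketch. It assumes a pan-tilt camera with a known horizontal and vertical field of view and a face detector (supplied elsewhere) that returns the speaker's face box in each frame; the function centering_offsets and the small-angle pixel-to-degree conversion are illustrative assumptions, not the control scheme of this embodiment.

```python
def centering_offsets(face_box, frame_w, frame_h, fov_h_deg, fov_v_deg):
    """Return (pan, tilt) corrections in degrees that bring the detected
    speaker face to the image centre.

    face_box : (x, y, w, h) of the detected face in pixels, assumed to be
               produced by any face detector run on each video frame.
    """
    face_cx = face_box[0] + face_box[2] / 2.0
    face_cy = face_box[1] + face_box[3] / 2.0
    dx = face_cx - frame_w / 2.0          # pixel offset from the image centre
    dy = face_cy - frame_h / 2.0
    pan = dx / frame_w * fov_h_deg        # small-angle pixels-to-degrees mapping
    tilt = -dy / frame_h * fov_v_deg      # image y axis grows downwards
    return pan, tilt

# Example: 1920x1080 stream, 60 x 34 degree field of view, face detected off-centre
print(centering_offsets((1200, 400, 160, 160), 1920, 1080, 60.0, 34.0))
```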
The method in which the speaker is preliminarily located by audio data in the present embodiment is described in detail as follows.
Owing to the characteristics of sound wave propagation, microphones at different positions respond differently to the same sound source, and different distribution structures of the microphones therefore give the array as a whole a different response to the sound source. To achieve a better pickup effect on sound source data, the preset microphone array distribution structure is configured as follows: a plurality of microphones are placed at preset positions on a plurality of spiral lines; a plurality of position points are arranged, uniformly or non-uniformly, on one spiral line; that spiral line, carrying the position points, is duplicated to form the plurality of spiral lines and the position points on them; and the duplicated spiral lines are distributed circumferentially around a fixed point at equal angular intervals. This circumferential arrangement of spiral lines makes it convenient to fix the specific microphone positions, provides a noise-reduction effect on the audio data, and improves the clarity of the audio data.
Further, the specific positions of the microphones within the distribution structure are designed so that the preset positions follow a preset rule and are convenient to determine. That is, the microphones distributed at the preset positions on the plurality of spiral lines are configured such that the preset positions are obtained from a parameter matching combination formed by the number n of duplicated spiral lines, the spiral-line equation parameters a_i (i being the number of parameters in the polar-coordinate equation of the spiral line), and the arc-length growth proportion parameter c between two adjacent position points of the position points arranged, uniformly or non-uniformly, on one spiral line. The polar-coordinate equation of the spiral line is of the form
[formula image in the original: polar-coordinate equation of the spiral line, ρ = ρ(θ; a1, a2)]
where ρ represents the distance from a point on the curve to the pole, θ is the angle between the line connecting that point to the pole and the polar axis, and a1 and a2 are constant parameters. Correspondingly, the position of the i-th microphone on one spiral line (with L microphones in total on one spiral line) can be expressed as:
[formula image in the original: position of the i-th microphone in terms of a1, a2, c and i]
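As one possible reading of the above formulation, the preset microphone positions can be generated as in the following sketch. It assumes an Archimedean spiral rho = a1 + a2*theta and interprets c as the ratio by which the arc length between consecutive microphones grows along the spiral; both the spiral form and the arc-length stepping, as well as the function name and default values, are illustrative assumptions.

```python
import numpy as np

def spiral_microphone_positions(n, a1, a2, c, L, s0=0.05, dtheta=1e-3):
    """Generate microphone positions on n copies of an Archimedean spiral
    rho = a1 + a2*theta, rotated about the origin at equal angular spacing.

    L  : number of microphones per spiral line
    s0 : arc length between the first two microphones (metres, assumed)
    c  : growth ratio of the arc length between consecutive microphones
    """
    # Cumulative arc-length targets: 0, s0, s0*(1+c), s0*(1+c+c^2), ...
    steps = s0 * c ** np.arange(L - 1)
    targets = np.concatenate([[0.0], np.cumsum(steps)])

    # Walk along the spiral, accumulating arc length numerically
    thetas, theta, s = [], 0.0, 0.0
    for target in targets:
        while s < target:
            rho = a1 + a2 * theta
            s += np.hypot(rho, a2) * dtheta   # ds = sqrt(rho^2 + (drho/dtheta)^2) dtheta
            theta += dtheta
        thetas.append(theta)
    thetas = np.array(thetas)
    rhos = a1 + a2 * thetas

    base = np.stack([rhos * np.cos(thetas), rhos * np.sin(thetas)], axis=1)
    positions = []
    for k in range(n):                        # replicate at equal angular offsets
        ang = 2.0 * np.pi * k / n
        rot = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
        positions.append(base @ rot.T)
    return np.vstack(positions)               # (n*L, 2) planar coordinates

# Example: 4 spiral lines with 6 microphones each
print(spiral_microphone_positions(n=4, a1=0.02, a2=0.01, c=1.2, L=6).shape)
```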
Furthermore, the specific positions of the microphones within the distribution structure are optimally designed. The parameter matching combination is configured such that an optimal combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has an optimal audio effect. The optimal audio effect is configured such that the mixed audio received by the plurality of microphones contains audio at a plurality of frequencies with corresponding amplitudes, wherein the difference between the maximum and minimum frequencies is as large as possible, only one frequency corresponds to the maximum amplitude, and the amplitude at the maximum and minimum frequencies has fallen from the maximum amplitude to a preset value. Since the audio component at the frequency of maximum amplitude is related to directional positioning, operating distance and anti-interference capability, the preferred design of the microphone distribution structure highlights the audio component at the frequency of maximum amplitude while weakening the components at the remaining frequencies.
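One way to read the "optimal audio effect" criterion above is as a spectral figure of merit: the captured mix should span as wide a frequency range as possible while concentrating its energy in a single dominant peak. The following sketch computes such a score from one frame of mixed audio with an FFT; the thresholds, the weighting and the function name audio_effect_score are assumptions made for illustration, not values from this disclosure.

```python
import numpy as np

def audio_effect_score(mix, fs, amp_floor_ratio=0.1):
    """Heuristic score of a mixed-audio frame: wider spread between the lowest
    and highest significant frequencies, plus a single clearly dominant
    spectral peak, give a higher score (illustrative assumption)."""
    spectrum = np.abs(np.fft.rfft(mix * np.hanning(len(mix))))
    freqs = np.fft.rfftfreq(len(mix), d=1.0 / fs)

    peak = spectrum.max()
    significant = freqs[spectrum >= amp_floor_ratio * peak]
    spread = significant.max() - significant.min() if significant.size else 0.0

    near_peak = np.sum(spectrum >= 0.95 * peak)   # penalise multiple near-maximum bins
    dominance = peak / (spectrum.sum() + 1e-12)
    return spread * dominance / near_peak

# Example: a two-tone test frame sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
frame = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 3000 * t)
print(audio_effect_score(frame, fs))
```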
In this embodiment, the intelligent optimization algorithm is configured as the following steps:
(1) sampling preset parameter matching combinations from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(2) establishing a Kriging agent model between the sampled parameter matching combination and the audio effect of the mixed audio received by the plurality of microphones acquired based on the sampled parameter matching combination;
(3) updating the sampled parameter matching combination by adopting an intelligent optimization algorithm, and determining a new audio effect through a Kriging agent model based on the updated parameter matching combination, wherein the intelligent optimization algorithm is a particle swarm optimization algorithm;
(4) sampling new preset parameter matching combinations, according to a preset sampling method, from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c, wherein this step includes removing, among the newly sampled combinations, those similar to previously sampled parameter matching combinations and keeping only one of any group of similar combinations;
(5) repeating steps (2) to (4) until the new audio effect determined by the Kriging agent model based on the updated parameter matching combination in step (3) reaches the preset optimal audio effect.
In this intelligent optimization search, on the one hand the optimal microphone placement rule is sought through parameter-combination optimization, and on the other hand the Kriging agent (surrogate) model improves the search efficiency and shortens the search time.
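A compact sketch of the surrogate-assisted search in steps (1) to (5) follows. A Gaussian-process regressor stands in for the Kriging agent model (the two coincide under common assumptions), and a very simple particle-swarm update explores the surrogate; the objective audio_effect that scores a candidate parameter combination is assumed to be supplied by the caller (for example by simulating or measuring the microphone mix), and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def surrogate_assisted_search(audio_effect, bounds, n_init=20, n_iter=15,
                              swarm=30, seed=0):
    """Search the parameter space (e.g. [n, a1, a2, c]) for the combination
    with the best measured audio effect, using a Kriging-style surrogate.

    audio_effect : callable mapping a parameter vector to a scalar score
                   (higher is better); assumed to be provided by the caller.
    bounds       : (D, 2) array of lower/upper bounds per parameter.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    lo, hi = bounds[:, 0], bounds[:, 1]

    # (1) initial sampling of parameter matching combinations
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))
    y = np.array([audio_effect(x) for x in X])

    for _ in range(n_iter):
        # (2) fit the surrogate between sampled combinations and audio effect
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)

        # (3) particle-swarm style update evaluated on the cheap surrogate
        particles = rng.uniform(lo, hi, size=(swarm, len(lo)))
        velocity = np.zeros_like(particles)
        best = X[np.argmax(y)]
        for _ in range(25):
            leader = particles[np.argmax(gp.predict(particles))]
            velocity = (0.5 * velocity
                        + 0.8 * rng.random(particles.shape) * (leader - particles)
                        + 0.8 * rng.random(particles.shape) * (best - particles))
            particles = np.clip(particles + velocity, lo, hi)

        # (4) evaluate the most promising new combination on the real objective
        candidate = particles[np.argmax(gp.predict(particles))]
        X = np.vstack([X, candidate])
        y = np.append(y, audio_effect(candidate))

    # (5) best combination found when the loop budget is exhausted
    return X[np.argmax(y)], y.max()
```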
To better locate the speaker from the responses of the multiple microphones to the speaker's voice, before the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays of the same sound source, the audio of the same sound source received by the microphones is filtered by a neural network model to remove interference terms and obtain the audio data actually originating from the sound source. The interference terms are removed by training a neural network model, and the trained network model is used as an audio-data filter for the interference terms. In this neural network model, the network parameters are corrected during back propagation as follows:
The error function between the actual output and the predicted output is calculated as

E(x) = (1/2) Σ_i e_i^2(x),

where e_i(x) is the error value of the i-th output and x denotes the network weight parameters. The candidate weight parameters for the next iteration are obtained from a damped update of the form

x^(k+1) = x^(k) - [J^T(x^(k)) J(x^(k)) + μI]^(-1) J^T(x^(k)) e(x^(k)),

where J is the Jacobian of the error vector e(x) with respect to the weights and μ is the damping factor. For the k-th iteration, the error function E(x^(k)) is obtained by forward propagation with the network model parameters of the k-th iteration. If E(x^(k)) is smaller than the preset error threshold, model training is finished. Otherwise, the weight parameters of the network model are corrected with x^(k+1), the (k+1)-th forward propagation is performed to obtain the error function E(x^(k+1)), and it is determined whether E(x^(k+1)) is smaller than E(x^(k)). If so, the correction of the weight parameters is effective: x^(k+1) is taken as the network model parameters after the (k+1)-th iteration, μ is decreased (μ = μ/β), x^(k+2) is calculated and used to correct the weight parameters, and the (k+2)-th forward propagation is performed. Otherwise, the correction of the weight parameters is ineffective: x^(k) is still taken as the network model parameters x^(k+1) after the (k+1)-th iteration, μ is increased (μ = μβ), x^(k+2) is calculated and used to correct the weight parameters, and the (k+2)-th forward propagation is performed.
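The accept/reject logic of the weight correction above (decrease μ by β when the new error is smaller, otherwise restore the old weights and increase μ) is a Levenberg-Marquardt-style damping schedule. The sketch below shows only that control flow; the function computing a candidate weight update from the current damping factor is assumed to be supplied elsewhere and is left abstract.

```python
def train_with_damping(weights, error_of, candidate_update, mu=1e-2, beta=10.0,
                       tol=1e-6, max_iter=200):
    """Iterate a Levenberg-Marquardt-style damping schedule.

    error_of(w)             -> scalar error E(w) obtained from a forward pass
    candidate_update(w, mu) -> proposed new weights (assumed supplied, e.g.
                               the damped Gauss-Newton step)
    """
    current_error = error_of(weights)
    for _ in range(max_iter):
        if current_error < tol:              # error below threshold: training done
            break
        proposal = candidate_update(weights, mu)
        new_error = error_of(proposal)
        if new_error < current_error:        # effective correction: keep it
            weights, current_error = proposal, new_error
            mu /= beta                       # mu decrease
        else:                                # ineffective: keep the old weights
            mu *= beta                       # mu increase
    return weights

# Toy usage: minimise (w - 3)^2 with a damped step
w = train_with_damping(0.0, lambda w: (w - 3.0) ** 2,
                       lambda w, mu: w - (w - 3.0) / (1.0 + mu))
print(w)   # approaches 3.0
```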
Of course, considering that the interference received together with the speaker audio at a microphone includes both linear and non-linear terms, removal of both linear and non-linear interference terms is taken as the output target when training the neural network; the target data of the training samples may therefore be taken from the output of a filter capable of removing linear and non-linear interference terms.
The neural network filter obtained from this training process avoids mistakenly filtering out the characteristic features of the original audio data when removing interference terms, reduces the loss of original-data features, improves the filtering effect, and effectively suppresses various kinds of noise.
To achieve a high-precision estimate of the delay between the audio data received by each of the multiple microphones and by the reference microphone, this embodiment first performs a low-precision coarse estimate based on the global data and then a high-precision fine estimate based on local data. Specifically, for the time data at which each microphone receives the same sound source, the associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which the reference microphone receives it is obtained by the following steps:
(1) acquiring the amplitude variation curves of the sound source data received by the reference microphone and of the sound source data received by the non-reference microphone;
(2) sampling discrete points over the whole of each amplitude variation curve, and comparing the sampled discrete points to obtain coarse time delay data of the sound source data received by the non-reference microphone relative to the reference microphone;
(3) intercepting and sampling a local segment of each amplitude variation curve, and comparing the locally intercepted and sampled curve segments to obtain fine time delay data relative to the reference microphone;
(4) obtaining complete and accurate time delay data of the time at which the non-reference microphone receives the sound source relative to the reference microphone, based on the coarse time delay data and the fine time delay data.
In both the coarse and the fine stage, the delay between the audio to be analysed and the audio of the reference microphone can be calculated with the generalized cross-correlation, taking the peak of the generalized cross-correlation as the delay estimate. For the locally intercepted and sampled audio data of the microphone under analysis and of the reference microphone, high-precision delay data are obtained by interpolating and magnifying the generalized cross-correlation curve to raise its resolution and locating the precise peak on the resolution-enhanced curve as the delay value. Combining the coarse and fine delay data reduces the number of data-point pairs involved in the convolution of the cross-correlation computation and shortens the time required for delay estimation.
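The coarse-to-fine delay estimation described above can be sketched as follows. A generalized cross-correlation with PHAT weighting (one common weighting, assumed here) gives a coarse integer-sample delay, and parabolic interpolation around the correlation peak, standing in for the resolution-raising interpolation of the correlation curve mentioned above, refines it to sub-sample precision.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Coarse + fine time delay (seconds) of `sig` relative to `ref`.

    Coarse step: peak of the PHAT-weighted generalized cross-correlation.
    Fine step  : parabolic interpolation around that peak (a stand-in for the
                 interpolation/magnification of the correlation curve).
    """
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.abs(S) + 1e-12                        # PHAT weighting
    cc = np.fft.irfft(S, n)
    cc = np.concatenate([cc[-(len(ref) - 1):], cc[:len(sig)]])   # centre zero lag

    peak = np.argmax(cc)                          # coarse, integer-sample delay
    if 0 < peak < len(cc) - 1:                    # fine, sub-sample refinement
        y0, y1, y2 = cc[peak - 1], cc[peak], cc[peak + 1]
        peak = peak + 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2 + 1e-12)
    return (peak - (len(ref) - 1)) / fs

# Example: the same chirp delayed by 23 samples at 16 kHz
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
ref = np.sin(2 * np.pi * (200 + 4000 * t) * t)
sig = np.concatenate([np.zeros(23), ref])[:len(ref)]
print(gcc_phat_delay(sig, ref, fs) * fs)          # close to 23 samples
```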
This embodiment also provides a speaker location tracking device based on audio and video characteristics, including:
a coarse localization module configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones receives the same sound source, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-location tracking module configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source location, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
Specific definitions of the speaker positioning and tracking device can be found in the definitions of the speaker positioning and tracking method above and will not be repeated here. The modules in the above speaker positioning and tracking device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device so that the processor can call and execute the operations corresponding to each module.
The present embodiment also provides an electronic device, including:
a memory for storing executable instructions;
and a processor for implementing the above speaker positioning and tracking method when executing the executable instructions stored in the memory.
The present embodiment also provides a computer-readable storage medium storing executable instructions that when executed by a processor implement the speaker location tracking method described above.
The electronic device provided by this embodiment comprises a processor, a memory and a network interface connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device holds an operating system, a computer program and a database, and provides an environment for the operating system and the computer program to run. The database is used for storing the received audio and video data and the like. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the speaker positioning and tracking method based on audio and video features.
It will be appreciated that the memory can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. Non-volatile Memory may include Read Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The present invention is not limited to the above-described embodiments, and those skilled in the art will be able to make various modifications without creative efforts from the above-described conception, and fall within the scope of the present invention.

Claims (10)

1. A speaker positioning and tracking method based on audio and video features is characterized by comprising the following steps:
a coarse localization calculation step configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones collects the same speaker sound, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the speaker sound and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-localization tracking step configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source position, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
2. The method as claimed in claim 1, wherein the preset microphone array distribution structure is configured such that a plurality of microphones are distributed at preset positions on a plurality of spiral lines, the preset positions being configured such that a plurality of position points are arranged uniformly or non-uniformly on one spiral line, the spiral line carrying the position points is duplicated to form the plurality of spiral lines and the position points on them, and the duplicated spiral lines are distributed circumferentially around a fixed point at equal angular intervals.
3. The method as claimed in claim 2, wherein the microphones distributed at preset positions on the plurality of spiral lines are configured such that the preset positions of the plurality of microphones are obtained from a parameter matching combination formed by the number n of duplicated spiral lines, the spiral-line equation parameters a_i (i being the number of parameters in the polar-coordinate equation of the spiral line), and the arc-length growth proportion parameter c between two adjacent position points of the position points arranged uniformly or non-uniformly on one spiral line.
4. The method as claimed in claim 3, wherein the parameter matching combination is configured such that an optimal parameter matching combination is obtained through an intelligent optimization algorithm so that the mixed audio received by the plurality of microphones has an optimal audio effect, the optimal audio effect being configured such that the mixed audio received by the plurality of microphones contains audio at a plurality of frequencies with corresponding amplitudes, wherein the difference between the maximum and minimum frequencies is as large as possible, only one frequency corresponds to the maximum amplitude, and the amplitude at the maximum and minimum frequencies has fallen from the maximum amplitude to a preset value.
5. The method for locating and tracking a speaker based on audio-video features of claim 3, wherein the intelligent optimization algorithm is configured to:
(1) sampling preset parameter matching combinations from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(2) establishing a Kriging agent model between the sampled parameter matching combination and the audio effect of the mixed audio received by the plurality of microphones acquired based on the sampled parameter matching combination;
(3) updating the sampled parameter matching combination by adopting an intelligent optimization algorithm, and determining a new audio effect through a Kriging agent model based on the updated parameter matching combination;
(4) sampling new preset parameter matching combinations, according to a preset sampling method, from the parameter matching combinations formed by the value spaces of the parameters n, a_i, and c;
(5) repeating steps (2) to (4) until the new audio effect determined by the Kriging agent model based on the updated parameter matching combination in step (3) reaches the preset optimal audio effect.
6. The method as claimed in claim 1, wherein, before the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays of the same sound source, the audio of the same sound source received by the microphones is filtered by a neural network model to remove interference terms, so as to obtain the audio data actually originating from the sound source as received by the microphones.
7. A speaker localization tracking method based on audio-video features according to claim 1, wherein, for the time data at which each microphone receives the same sound source, the associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which the reference microphone receives it is obtained by the following steps:
(1) acquiring the amplitude variation curves of the sound source data received by the reference microphone and of the sound source data received by the non-reference microphone;
(2) sampling discrete points over the whole of each amplitude variation curve, and comparing the sampled discrete points to obtain coarse time delay data of the sound source data received by the non-reference microphone relative to the reference microphone;
(3) intercepting and sampling a local segment of each amplitude variation curve, and comparing the locally intercepted and sampled curve segments to obtain fine time delay data relative to the reference microphone;
(4) obtaining complete and accurate time delay data of the time at which the non-reference microphone receives the sound source relative to the reference microphone, based on the coarse time delay data and the fine time delay data.
8. A speaker location tracking device based on audio-video features, comprising:
a coarse localization module configured to i) receive position data of a plurality of microphones, each microphone being placed at a different location according to a preset microphone array distribution structure, and ii) receive the time data at which each of the plurality of microphones receives the same sound source, the time data of each microphone having associated time delay data indicating the difference between the time at which that microphone receives the sound source and the time at which a reference microphone receives it, wherein the position data of the sound source is determined from the different positions of the plurality of microphones and the different time delays with which they receive the same sound source;
a quasi-location tracking module configured to control a camera to capture a video stream containing a plurality of image frames of the speaker object at the sound source location, the camera image center coordinates being kept aligned with the position of each speaker object in the video stream, so that the camera image center is synchronized with the speaker object position.
9. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the speaker location tracking method of any one of claims 1 to 7 when executing the executable instructions stored by the memory.
10. A computer readable storage medium storing executable instructions, wherein the executable instructions when executed by a processor implement the speaker location tracking method of any one of claims 1 to 7.
CN202110955505.3A 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics Withdrawn CN113740803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110955505.3A CN113740803A (en) 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110955505.3A CN113740803A (en) 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics

Publications (1)

Publication Number Publication Date
CN113740803A true CN113740803A (en) 2021-12-03

Family

ID=78731813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110955505.3A Withdrawn CN113740803A (en) 2021-08-19 2021-08-19 Speaker positioning and tracking method and device based on audio and video characteristics

Country Status (1)

Country Link
CN (1) CN113740803A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114205731A (en) * 2021-12-08 2022-03-18 随锐科技集团股份有限公司 Speaker area detection method, device, electronic equipment and storage medium
CN114205731B (en) * 2021-12-08 2023-12-26 随锐科技集团股份有限公司 Speaker area detection method, speaker area detection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7158806B2 (en) Audio recognition methods, methods of locating target audio, their apparatus, and devices and computer programs
CN107534725B (en) Voice signal processing method and device
CN109506568B (en) Sound source positioning method and device based on image recognition and voice recognition
CN106782584B (en) Audio signal processing device, method and electronic device
JP3962063B2 (en) System and method for improving accuracy of localization estimation
CN110610718B (en) Method and device for extracting expected sound source voice signal
WO2021128670A1 (en) Noise reduction method, device, electronic apparatus and readable storage medium
CN108109617A (en) A kind of remote pickup method
KR20110102466A (en) Estimating a sound source location using particle filtering
CN110495185B (en) Voice signal processing method and device
CN111445920A (en) Multi-sound-source voice signal real-time separation method and device and sound pick-up
CN115331692A (en) Noise reduction method, electronic device and storage medium
CN112614508A (en) Audio and video combined positioning method and device, electronic equipment and storage medium
CN113740803A (en) Speaker positioning and tracking method and device based on audio and video characteristics
CN111627456B (en) Noise elimination method, device, equipment and readable storage medium
Hosseini et al. Time difference of arrival estimation of sound source using cross correlation and modified maximum likelihood weighting function
CN111540365B (en) Voice signal determination method, device, server and storage medium
CN110927668A (en) Sound source positioning optimization method of cube microphone array based on particle swarm
CN113409800A (en) Processing method and device for monitoring audio, storage medium and electronic equipment
CN113223552B (en) Speech enhancement method, device, apparatus, storage medium, and program
JP6908142B1 (en) Sound collecting device, sound collecting program, and sound collecting method
CN111239691B (en) Multi-sound-source tracking method for restraining main sound source
WO2020186434A1 (en) Flexible differential microphone arrays with fractional order
WO2023088156A1 (en) Sound velocity correction method and apparatus
Wang et al. A robust generalized sidelobe canceller controlled by a priori sir estimate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20211203