CN107820037B - Audio signal, image processing method, device and system - Google Patents


Info

Publication number
CN107820037B
CN107820037B (application CN201610826122.5A)
Authority
CN
China
Prior art keywords
detected
included angle
calculating
microphone array
preset algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610826122.5A
Other languages
Chinese (zh)
Other versions
CN107820037A (en)
Inventor
任志平 (Ren Zhiping)
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp
Priority: CN201610826122.5A (granted as CN107820037B)
Priority: PCT/CN2017/097397 (published as WO2018049957A1)
Publication of CN107820037A
Application granted
Publication of CN107820037B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18: Position-fixing using ultrasonic, sonic, or infrasonic waves
    • G01S5/22: Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Techniques specially adapted for particular use
    • G10L25/51: Techniques specially adapted for comparison or discrimination
    • G10L25/57: Techniques specially adapted for comparison or discrimination for processing of video signals

Abstract

The invention provides an audio-signal and image processing method, device and system. A first predicted position of an object to be detected is computed from the audio signals collected by a plurality of microphones according to a first preset algorithm; a second predicted position is obtained by filtering the historical positions of the object to be detected according to a second preset algorithm; and the current position of the object to be detected is obtained by combining the first and second predicted positions and correcting them according to the temporal continuity of the audio signal. This solves the problem that, for lack of a speaker-position tracking technique, a remote video conference system can neither display the speaker's position in time nor track and acquire the speaker's multimedia information, and achieves the effect of obtaining the speaker's position in time and acquiring the speaker's multimedia information by tracking.

Description

Audio signal, image processing method, device and system
Technical Field
The invention relates to the field of application of voice recognition technology, in particular to a method, a device and a system for processing audio signals and images.
Background
With the rapid development of video communication technology, remote video conference services are becoming increasingly common. During use of a remote video conference system, how to locate the speaker from his or her voice and display the speaker through the equipment is a problem that existing systems must solve.
For the problem in the related art that, owing to the lack of a speaker-position tracking technique, a remote video conference system can neither display the speaker's position in time nor track and acquire the speaker's multimedia information, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides an audio signal and image processing method, device and system, which are used for at least solving the problems that the position of a speaker cannot be displayed in time and the multimedia information of the speaker cannot be obtained in a tracking manner in a remote video conference system due to the lack of a speaker position tracking technology in the related technology.
According to an embodiment of the present invention, there is provided an audio signal processing method including: calculating according to a first preset algorithm according to the audio signals collected by the microphones to obtain a first predicted position of the object to be detected; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
Optionally, calculating according to a first preset algorithm based on the audio signals collected by the multiple microphones to obtain a first predicted position of the object to be detected includes: classifying a plurality of microphones into a first microphone array and a second microphone array; calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Further, optionally, calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm includes: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm includes: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the second microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, the filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected includes: respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array by a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating through the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
Further, optionally, after obtaining the current position of the object to be detected, the method further includes: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, after obtaining the current position of the object to be detected, the method further includes: and enhancing the voice output of the object to be detected.
According to another embodiment of the present invention, there is provided an image processing method including: acquiring, through a preset microphone array, a first depth value between the first microphone array and an image acquisition device of the display device, and a second depth value between the second microphone array and the image acquisition device of the display device; calculating a first included angle between the first microphone array and the image acquisition device corresponding to the first depth value, and a second included angle between the second microphone array and the image acquisition device corresponding to the second depth value; constructing a multi-dimensional space coordinate system from the first depth value, the second depth value, the first included angle and the second included angle; and acquiring the position of the object to be detected and determining its position within the multi-dimensional space coordinate system.
Optionally, respectively calculating a first type of included angle between the first microphone array and the image acquisition device, which corresponds to the first depth value, and calculating a second type of included angle between the second microphone array and the image acquisition device, which corresponds to the second depth value, include: and calculating a first type included angle and a second type included angle according to the preset conditions of the first depth, the second depth and the actual distance.
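The "preset condition" relating the two depths and the actual distance is not spelled out in the text; one natural reading is the law of cosines applied to the triangle formed by the image acquisition device and the two microphone arrays. A minimal sketch under that assumption:

```python
import math

def camera_angles(d1, d2, mic_dist):
    """Recover the angle at each microphone array between the array baseline
    and the line to the camera, from the two measured depths d1, d2 and the
    known array separation mic_dist, via the law of cosines.
    Interpreting the 'preset condition' this way is an assumption."""
    a1 = math.degrees(math.acos((d1**2 + mic_dist**2 - d2**2) / (2 * d1 * mic_dist)))
    a2 = math.degrees(math.acos((d2**2 + mic_dist**2 - d1**2) / (2 * d2 * mic_dist)))
    return a1, a2
```

With equal depths and an equilateral layout (all sides 1 m), both angles come out at 60°, which is a quick sanity check on the formula.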
According to still another embodiment of the present invention, there is provided an apparatus for audio signal processing, including: the first calculation module is used for calculating according to a first preset algorithm and audio signals collected by a plurality of microphones to obtain a first predicted position of the object to be detected; the second calculation module is used for calculating after filtering the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and the correcting module is used for correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
According to still another embodiment of the present invention, there is provided an apparatus for image processing, including: an acquisition module, configured to acquire, through a preset microphone array, a first depth value between the first microphone array and an image acquisition device of the display device and a second depth value between the second microphone array and the image acquisition device; a calculating module, configured to calculate a first included angle between the first microphone array and the image acquisition device corresponding to the first depth value and a second included angle between the second microphone array and the image acquisition device corresponding to the second depth value; a coordinate space module, configured to construct a multi-dimensional space coordinate system from the first depth value, the second depth value, the first included angle and the second included angle; and a positioning module, configured to acquire the position of the object to be detected and determine its position within the multi-dimensional space coordinate system.
According to an embodiment of the present invention, there is provided an audio and image processing system including: a video conference terminal, an image acquisition device, a depth-image acquisition device, a sound acquisition module composed of a plurality of microphone arrays, and a display device. The sound acquisition module composed of the plurality of microphone arrays is configured to collect audio signals of the object to be detected; the image acquisition device is configured to capture all video images in the meeting place; the depth-image acquisition device is configured to capture a depth image of the meeting place, the depth image being used to obtain position information between the participants and the depth-image acquisition device; and the video conference terminal is configured to track the positions of the participants, display images of participants while they speak, and record the conference.
According to still another embodiment of the present invention, there is also provided a storage medium. The storage medium is configured to store program code for performing the steps of: calculating according to a first preset algorithm according to the audio signals collected by the microphones to obtain a first predicted position of the object to be detected; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating according to a first preset algorithm according to the audio signals collected by the microphones, and obtaining a first predicted position of the object to be detected comprises: classifying a plurality of microphones into a first microphone array and a second microphone array; calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm comprises the following steps: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm comprises: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, Euclidean distances among audio signals collected by all microphones in the second microphone array are calculated; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating after filtering the historical position of the object to be detected according to a second preset algorithm, and obtaining a second predicted position of the object to be detected comprises: respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array by a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
Further, optionally, the storage medium is further configured to store program code for performing the steps of: after obtaining the current position of the object to be detected, the method further comprises: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: after obtaining the current position of the object to be detected, the method further comprises: and enhancing the voice output of the object to be detected.
According to the invention, the first predicted position of the object to be detected is obtained by calculating according to the audio signals collected by the microphones according to the first preset algorithm; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a flow chart of a method of audio signal processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the relationship between two microphone arrays and the speaker position in the audio signal processing method according to the embodiment of the invention;
FIG. 3 is a schematic diagram illustrating the calculation of the speaker position relative to the microphone array in the method for processing audio signals according to the embodiment of the invention;
FIG. 4 is a schematic diagram of the TDOA algorithm in the method of audio signal processing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of localization by the joint multi-microphone-pair TDOA algorithm in a method of audio signal processing according to an embodiment of the present invention;
FIG. 6 is a flow diagram of a method of image processing according to an embodiment of the present invention;
FIG. 7 is a system device layout diagram of a method of image processing according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of measuring the distance between the microphone arrays and the television in the method for image processing according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of calculating, from depth information, the angle between the depth axis of the depth camera and the line connecting the microphone arrays in the method for image processing according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an apparatus for audio signal processing according to an embodiment of the present invention;
fig. 11 is a schematic configuration diagram of an apparatus for image processing according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an audio signal and image processing system according to an embodiment of the present invention; and
FIG. 13 is a schematic diagram of displaying the text corresponding to a voice of interest.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The technical terms related to the embodiments of the present application are:
TDOA: Time Difference of Arrival, a localization algorithm based on the difference in arrival times of a signal at spatially separated sensors.
Example 1
Fig. 1 is a flowchart of a method of audio signal processing according to an embodiment of the present invention, as shown in fig. 1, the flowchart including the steps of:
step S102, calculating according to a first preset algorithm and audio signals collected by a plurality of microphones to obtain a first predicted position of an object to be detected;
step S104, filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected;
and S106, correcting according to the time continuity of the audio signal by combining the first prediction position and the second prediction position to obtain the current position of the object to be detected.
Through the steps, the first predicted position of the object to be detected is obtained by calculating according to the first preset algorithm and the audio signals collected by the microphones; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
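The patent leaves the fusion rule of steps S102 to S106 abstract. A minimal sketch of one plausible combination, in which the two predicted positions are blended and the result is clamped against the previous position to enforce temporal continuity; the weight `w_audio` and threshold `max_jump` are illustrative assumptions, not values from the patent:

```python
import math

def fuse_positions(p_audio, p_filtered, p_last, max_jump=0.5, w_audio=0.6):
    """Blend the TDOA-based prediction with the filter-based prediction,
    then limit implausible jumps relative to the previous position
    (time-continuity correction). Positions are (x, y) tuples in metres."""
    x = w_audio * p_audio[0] + (1 - w_audio) * p_filtered[0]
    y = w_audio * p_audio[1] + (1 - w_audio) * p_filtered[1]
    # A speaker cannot move arbitrarily far between consecutive frames.
    jump = math.hypot(x - p_last[0], y - p_last[1])
    if jump > max_jump:
        # Pull the estimate back toward the last known position.
        scale = max_jump / jump
        x = p_last[0] + (x - p_last[0]) * scale
        y = p_last[1] + (y - p_last[1]) * scale
    return (x, y)
```

When both predictions agree and the speaker has not moved, the function simply returns the agreed position; a sudden large disagreement is damped toward the last known position.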
The audio signal processing method provided by this embodiment can be applied to sound-source tracking and localization, a technology with broad application prospects and practical value. For example, it can detect the position of a speaker and automatically focus the video image on that speaker, so that listeners can better observe the speaker, even perceiving fine facial expressions; the audience thus gains a stronger sense of presence and a better understanding of what the speaker wishes to express.
Optionally, in step S102, calculating according to a first preset algorithm based on the audio signals collected by the multiple microphones to obtain a first predicted position of the object to be detected includes:
step1, classifying the microphones into a first microphone array and a second microphone array;
step2, calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm;
step3, calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Further, optionally, calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm in Step2 includes:
under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm in Step2 includes: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the second microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, in step S104, the filtering and calculating the historical position of the object to be detected according to the second preset algorithm, and obtaining the second predicted position of the object to be detected includes:
respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array through a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating through the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
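A minimal sketch of this second-algorithm step, assuming a one-dimensional Kalman filter over a bearing angle with a gating test playing the role of the "preset condition"; the noise parameters `q`, `r` and the gate width are assumptions, not values from the patent:

```python
class AngleKalman:
    """1-D Kalman filter over a bearing angle (degrees).
    Process noise q and measurement noise r are illustrative."""

    def __init__(self, angle0, q=0.5, r=4.0):
        self.x = angle0   # state: current angle estimate
        self.p = 1.0      # state covariance
        self.q, self.r = q, r

    def step(self, measurements, gate=10.0):
        # Predict (static model: the angle is assumed locally constant).
        self.p += self.q
        # Gate: keep only estimates near the prediction ("preset condition").
        kept = [m for m in measurements if abs(m - self.x) < gate]
        if kept:
            z = sum(kept) / len(kept)        # fuse the surviving estimates
            k = self.p / (self.p + self.r)   # Kalman gain
            self.x += k * (z - self.x)
            self.p *= (1 - k)
        return self.x
```

Estimates far from the predicted angle (e.g. a spurious 80° reading among measurements near 30°) are rejected by the gate and do not disturb the filtered angle.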
Further, optionally, after obtaining the current position of the object to be detected in step S106, the method for processing an audio signal provided in the embodiment of the present invention further includes: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, after obtaining the current position of the object to be detected in step S106, the method for processing an audio signal provided in the embodiment of the present invention further includes: and enhancing the voice output of the object to be detected.
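The patent does not specify how the voice output is enhanced. One common choice consistent with a microphone array is delay-and-sum beamforming toward the detected bearing; the sketch below assumes a uniform linear array, and the geometry and parameters are illustrative:

```python
import numpy as np

def delay_and_sum(signals, mic_spacing, angle_deg, fs, c=343.0):
    """Steer a linear array toward angle_deg by delaying each channel so the
    speaker's wavefront adds coherently; sound from other directions is
    partially cancelled. signals: 2-D array (n_mics, n_samples)."""
    n_mics, n = signals.shape
    out = np.zeros(n)
    for m in range(n_mics):
        # Arrival delay of mic m relative to mic 0, in seconds.
        tau = m * mic_spacing * np.cos(np.radians(angle_deg)) / c
        shift = int(round(tau * fs))
        out += np.roll(signals[m], -shift)
    return out / n_mics
```

For a source at broadside (90°) the per-channel delays are zero and the output is simply the average of the channels, which already improves the signal-to-noise ratio for uncorrelated noise.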
Specifically, the method for processing an audio signal provided by the embodiment of the present invention proceeds as follows:
Fig. 2 is a schematic diagram of the relationship between two microphone arrays and the speaker position in the method for processing audio signals according to an embodiment of the present invention. As shown in fig. 2, assume a conference room has two microphone arrays MicA and MicB, each with four microphones, and regard each array as a set of microphones, i.e., MicA = {MicA0, MicA1, MicA2, MicA3} and MicB = {MicB0, MicB1, MicB2, MicB3}. In a video conference, generally only one person in a given meeting place speaks during any one time period, so we assume that only one person is speaking at time t; the position of the speaker relative to MicA and MicB is shown in fig. 2.
At this time, the angles between the speaker and MicA and MicB are nonzero. Suppose the angle between the speaker and MicA is θ0 and the angle between the speaker and MicB is θ1. Since the distance between MicA and MicB is known, the speaker position can easily be predicted by trigonometry, as shown in fig. 3; fig. 3 is a schematic diagram illustrating the calculation of the speaker position relative to the microphone arrays in the audio signal processing method according to an embodiment of the invention.
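The trigonometric prediction can be sketched as follows, assuming both bearings are measured from the MicA–MicB baseline and MicA is the coordinate origin (the angle convention is an assumption):

```python
import math

def triangulate(theta0_deg, theta1_deg, baseline):
    """Locate the speaker from the two bearings theta0, theta1 (measured
    from the MicA-MicB baseline) and the known array separation.
    Solves x*tan(theta0) = (baseline - x)*tan(theta1)."""
    t0 = math.tan(math.radians(theta0_deg))
    t1 = math.tan(math.radians(theta1_deg))
    x = baseline * t1 / (t0 + t1)   # distance from MicA along the baseline
    y = x * t0                      # perpendicular distance from the baseline
    return x, y
```

A symmetric case is easy to check: with both bearings at 45° and a 2 m baseline, the speaker sits at the midpoint, 1 m off the baseline.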
The angles θ0 and θ1 between the speaker and MicA/MicB can be derived from a time-delay estimation algorithm such as TDOA, as shown in fig. 4; fig. 4 is a schematic diagram of the TDOA algorithm in the audio signal processing method according to an embodiment of the present invention.
Assume that the speed of sound is fixed at γ and that the included angle between the sound source and the MicA0/MicA1 pair is θ0 (the line between MicA0 and MicA1 is parallel to the line between MicA and MicB), with MicA0 and MicA1 spaced l0 apart. Because the distances from the sound source to MicA0 and MicA1 differ, the sound reaches MicA1 and MicA0 at different times, with a time difference Δt:

Δt = l0·cosθ0/γ
This difference appears in the sequences captured by MicA0 and MicA1: the speech sampled by MicA0 is delayed relative to that sampled by MicA1. Assuming the sampling rate of MicA0 and MicA1 is S, the maximum delay between them does not exceed S·l0/γ samples. Under this constraint, let the speech sequence sampled by MicA0 be X = {x0, x1, x2, …, xn} and the sequence sampled by MicA1 be Y = {y0, y1, y2, …, yn}. Shifting X by an offset μ ∈ [−S·l0/γ, S·l0/γ] yields X′ = {x0+μ, x1+μ, x2+μ, …, xn+μ}, and the Euclidean distance between X′ and Y is:

δ(μ) = √( Σk (xk+μ − yk)² )

Over μ ∈ [−S·l0/γ, S·l0/γ], δ has a minimum δmin, with corresponding offset μmin. From μmin, the included angle θ0 between the speaker and the line connecting MicA0 and MicA1 is obtained:

θ0 = arccos( μmin·γ/(S·l0) )
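As an illustration, the delay search just described can be sketched as follows. The function and variable names are assumptions (the patent gives no implementation), and a practical system would typically use a cross-correlation method such as GCC-PHAT rather than this brute-force Euclidean-distance search:

```python
import numpy as np

def pair_angle(x, y, spacing, fs, c=343.0):
    """Estimate the included angle between a sound source and a microphone
    pair by searching the sample offset that minimizes the Euclidean
    distance between the two captured sequences (illustrative sketch)."""
    max_lag = int(fs * spacing / c)       # delay cannot exceed S*l0/gamma samples
    n = len(x) - 2 * max_lag              # window that stays in range for any shift
    best_mu, best_d = 0, float("inf")
    for mu in range(-max_lag, max_lag + 1):
        shifted = x[max_lag + mu : max_lag + mu + n]   # X' = {x_{k+mu}}
        ref = y[max_lag : max_lag + n]
        d = np.linalg.norm(shifted - ref)              # Euclidean distance delta(mu)
        if d < best_d:
            best_mu, best_d = mu, d
    # mu samples of delay correspond to a path difference mu*c/fs = l0*cos(theta)
    cos_theta = np.clip(best_mu * c / (fs * spacing), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```

For a pair spaced 0.1 m apart sampled at 16 kHz, the search range is only ±4 samples, so the loop is cheap even at audio rates.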
MicA has four microphones {MicA0, MicA1, MicA2, MicA3}, giving 6 microphone pairs: {MicA0, MicA1}, {MicA1, MicA2}, {MicA2, MicA3}, {MicA0, MicA2}, {MicA1, MicA3} and {MicA0, MicA3}, as shown in fig. 4. The 6 microphone pairs yield a set of estimates of the speaker direction {θ0,0, θ0,1, θ0,2, θ0,3, θ0,4, θ0,5}, and their mean

θ̂0 = (θ0,0 + θ0,1 + θ0,2 + θ0,3 + θ0,4 + θ0,5)/6

is taken as the prediction of the speaker direction. It has been verified by experiment that a deviation of 5° is allowed. Fig. 5 is a schematic diagram of joint multi-pair TDOA localization in a method of audio signal processing according to an embodiment of the present invention.
Applying the same algorithm to the four microphones {MicB0, MicB1, MicB2, MicB3} of MicB gives the prediction θ̂1 of θ1. From fig. 3, the speaker position prediction p̂ = (x̂, ŷ) is then obtained from θ̂0 and θ̂1 by simple trigonometric operations. Running the algorithm over a period of time yields a series of predictions of the speaker position {p̂1, p̂2, …, p̂t}.
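The trigonometric step can be sketched as standard two-bearing triangulation, with MicA at the origin, MicB at (baseline, 0), and both angles measured against the MicA–MicB line. The patent only says the position follows from "simple trigonometric function operation", so this is one conventional way to realize it:

```python
import math

def triangulate(theta0_deg, theta1_deg, baseline):
    """Intersect the two bearing lines from MicA (origin) and MicB
    ((baseline, 0)) to locate the speaker in the array plane."""
    t0 = math.tan(math.radians(theta0_deg))
    t1 = math.tan(math.radians(theta1_deg))
    x = baseline * t1 / (t0 + t1)   # where the two rays cross
    y = x * t0                      # height above the baseline
    return x, y
```

For example, a speaker seen at 45° from both arrays spaced 2 m apart sits 1 m along and 1 m above the baseline.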
However, due to noise and other interference, the predictions produced by this algorithm alone are not accurate enough. The speaker position is therefore predicted and tracked with Kalman filtering, and the tracked position is used as a constraint on the prediction of the speaker direction angle, improving the accuracy of the joint multi-microphone-pair TDOA localization.
The method comprises the following steps:

Step one: predict the speaker position p̂i at the current time by Kalman filtering, and convert it into predictions θ̂0,i and θ̂1,i of the included angles relative to the line connecting MicA and MicB;
Step two: for each microphone array, calculate the time delay of each microphone pair with the TDOA algorithm and convert it into an included angle relative to the line connecting MicA and MicB, obtaining a set of speaker direction estimates: {θi,0, θi,1, θi,2, θi,3, θi,4, θi,5};
Step three: let θ̄i denote the mean of the estimates {θi,0, …, θi,5}. If |θ̂i − θ̄i| > 5°, the Kalman filtering prediction is considered to deviate too much and is discarded; θ̄i is used directly as the prediction θ̂i of the speaker direction at the current time. Otherwise, the Kalman filtering prediction is considered acceptable, and the estimates whose deviation from θ̂i exceeds 5° are excluded, i.e., Uθ = {θ′i,0, θ′i,1, …, θ′i,n−1} with |θ′i,j − θ̂i| ≤ 5°, 1 ≤ n ≤ 6, 1 ≤ j < n; the mean of Uθ is then taken as the prediction θ̂i of the speaker direction at the current time.
Step four: perform step two and step three on the two microphone arrays to obtain the predictions θ̂0 and θ̂1 of the speaker direction at the current time, obtain the speaker position p̂ by simple trigonometric operations, and update the Kalman filter parameters.
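The gating and fusion of steps one to three can be sketched with a minimal one-dimensional constant-velocity Kalman filter over the direction angle. This is a simplified variant: it gates individual pair estimates against the prediction and falls back to the raw TDOA mean when none survive, and the process/measurement noise values are illustrative assumptions, not values from the patent:

```python
import numpy as np

class AngleTracker:
    """Minimal 1-D Kalman filter that gates and fuses per-pair TDOA
    angle estimates (sketch of steps one to three above)."""
    def __init__(self, angle0, q=0.5, r=2.0, gate_deg=5.0):
        self.x = np.array([angle0, 0.0])           # state: [angle, angular rate]
        self.P = np.eye(2) * 10.0                  # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity model
        self.Q = np.eye(2) * q                     # process noise
        self.r = r                                 # measurement noise
        self.gate = gate_deg

    def step(self, tdoa_estimates):
        # predict (step one)
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        pred = self.x[0]
        # gate (step three): keep estimates within 5 degrees of the prediction
        kept = [t for t in tdoa_estimates if abs(t - pred) <= self.gate]
        if not kept:                               # prediction deviates too much:
            z = float(np.mean(tdoa_estimates))     # fall back to the raw TDOA mean
        else:
            z = float(np.mean(kept))
        # update with the fused measurement
        H = np.array([[1.0, 0.0]])
        S = H @ self.P @ H.T + self.r
        K = (self.P @ H.T) / S
        self.x = self.x + K.flatten() * (z - pred)
        self.P = (np.eye(2) - K @ H) @ self.P
        return float(self.x[0])
```

Fed a steady direction of about 30° with one outlier pair, the tracker converges near 30° while the outlier is rejected by the 5° gate.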
Example 2
Fig. 6 is a flowchart of a method of image processing according to an embodiment of the present invention, as shown in fig. 6, the flowchart including the steps of:
step S602, acquiring, through a preset microphone array, a first depth value between a first microphone array and an image acquisition device of the display device and a second depth value between a second microphone array and the image acquisition device of the display device;
step S604, respectively calculating a first included angle, corresponding to the first depth value, between the first microphone array and the image acquisition device, and a second included angle, corresponding to the second depth value, between the second microphone array and the image acquisition device;
step S606, constructing a multi-dimensional space coordinate system according to the first depth value, the second depth value, the first included angle and the second included angle;
step S608, acquiring a position of the object to be detected, and determining the position of the object to be detected in the multidimensional space coordinate system according to the multidimensional space coordinate system.
Through the steps, the first depth value of the first microphone array and the image acquisition device of the display device and the second depth value of the second microphone array and the image acquisition device of the display device are obtained through the preset microphone array; respectively calculating a first included angle between a first microphone array and the image acquisition equipment, which corresponds to the first depth value, and calculating a second included angle between a second microphone array and the image acquisition equipment, which corresponds to the second depth value; constructing a multi-dimensional space coordinate system according to the first depth value, the second depth value, the first included angle and the second included angle; and acquiring the position of the object to be detected, and determining the position of the object to be detected in the multi-dimensional space coordinate system according to the multi-dimensional space coordinate system. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Optionally, the step S604 of respectively calculating a first type of included angle between the first microphone array and the image capturing device corresponding to the first depth value, and calculating a second type of included angle between the second microphone array and the image capturing device corresponding to the second depth value includes: and calculating a first type included angle and a second type included angle according to the preset conditions of the first depth, the second depth and the actual distance.
In summary, the image processing method provided in the embodiment of the present application is specifically as follows:
the system requires the relative positions of the microphone array, the depth camera, the image camera and the television to be fixed, as shown in fig. 7 below, and fig. 7 is a layout diagram of system equipment of the image processing method according to the embodiment of the invention.
In the system, the distance between MicA and MicB is known, the distance is generally 2-3 m, the width of a television can be measured, and the connecting line between the television and MicA and MicB is kept horizontal. The distance between the television and the connecting lines between MicA and MicB is unknown, and the television is placed according to the area of a conference room. When the system is installed for the first time, the video conference device controls the tv to play a pre-recorded voice, and estimates the positions (including directions and distances) of the tv relative to MicA and MicB through the above-mentioned joint multi-microphone-pair TDOA algorithm, as shown in fig. 8, fig. 8 is a schematic diagram of measuring tv distances by using a microphone array in the image processing method according to the embodiment of the present invention.
Because the microphone arrays have a distinctive shape and color, they can be identified in the image camera, and the corresponding depth information can be obtained from the depth camera. Assuming the depth of MicA is depth0 and the depth of MicB is depth1, the angles of the camera relative to MicA and MicB can be calculated using trigonometric functions.
Fig. 9 is a schematic diagram of the angle between the depth axis of the depth camera and the line connecting the microphone arrays, derived from depth information, in the image processing method according to the embodiment of the present invention. As shown in fig. 9, the microphone arrays locate the speaker in a direction and position relative to the arrays, while the depth camera locates the speaker in a direction and position relative to the camera; since the system needs both kinds of information to achieve accurate speaker localization, the coordinate systems must be converted. The microphone arrays used in this system can only resolve a two-dimensional position, corresponding to the lateral and depth axes of the depth camera, i.e., the x axis and the z axis. Assume that, in the two-dimensional space of the microphone arrays, the coordinates of MicA are (0, 0) and the coordinates of MicB are (length, 0), where length is the array spacing. After localization by the microphone arrays, the coordinates of the depth camera (at the same position as the television) are (x, y); its direction relative to MicA is θ0 and its direction relative to MicB is θ1. The depth of MicA in the depth camera is depth0 and the depth of MicB is depth1. According to the above information, although the depth information is not equal to the actual distance, the actual distance satisfies:
y=f(depth)
where y0 = f(depth0) and y1 = f(depth1) are the actual distances of MicA and MicB from the depth camera along the depth direction. From trigonometric functions (the sine rule in the triangle formed by the camera, MicA and MicB) we can obtain:

(y0/cosθ2)/sinθ1 = (y1/cosθ3)/sinθ0

namely:

y0·sinθ0·cosθ3 = y1·sinθ1·cosθ2

From the knowledge of triangle geometry, θ2 and θ3 satisfy:

θ2 + θ3 = θ0 + θ1

The final calculation gives:

θ3 = arctan[ (y0·sinθ0 − y1·sinθ1·cos(θ0 + θ1)) / (y1·sinθ1·sin(θ0 + θ1)) ]

θ2 = (θ0 + θ1) − θ3
note that only the case where θ 2 and θ 3 and θ 0 and θ 1 are both acute angles is analyzed here, and the other cases are similar. According to the method, the two-dimensional space position where the microphone array is positioned relative to the microphone array can be converted into the left-right and depth axis positions in the three-dimensional space of the depth camera.
Since the user may change the camera angle during use (the positions of the microphone arrays are fixed), the system must update its parameters automatically once the camera angle changes, repeating the above procedure to convert the two-dimensional positions of the microphone arrays into the lateral and depth axis positions in the three-dimensional space of the depth camera. Because the user may change the camera angle during a meeting, the preset recording cannot be played; instead, the system uses the far-end single-talk detection of the echo cancellation algorithm to determine time periods in which only the television is playing and no local speaker is talking, ensuring that the procedure is not disturbed and the calculation result is sufficiently accurate. With this method the speaker position estimated by the microphone arrays can be converted to a position in the depth/image camera, and the speaker's position in the depth/image camera can then be obtained by algorithms such as skin color detection and face detection.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 3
In this embodiment, an audio signal processing apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 10 is a schematic structural diagram of an apparatus for audio signal processing according to an embodiment of the present invention, as shown in fig. 10, the apparatus including:
the first calculating module 1002 is configured to calculate according to a first preset algorithm based on audio signals acquired by multiple microphones to obtain a first predicted position of an object to be detected; the second calculating module 1004 is configured to calculate after filtering the historical position of the object to be detected according to a second preset algorithm, so as to obtain a second predicted position of the object to be detected; the correcting module 1006 is configured to correct the first predicted position and the second predicted position according to the temporal continuity of the audio signal, so as to obtain a current position of the object to be detected.
In the audio signal processing device according to the embodiment of the invention, the first predicted position of the object to be detected is obtained by calculating according to the audio signals collected by the microphones according to the first preset algorithm; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Example 4
Fig. 11 is a schematic structural diagram of an apparatus for image processing according to an embodiment of the present invention, as shown in fig. 11, the apparatus including:
the acquisition module 1102 is configured to acquire a first depth value of the first microphone array and an image acquisition device of the display device and a second depth value of the second microphone array and the image acquisition device of the display device through a preset microphone array; a calculating module 1104, configured to calculate a first type of included angle between the first microphone array and the image capturing device corresponding to the first depth value and calculate a second type of included angle between the second microphone array and the image capturing device corresponding to the second depth value, respectively; a coordinate space module 1106, configured to construct a multi-dimensional space coordinate system according to the first depth value, the second depth value, the first included angle and the second included angle; the obtaining module 1108 is configured to obtain a position of the object to be detected, and determine the position of the object to be detected in the multidimensional space coordinate system according to the multidimensional space coordinate system.
In the image processing apparatus according to the embodiment of the present invention, the multi-dimensional space coordinate system is constructed according to the first depth value, the second depth value, the first included angle, and the second included angle. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Example 5
According to an embodiment of the present invention, there is provided an audio signal, image processing system including: the system comprises a video conference terminal, image acquisition equipment, depth image acquisition equipment, a sound acquisition module consisting of a plurality of microphone arrays and display equipment, wherein the sound acquisition module consisting of the plurality of microphone arrays is used for acquiring audio signals of an object to be detected; the image acquisition equipment is used for acquiring all video images in the meeting place; the depth image acquisition equipment is used for acquiring a depth image in the meeting place, and the depth image is used for acquiring position information between the participants and the depth image acquisition equipment; and the video conference terminal is used for tracking the positions of the participants, displaying the images of the participants during speaking and recording the conference.
In summary, with reference to embodiments 1 to 5, the audio signal and image processing method, apparatus and system provided in the embodiments of the present application are specifically as follows:
firstly, the system predicts and tracks the position of the speaker in real time according to a TDOA algorithm by combining multiple microphones, predicts and tracks the position of the speaker by using Kalman filtering, and carries out self-correction according to the continuity of voice signals in time to obtain accurate estimation of the position of the speaker.
In addition, a depth camera is fixedly arranged in the system, and the depth information of each indoor participant is acquired through the depth camera and is used as a constraint condition to adjust the estimation result of the microphone array on the position of the speaker.
Next, the system feeds back the obtained speaker position information to the system image camera to capture the speaker image.
Finally, the speaker voice is identified or enhanced according to the information, and the result is finally presented to the user, wherein the result can be in the form of dynamic subtitles or a conference record with the speaker image.
The hardware part of the system comprises: the system comprises a video conference terminal, an image camera, a depth camera, two microphone arrays A and B and a television.
The method and the system realize that the specific interested speaker can be automatically positioned and tracked according to the selection of the user in the video conference process, thereby further realizing the enhancement of specific voice and the audio signal processing to present dynamic subtitles or conference records for the user. The scheme has the advantages of real-time, simplicity, convenience and quickness, and has the characteristics of more accurate and real-time positioning and tracking.
The interested audio signal processing, enhancing and exhibiting are specifically as follows:
the method can calculate the position of the speaker through the microphone array, acquire the relative position information of the microphone array by combining the image and the depth camera, and finally associate the speaker with the depth/image camera and determine the position relation. The user can set the voice as the interesting voice when a speaker in the system speaks to extract the voice of the speaker; the speaker's voice can also be extracted by selecting a participant in the video image of the system, and then setting his voice as the voice of interest. In addition, the voice in the direction of the interested voice can be enhanced by utilizing a beam forming algorithm, and the voice in the direction of the non-interested voice can be inhibited. The face detection algorithm can also be utilized to obtain the head portrait of the speaker, and the head portrait of the speaker and the content information in the conference process are displayed to the user by combining the audio signal processing algorithm.
Fig. 12 is a schematic structural diagram of an audio signal and image processing system according to an embodiment of the present invention, and as shown in fig. 12, the audio signal processing method of interest: the user selects a participant in the video image of the system, and takes the participant as an interested voice speaker, and the steps are as follows:
step 1: detecting whether a local speaker speaks in real time during system operation, if the local speaker speaks, estimating the position of the speaker by using a microphone array, and converting the position of the speaker to the left and right positions and the depth axis position of a three-dimensional space of a depth camera;
step 2: the local or remote participant selects the region where the interested speaker is located in the video image by using a mouse or touch control, and the person in the region is used as the interested speaker;
step 3: the system determines the face characteristics of the speaker of interest, tracks the speaker with a face tracking algorithm, updates the position of the speaker of interest in real time and converts it into the speaker position estimated by the microphone array;
step 4: enhancing the voice in the direction of the voice of interest and suppressing voices in other directions using a beamforming algorithm.
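The beamforming step can be illustrated with a minimal delay-and-sum beamformer for a linear array. This is a stand-in under simplifying assumptions (integer-sample delays, far-field source); the patent does not specify which beamforming algorithm is used, and real systems would apply fractional delays or frequency-domain weights:

```python
import numpy as np

def delay_and_sum(channels, mic_positions, theta_deg, fs, c=343.0):
    """Steer a linear array toward the direction of interest by delaying
    each channel and averaging (integer-sample delays for simplicity)."""
    theta = np.radians(theta_deg)
    out = np.zeros_like(channels[0], dtype=float)
    for sig, pos in zip(channels, mic_positions):
        # samples by which this microphone leads the reference for angle theta
        delay = int(round(fs * pos * np.cos(theta) / c))
        out += np.roll(sig, -delay)     # advance the channel to align it
    return out / len(channels)
```

At broadside (90°) all delays are zero, so coherent signals add in phase while signals from other directions are attenuated by the misaligned summation.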
The interested voice display method specifically comprises the following steps:
after the interested voice is processed by the audio signal, the speaking content of the speaker can be obtained. If a user needs to use an interested audio signal processing and enhancing method to identify an interested speaker, a face region image is obtained directly in a selected region through a face detection and tracking algorithm, the voice of the interested speaker can be identified through the interested audio signal processing and enhancing method, so that the system can obtain the content (text mode) of the speaking of a certain interested speaker in a certain time period and the face image of the interested speaker, and finally a static image-text conference record or a real-time caption which is easy for the user to watch and trace can be presented to the user by utilizing the information.
Of course, the above records the voice content of a specific speaker; recording the voice content of everyone over the whole conference is a different process. First, during the conference the system performs real-time face recognition on the images collected by the image camera to determine the facial features of all participants in the field of view, detecting continuously so as to handle participants temporarily leaving or joining. Then, when participants speak, the positions of all of them relative to the microphone arrays are determined by the method above, so that the voices of the speakers (there may be several) are enhanced, recognized and stored as text; real-time subtitles or complete conference records are generated by combining the speaker head images extracted from the image camera, the records are stored with time as the reference, and corresponding editing operations such as filtering and screening are supported. As shown in fig. 13, fig. 13 is a schematic diagram of a text display method corresponding to the voice of interest.
Example 6
The embodiment of the invention also provides a storage medium. Alternatively, in the present embodiment, the storage medium may be configured to store program codes for performing the following steps:
s1, calculating according to a first preset algorithm and audio signals collected by a plurality of microphones to obtain a first predicted position of the object to be detected;
s2, filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected;
and S3, correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
Optionally, the storage medium is further arranged to store program code for performing the steps of:
s1, calculating according to a first preset algorithm based on the audio signals collected by the microphones, and obtaining a first predicted position of the object to be detected includes: classifying a plurality of microphones into a first microphone array and a second microphone array; calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm comprises the following steps: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among the audio signals collected by the microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm comprises: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the second microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating after filtering the historical position of the object to be detected according to a second preset algorithm, and obtaining a second predicted position of the object to be detected comprises: respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array by a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating through the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
Further, optionally, the storage medium is further configured to store program code for performing the steps of: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: after obtaining the current position of the object to be detected, the method further comprises: and enhancing the voice output of the object to be detected.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A method of audio signal processing, comprising:
calculating, according to a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by a plurality of microphones;
filtering a historical position of the object to be detected according to a second preset algorithm and then calculating a second predicted position of the object to be detected;
performing correction according to the temporal continuity of the audio signals by combining the first predicted position and the second predicted position, to obtain a current position of the object to be detected;
wherein the calculating, according to the first preset algorithm, the first predicted position of the object to be detected from the audio signals collected by the microphones comprises:
dividing the microphones into a first microphone array and a second microphone array;
calculating a first included angle between the object to be detected and the first microphone array according to the first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and
calculating, according to a preset trigonometric function, the first predicted position of the object to be detected from the first included angle and the second included angle;
wherein the filtering the historical position of the object to be detected according to the second preset algorithm and then calculating the second predicted position of the object to be detected comprises:
calculating, by the first preset algorithm, a first estimation value set of a first prediction angle of the first microphone array and a second estimation value set of a second prediction angle of the second microphone array, respectively;
in a case where the second preset algorithm is a Kalman filtering algorithm, determining, by the Kalman filtering algorithm, whether the first estimation value set and the second estimation value set satisfy preset conditions, respectively;
determining the first included angle and the second included angle according to a result of the determining; and
calculating, according to a preset trigonometric function, the second predicted position of the object to be detected from the first included angle and the second included angle.
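The trigonometric step in claim 1 amounts to intersecting two bearing lines, one from each microphone array. A minimal sketch, assuming a 2-D plane with known array positions and angles measured from the +x axis (the patent does not fix a coordinate system, so the geometry here is illustrative):

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing lines: array i at point pi = (x, y) observes
    the sound source at angle theta_i (radians from the +x axis).
    Returns the intersection point, i.e. the predicted source position."""
    # Line i: (x, y) = pi + t_i * (cos(theta_i), sin(theta_i))
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        raise ValueError("bearing lines are (nearly) parallel")
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
```

With arrays at (0, 0) and (2, 0) observing the source at 45° and 135° respectively, the two bearings intersect at (1, 1).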
2. The method according to claim 1, wherein the calculating the first included angle between the object to be detected and the first microphone array according to the first preset algorithm comprises:
in a case where the first preset algorithm is a time difference of arrival (TDOA) algorithm, calculating Euclidean distances between the audio signals collected by the microphones in the first microphone array;
calculating an estimation value set of the first included angle according to the relation between the Euclidean distances and the first included angle; and
calculating the mean of the estimation value set of the first included angle, and determining the mean as the first included angle.
3. The method according to claim 1, wherein the calculating the second included angle between the object to be detected and the second microphone array according to the first preset algorithm comprises:
in a case where the first preset algorithm is a time difference of arrival (TDOA) algorithm, calculating Euclidean distances between the audio signals collected by the microphones in the second microphone array;
calculating an estimation value set of the second included angle according to the relation between the Euclidean distances and the second included angle; and
calculating the mean of the estimation value set of the second included angle, and determining the mean as the second included angle.
4. The method according to claim 1, wherein after obtaining the current position of the object to be detected, the method further comprises:
and updating the Kalman filter parameters according to the current position of the object to be detected.
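The parameter update of claim 4 can be pictured with a minimal one-dimensional Kalman filter; the scalar state and random-walk motion model are assumptions for illustration, not the patent's specification:

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter, a stand-in for the patent's unspecified
    filter over the object's position or angle."""

    def __init__(self, x0, p0, q, r):
        self.x, self.p = x0, p0  # state estimate and its variance
        self.q, self.r = q, r    # process and measurement noise variances

    def predict(self):
        # Random-walk motion model (assumption): state unchanged,
        # uncertainty grows by the process noise.
        self.p += self.q
        return self.x

    def update(self, z):
        # Claim 4: refresh the filter with the current position z.
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x
```

With equal prior and measurement variance, the update lands halfway between the prediction and the measurement, and the variance halves.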
5. The method according to any one of claims 1 to 4, wherein after obtaining the current location of the object to be detected, the method further comprises:
and enhancing the voice output of the object to be detected.
6. An apparatus for audio signal processing, comprising:
the first calculation module is configured to calculate, according to a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by a plurality of microphones;
the second calculation module is configured to filter a historical position of the object to be detected according to a second preset algorithm and then calculate a second predicted position of the object to be detected;
the correction module is configured to perform correction according to the temporal continuity of the audio signals by combining the first predicted position and the second predicted position, to obtain a current position of the object to be detected;
the first calculation module is further configured to divide the microphones into a first microphone array and a second microphone array, calculate a first included angle between the object to be detected and the first microphone array according to the first preset algorithm, calculate a second included angle between the object to be detected and the second microphone array according to the first preset algorithm, and calculate, according to a preset trigonometric function, the first predicted position of the object to be detected from the first included angle and the second included angle; and
the second calculation module is further configured to calculate, by the first preset algorithm, a first estimation value set of a first prediction angle of the first microphone array and a second estimation value set of a second prediction angle of the second microphone array, respectively, determine, by a Kalman filtering algorithm, whether the first estimation value set and the second estimation value set satisfy preset conditions, determine the first included angle and the second included angle according to a result of the determination, and calculate, according to a preset trigonometric function, the second predicted position of the object to be detected from the first included angle and the second included angle.
CN201610826122.5A 2016-09-14 2016-09-14 Audio signal, image processing method, device and system Active CN107820037B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610826122.5A CN107820037B (en) 2016-09-14 2016-09-14 Audio signal, image processing method, device and system
PCT/CN2017/097397 WO2018049957A1 (en) 2016-09-14 2017-08-14 Audio signal, image processing method, device, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610826122.5A CN107820037B (en) 2016-09-14 2016-09-14 Audio signal, image processing method, device and system

Publications (2)

Publication Number Publication Date
CN107820037A CN107820037A (en) 2018-03-20
CN107820037B true CN107820037B (en) 2021-03-26

Family

ID=61600778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610826122.5A Active CN107820037B (en) 2016-09-14 2016-09-14 Audio signal, image processing method, device and system

Country Status (2)

Country Link
CN (1) CN107820037B (en)
WO (1) WO2018049957A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312295B (en) * 2018-12-12 2022-06-28 深圳市冠旭电子股份有限公司 Holographic sound recording method and device and recording equipment
CN109683135A (en) * 2018-12-28 2019-04-26 科大讯飞股份有限公司 A kind of sound localization method and device, target capturing system
CN109547735B (en) * 2019-01-18 2024-04-16 海南科先电子科技有限公司 Conference integration system
CN110660102B (en) * 2019-06-17 2020-10-27 腾讯科技(深圳)有限公司 Speaker recognition method, device and system based on artificial intelligence
CN110632582B (en) * 2019-09-25 2022-03-29 苏州科达科技股份有限公司 Sound source positioning method, device and storage medium
CN110730378A (en) * 2019-11-01 2020-01-24 联想(北京)有限公司 Information processing method and system
CN112868061A (en) * 2019-11-29 2021-05-28 深圳市大疆创新科技有限公司 Environment detection method, electronic device and computer-readable storage medium
CN112198498A (en) * 2020-09-11 2021-01-08 海创半导体科技(深圳)有限公司 Method for measuring distance by using intelligent voice module

Citations (7)

Publication number Priority date Publication date Assignee Title
CN1460185A (en) * 2001-03-30 2003-12-03 皇家菲利浦电子有限公司 Method and apparatus for audio-image speaker detection and location
CN101030323A (en) * 2007-04-23 2007-09-05 凌子龙 Automatic evidence collecting device on crossroad for vehicle horning against traffic regulation
CN101377885A (en) * 2007-08-28 2009-03-04 凌子龙 Electronic workstation for obtaining evidence of vehicle peccancy whistle and method thereof
CN102256098A (en) * 2010-05-18 2011-11-23 宝利通公司 Videoconferencing endpoint having multiple voice-tracking cameras
CN204539315U (en) * 2015-04-02 2015-08-05 尹煜敏 A kind of video conference machine of auditory localization
CN105607042A (en) * 2014-11-19 2016-05-25 北京航天长峰科技工业集团有限公司 Method for locating sound source through microphone array time delay estimation
CN105657329A (en) * 2016-02-26 2016-06-08 苏州科达科技股份有限公司 Video conference system, processing device and video conference method

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US7039199B2 (en) * 2002-08-26 2006-05-02 Microsoft Corporation System and process for locating a speaker using 360 degree sound source localization
US7924655B2 (en) * 2007-01-16 2011-04-12 Microsoft Corp. Energy-based sound source localization and gain normalization
CN101201399B (en) * 2007-12-18 2012-01-11 北京中星微电子有限公司 Sound localization method and system
CN101656908A (en) * 2008-08-19 2010-02-24 深圳华为通信技术有限公司 Method for controlling sound focusing, communication device and communication system
CN102843543B (en) * 2012-09-17 2015-01-21 华为技术有限公司 Video conferencing reminding method, device and video conferencing system
CN103841357A (en) * 2012-11-21 2014-06-04 中兴通讯股份有限公司 Microphone array sound source positioning method, device and system based on video tracking
US9451360B2 (en) * 2014-01-14 2016-09-20 Cisco Technology, Inc. Muting a sound source with an array of microphones
CN105588543B (en) * 2014-10-22 2019-10-18 中兴通讯股份有限公司 A kind of method, apparatus and positioning system for realizing positioning based on camera


Also Published As

Publication number Publication date
CN107820037A (en) 2018-03-20
WO2018049957A1 (en) 2018-03-22


Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20180426

Address after: No. 55, Nanshan District science and technology road, Nanshan District, Shenzhen, Guangdong

Applicant after: ZTE Corporation

Address before: 210012 No. 68 Bauhinia Road, Yuhuatai District, Jiangsu, Nanjing

Applicant before: Nanjing Zhongxing New Software Co., Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant