CN107820037B - Audio signal, image processing method, device and system - Google Patents


Info

Publication number
CN107820037B
CN107820037B (application CN201610826122.5A)
Authority
CN
China
Prior art keywords
detected
included angle
calculating
microphone array
preset algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610826122.5A
Other languages
Chinese (zh)
Other versions
CN107820037A (en)
Inventor
任志平 (Ren Zhiping)
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp
Priority: CN201610826122.5A (granted as CN107820037B)
Priority: PCT/CN2017/097397 (published as WO2018049957A1)
Publication of CN107820037A
Application granted
Publication of CN107820037B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00: Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18: Position-fixing using ultrasonic, sonic, or infrasonic waves
    • G01S5/22: Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Techniques specially adapted for particular use
    • G10L25/51: Techniques specially adapted for comparison or discrimination
    • G10L25/57: Techniques specially adapted for comparison or discrimination for processing of video signals

Abstract

The invention provides an audio-signal and image processing method, device and system. A first predicted position of an object to be detected is computed from the audio signals collected by a plurality of microphones according to a first preset algorithm; a second predicted position is obtained by filtering the historical positions of the object to be detected according to a second preset algorithm; and the current position of the object to be detected is obtained by combining the first and second predicted positions and correcting them according to the temporal continuity of the audio signal. This solves the problem that, for lack of a speaker-position tracking technique, a remote video conference system can neither display the speaker's position in time nor track and acquire the speaker's multimedia information, and achieves the effect of obtaining the speaker's position in time and acquiring the speaker's multimedia information by tracking.

Description

Audio signal, image processing method, device and system
Technical Field
The invention relates to the field of application of voice recognition technology, in particular to a method, a device and a system for processing audio signals and images.
Background
With the rapid development of video communication technology, remote video conference services are becoming increasingly common. During use of a remote video conference system, how to locate the speaker from his or her voice and display the speaker through the equipment is a problem that existing systems must solve.
For the problem in the related art that, owing to the lack of a speaker-position tracking technique, a remote video conference system can neither display the speaker's position in time nor track and acquire the speaker's multimedia information, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides an audio signal and image processing method, device and system, which are used for at least solving the problems that the position of a speaker cannot be displayed in time and the multimedia information of the speaker cannot be obtained in a tracking manner in a remote video conference system due to the lack of a speaker position tracking technology in the related technology.
According to an embodiment of the present invention, there is provided an audio signal processing method including: calculating according to a first preset algorithm according to the audio signals collected by the microphones to obtain a first predicted position of the object to be detected; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
Optionally, calculating according to a first preset algorithm based on the audio signals collected by the multiple microphones to obtain a first predicted position of the object to be detected includes: classifying a plurality of microphones into a first microphone array and a second microphone array; calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Further, optionally, calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm includes: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm includes: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the second microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, the filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected includes: respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array by a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating through the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
Further, optionally, after obtaining the current position of the object to be detected, the method further includes: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, after obtaining the current position of the object to be detected, the method further includes: and enhancing the voice output of the object to be detected.
According to another embodiment of the present invention, there is provided an image processing method including: acquiring, through a preset microphone array, a first depth value between the first microphone array and an image acquisition device of the display device, and a second depth value between the second microphone array and the image acquisition device of the display device; calculating a first included angle between the first microphone array and the image acquisition device corresponding to the first depth value, and a second included angle between the second microphone array and the image acquisition device corresponding to the second depth value; constructing a multi-dimensional space coordinate system from the first depth value, the second depth value, the first included angle and the second included angle; and acquiring the position of the object to be detected and determining its position within the multi-dimensional space coordinate system.
Optionally, respectively calculating a first type of included angle between the first microphone array and the image acquisition device, which corresponds to the first depth value, and calculating a second type of included angle between the second microphone array and the image acquisition device, which corresponds to the second depth value, include: and calculating a first type included angle and a second type included angle according to the preset conditions of the first depth, the second depth and the actual distance.
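The "preset condition" relating the two depths and the actual distance is not spelled out in the text; one natural reading is the law of cosines applied to the triangle formed by the image acquisition device and the two microphone arrays. A minimal sketch under that assumption:

```python
import math

def camera_angles(d1, d2, mic_dist):
    """Recover the angle at each microphone array between the array baseline
    and the line to the camera, from the two measured depths d1, d2 and the
    known array separation mic_dist, via the law of cosines.
    Interpreting the 'preset condition' this way is an assumption."""
    a1 = math.degrees(math.acos((d1**2 + mic_dist**2 - d2**2) / (2 * d1 * mic_dist)))
    a2 = math.degrees(math.acos((d2**2 + mic_dist**2 - d1**2) / (2 * d2 * mic_dist)))
    return a1, a2
```

With equal depths and an equilateral layout (all sides 1 m), both angles come out at 60°, which is a quick sanity check on the formula.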
According to still another embodiment of the present invention, there is provided an apparatus for audio signal processing, including: the first calculation module is used for calculating according to a first preset algorithm and audio signals collected by a plurality of microphones to obtain a first predicted position of the object to be detected; the second calculation module is used for calculating after filtering the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and the correcting module is used for correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
According to still another embodiment of the present invention, there is provided an apparatus for image processing, including: an acquisition module, configured to acquire, through a preset microphone array, a first depth value between the first microphone array and an image acquisition device of the display device and a second depth value between the second microphone array and the image acquisition device; a calculating module, configured to calculate a first included angle between the first microphone array and the image acquisition device corresponding to the first depth value and a second included angle between the second microphone array and the image acquisition device corresponding to the second depth value; a coordinate space module, configured to construct a multi-dimensional space coordinate system from the first depth value, the second depth value, the first included angle and the second included angle; and a positioning module, configured to acquire the position of the object to be detected and determine its position within the multi-dimensional space coordinate system.
According to an embodiment of the present invention, there is provided an audio and image processing system including: a video conference terminal, an image acquisition device, a depth-image acquisition device, a sound acquisition module composed of a plurality of microphone arrays, and a display device. The sound acquisition module composed of the plurality of microphone arrays is configured to collect audio signals of the object to be detected; the image acquisition device is configured to capture all video images in the meeting place; the depth-image acquisition device is configured to capture a depth image of the meeting place, the depth image being used to obtain position information between the participants and the depth-image acquisition device; and the video conference terminal is configured to track the positions of the participants, display images of participants while they speak, and record the conference.
According to still another embodiment of the present invention, there is also provided a storage medium. The storage medium is configured to store program code for performing the steps of: calculating according to a first preset algorithm according to the audio signals collected by the microphones to obtain a first predicted position of the object to be detected; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating according to a first preset algorithm according to the audio signals collected by the microphones, and obtaining a first predicted position of the object to be detected comprises: classifying a plurality of microphones into a first microphone array and a second microphone array; calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm comprises the following steps: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm comprises: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, Euclidean distances among audio signals collected by all microphones in the second microphone array are calculated; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating after filtering the historical position of the object to be detected according to a second preset algorithm, and obtaining a second predicted position of the object to be detected comprises: respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array by a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
Further, optionally, the storage medium is further configured to store program code for performing the steps of: after obtaining the current position of the object to be detected, the method further comprises: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: after obtaining the current position of the object to be detected, the method further comprises: and enhancing the voice output of the object to be detected.
According to the invention, the first predicted position of the object to be detected is obtained by calculating according to the audio signals collected by the microphones according to the first preset algorithm; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a flow chart of a method of audio signal processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the relationship between two microphone arrays and the speaker position in the audio signal processing method according to the embodiment of the invention;
FIG. 3 is a schematic diagram illustrating the calculation of the speaker position relative to the microphone array in the method for processing audio signals according to the embodiment of the invention;
FIG. 4 is a schematic diagram of the TDOA algorithm in the method of audio signal processing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of localization by the joint multi-microphone-pair TDOA algorithm in a method of audio signal processing according to an embodiment of the present invention;
FIG. 6 is a flow diagram of a method of image processing according to an embodiment of the present invention;
FIG. 7 is a system device layout diagram of a method of image processing according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of measuring the distance between the microphone arrays and the television in the method for image processing according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of calculating, from depth information, the angle between the depth axis of the depth camera and the line connecting the microphone arrays in the method for image processing according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an apparatus for audio signal processing according to an embodiment of the present invention;
fig. 11 is a schematic configuration diagram of an apparatus for image processing according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an audio signal and image processing system according to an embodiment of the present invention; and
FIG. 13 is a schematic diagram of displaying the text corresponding to a voice of interest.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The technical terms related to the embodiments of the present application are:
TDOA: Time Difference of Arrival, a localization algorithm based on the difference in arrival times of a signal at spatially separated sensors.
Example 1
Fig. 1 is a flowchart of a method of audio signal processing according to an embodiment of the present invention, as shown in fig. 1, the flowchart including the steps of:
step S102, calculating according to a first preset algorithm and audio signals collected by a plurality of microphones to obtain a first predicted position of an object to be detected;
step S104, filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected;
and S106, correcting according to the time continuity of the audio signal by combining the first prediction position and the second prediction position to obtain the current position of the object to be detected.
Through the steps, the first predicted position of the object to be detected is obtained by calculating according to the first preset algorithm and the audio signals collected by the microphones; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
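The patent leaves the fusion rule of steps S102 to S106 abstract. A minimal sketch of one plausible combination, in which the two predicted positions are blended and the result is clamped against the previous position to enforce temporal continuity; the weight `w_audio` and threshold `max_jump` are illustrative assumptions, not values from the patent:

```python
import math

def fuse_positions(p_audio, p_filtered, p_last, max_jump=0.5, w_audio=0.6):
    """Blend the TDOA-based prediction with the filter-based prediction,
    then limit implausible jumps relative to the previous position
    (time-continuity correction). Positions are (x, y) tuples in metres."""
    x = w_audio * p_audio[0] + (1 - w_audio) * p_filtered[0]
    y = w_audio * p_audio[1] + (1 - w_audio) * p_filtered[1]
    # A speaker cannot move arbitrarily far between consecutive frames.
    jump = math.hypot(x - p_last[0], y - p_last[1])
    if jump > max_jump:
        # Pull the estimate back toward the last known position.
        scale = max_jump / jump
        x = p_last[0] + (x - p_last[0]) * scale
        y = p_last[1] + (y - p_last[1]) * scale
    return (x, y)
```

When both predictions agree and the speaker has not moved, the function simply returns the agreed position; a sudden large disagreement is damped toward the last known position.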
The audio signal processing method provided by this embodiment can be applied to sound-source tracking and localization, a technology with broad application prospects and practical value. For example, it can detect the position of a speaker and automatically focus the video image on that speaker, so that listeners can better observe the speaker, even perceiving fine facial expressions; the audience thus gains a stronger sense of presence and a better understanding of what the speaker wishes to express.
Optionally, in step S102, calculating according to a first preset algorithm based on the audio signals collected by the multiple microphones to obtain a first predicted position of the object to be detected includes:
step1, classifying the microphones into a first microphone array and a second microphone array;
step2, calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm;
step3, calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Further, optionally, calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm in Step2 includes:
under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm in Step2 includes: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the second microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, in step S104, the filtering and calculating the historical position of the object to be detected according to the second preset algorithm, and obtaining the second predicted position of the object to be detected includes:
respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array through a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating through the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
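A minimal sketch of this second-algorithm step, assuming a one-dimensional Kalman filter over a bearing angle with a gating test playing the role of the "preset condition"; the noise parameters `q`, `r` and the gate width are assumptions, not values from the patent:

```python
class AngleKalman:
    """1-D Kalman filter over a bearing angle (degrees).
    Process noise q and measurement noise r are illustrative."""

    def __init__(self, angle0, q=0.5, r=4.0):
        self.x = angle0   # state: current angle estimate
        self.p = 1.0      # state covariance
        self.q, self.r = q, r

    def step(self, measurements, gate=10.0):
        # Predict (static model: the angle is assumed locally constant).
        self.p += self.q
        # Gate: keep only estimates near the prediction ("preset condition").
        kept = [m for m in measurements if abs(m - self.x) < gate]
        if kept:
            z = sum(kept) / len(kept)        # fuse the surviving estimates
            k = self.p / (self.p + self.r)   # Kalman gain
            self.x += k * (z - self.x)
            self.p *= (1 - k)
        return self.x
```

Estimates far from the predicted angle (e.g. a spurious 80° reading among measurements near 30°) are rejected by the gate and do not disturb the filtered angle.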
Further, optionally, after obtaining the current position of the object to be detected in step S106, the method for processing an audio signal provided in the embodiment of the present invention further includes: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, after obtaining the current position of the object to be detected in step S106, the method for processing an audio signal provided in the embodiment of the present invention further includes: and enhancing the voice output of the object to be detected.
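The patent does not specify how the voice output is enhanced. One common choice consistent with a microphone array is delay-and-sum beamforming toward the detected bearing; the sketch below assumes a uniform linear array, and the geometry and parameters are illustrative:

```python
import numpy as np

def delay_and_sum(signals, mic_spacing, angle_deg, fs, c=343.0):
    """Steer a linear array toward angle_deg by delaying each channel so the
    speaker's wavefront adds coherently; sound from other directions is
    partially cancelled. signals: 2-D array (n_mics, n_samples)."""
    n_mics, n = signals.shape
    out = np.zeros(n)
    for m in range(n_mics):
        # Arrival delay of mic m relative to mic 0, in seconds.
        tau = m * mic_spacing * np.cos(np.radians(angle_deg)) / c
        shift = int(round(tau * fs))
        out += np.roll(signals[m], -shift)
    return out / n_mics
```

For a source at broadside (90°) the per-channel delays are zero and the output is simply the average of the channels, which already improves the signal-to-noise ratio for uncorrelated noise.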
Specifically, the method for processing an audio signal provided by the embodiment of the present invention proceeds as follows:
Fig. 2 is a schematic diagram of the relationship between two microphone arrays and the speaker position in the method for processing audio signals according to an embodiment of the present invention. As shown in fig. 2, assume a conference room has two microphone arrays MicA and MicB, each with four microphones, and regard each array as a set of microphones, i.e., MicA = {MicA0, MicA1, MicA2, MicA3} and MicB = {MicB0, MicB1, MicB2, MicB3}. In a video conference, generally only one person in a given meeting place speaks during any one time period, so we assume that only one person is speaking at time t; the position of the speaker relative to MicA and MicB is shown in fig. 2.
At this time, the angles between the speaker and MicA and MicB are nonzero. Suppose the angle between the speaker and MicA is θ0 and the angle between the speaker and MicB is θ1. Since the distance between MicA and MicB is known, the speaker position can easily be predicted by trigonometry, as shown in fig. 3; fig. 3 is a schematic diagram illustrating the calculation of the speaker position relative to the microphone arrays in the audio signal processing method according to an embodiment of the invention.
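The trigonometric prediction can be sketched as follows, assuming both bearings are measured from the MicA–MicB baseline and MicA is the coordinate origin (the angle convention is an assumption):

```python
import math

def triangulate(theta0_deg, theta1_deg, baseline):
    """Locate the speaker from the two bearings theta0, theta1 (measured
    from the MicA-MicB baseline) and the known array separation.
    Solves x*tan(theta0) = (baseline - x)*tan(theta1)."""
    t0 = math.tan(math.radians(theta0_deg))
    t1 = math.tan(math.radians(theta1_deg))
    x = baseline * t1 / (t0 + t1)   # distance from MicA along the baseline
    y = x * t0                      # perpendicular distance from the baseline
    return x, y
```

A symmetric case is easy to check: with both bearings at 45° and a 2 m baseline, the speaker sits at the midpoint, 1 m off the baseline.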
The angles θ0 and θ1 between the speaker and MicA/MicB can be derived from a time-delay estimation algorithm such as TDOA, as shown in fig. 4; fig. 4 is a schematic diagram of the TDOA algorithm in the audio signal processing method according to an embodiment of the present invention.
Assume that the speed of sound is fixed at γ and that the included angle between the sound source and the MicA0/MicA1 pair is θ0 (the line between MicA0 and MicA1 is parallel to the line between MicA and MicB), with MicA0 and MicA1 spaced l0 apart. Because the distances from the sound source to MicA0 and MicA1 differ, the sound reaches MicA1 and MicA0 at different times, with a time difference Δt:

Δt = l0·cosθ0/γ
This difference appears in the sequences captured by MicA0 and MicA1: the speech sampled by MicA0 is delayed relative to that sampled by MicA1. Assuming the sampling rate of MicA0 and MicA1 is S, the maximum delay between them does not exceed S·l0/γ samples. Under this constraint, let the speech sequence sampled by MicA0 be X = {x0, x1, x2, …, xn} and the sequence sampled by MicA1 be Y = {y0, y1, y2, …, yn}. Shifting X by an offset μ ∈ [−S·l0/γ, S·l0/γ] yields X′ = {x0+μ, x1+μ, x2+μ, …, xn+μ}, and the Euclidean distance between X′ and Y is:

δ(μ) = √( Σk (xk+μ − yk)² )

Over μ ∈ [−S·l0/γ, S·l0/γ], δ has a minimum δmin, with corresponding offset μmin. From μmin, the included angle θ0 between the speaker and the line connecting MicA0 and MicA1 is obtained:

θ0 = arccos( μmin·γ/(S·l0) )
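As an illustration, the delay search just described can be sketched as follows. The function and variable names are assumptions (the patent gives no implementation), and a practical system would typically use a cross-correlation method such as GCC-PHAT rather than this brute-force Euclidean-distance search:

```python
import numpy as np

def pair_angle(x, y, spacing, fs, c=343.0):
    """Estimate the included angle between a sound source and a microphone
    pair by searching the sample offset that minimizes the Euclidean
    distance between the two captured sequences (illustrative sketch)."""
    max_lag = int(fs * spacing / c)       # delay cannot exceed S*l0/gamma samples
    n = len(x) - 2 * max_lag              # window that stays in range for any shift
    best_mu, best_d = 0, float("inf")
    for mu in range(-max_lag, max_lag + 1):
        shifted = x[max_lag + mu : max_lag + mu + n]   # X' = {x_{k+mu}}
        ref = y[max_lag : max_lag + n]
        d = np.linalg.norm(shifted - ref)              # Euclidean distance delta(mu)
        if d < best_d:
            best_mu, best_d = mu, d
    # mu samples of delay correspond to a path difference mu*c/fs = l0*cos(theta)
    cos_theta = np.clip(best_mu * c / (fs * spacing), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```

For a pair spaced 0.1 m apart sampled at 16 kHz, the search range is only ±4 samples, so the loop is cheap even at audio rates.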
MicA has four microphones {MicA0, MicA1, MicA2, MicA3}, giving 6 microphone pairs: {MicA0, MicA1}, {MicA1, MicA2}, {MicA2, MicA3}, {MicA0, MicA2}, {MicA1, MicA3} and {MicA0, MicA3}, as shown in fig. 4. The 6 microphone pairs yield a set of estimates of the speaker direction {θ0,0, θ0,1, θ0,2, θ0,3, θ0,4, θ0,5}, and their mean

θ̂0 = (θ0,0 + θ0,1 + θ0,2 + θ0,3 + θ0,4 + θ0,5)/6

is taken as the prediction of the speaker direction. It has been verified by experiment that a deviation of 5° is allowed. Fig. 5 is a schematic diagram of joint multi-pair TDOA localization in a method of audio signal processing according to an embodiment of the present invention.
Applying the same algorithm to the four microphones {MicB0, MicB1, MicB2, MicB3} of MicB gives the prediction θ̂1 of θ1. From fig. 3, the speaker position prediction p̂ = (x̂, ŷ) is then obtained from θ̂0 and θ̂1 by simple trigonometric operations. Running the algorithm over a period of time yields a series of predictions of the speaker position {p̂1, p̂2, …, p̂t}.
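The trigonometric step can be sketched as standard two-bearing triangulation, with MicA at the origin, MicB at (baseline, 0), and both angles measured against the MicA–MicB line. The patent only says the position follows from "simple trigonometric function operation", so this is one conventional way to realize it:

```python
import math

def triangulate(theta0_deg, theta1_deg, baseline):
    """Intersect the two bearing lines from MicA (origin) and MicB
    ((baseline, 0)) to locate the speaker in the array plane."""
    t0 = math.tan(math.radians(theta0_deg))
    t1 = math.tan(math.radians(theta1_deg))
    x = baseline * t1 / (t0 + t1)   # where the two rays cross
    y = x * t0                      # height above the baseline
    return x, y
```

For example, a speaker seen at 45° from both arrays spaced 2 m apart sits 1 m along and 1 m above the baseline.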
However, due to noise and other interference, the predictions produced by this algorithm alone are not accurate enough. The speaker position is therefore predicted and tracked with Kalman filtering, and the tracked position is used as a constraint on the prediction of the speaker direction angle, improving the accuracy of the joint multi-microphone-pair TDOA localization.
The method comprises the following steps:

Step one: predict the speaker position p̂i at the current time by Kalman filtering, and convert it into predictions θ̂0,i and θ̂1,i of the included angles relative to the line connecting MicA and MicB;
Step two: for each microphone array, calculate the time delay of each microphone pair with the TDOA algorithm and convert it into an included angle relative to the line connecting MicA and MicB, obtaining a set of speaker direction estimates: {θi,0, θi,1, θi,2, θi,3, θi,4, θi,5};
Step three: let θ̄i denote the mean of the estimates {θi,0, …, θi,5}. If |θ̂i − θ̄i| > 5°, the Kalman filtering prediction is considered to deviate too much and is discarded; θ̄i is used directly as the prediction θ̂i of the speaker direction at the current time. Otherwise, the Kalman filtering prediction is considered acceptable, and the estimates whose deviation from θ̂i exceeds 5° are excluded, i.e., Uθ = {θ′i,0, θ′i,1, …, θ′i,n−1} with |θ′i,j − θ̂i| ≤ 5°, 1 ≤ n ≤ 6, 1 ≤ j < n; the mean of Uθ is then taken as the prediction θ̂i of the speaker direction at the current time.
Step four: perform step two and step three on the two microphone arrays to obtain the predictions θ̂0 and θ̂1 of the speaker direction at the current time, obtain the speaker position p̂ by simple trigonometric operations, and update the Kalman filter parameters.
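The gating and fusion of steps one to three can be sketched with a minimal one-dimensional constant-velocity Kalman filter over the direction angle. This is a simplified variant: it gates individual pair estimates against the prediction and falls back to the raw TDOA mean when none survive, and the process/measurement noise values are illustrative assumptions, not values from the patent:

```python
import numpy as np

class AngleTracker:
    """Minimal 1-D Kalman filter that gates and fuses per-pair TDOA
    angle estimates (sketch of steps one to three above)."""
    def __init__(self, angle0, q=0.5, r=2.0, gate_deg=5.0):
        self.x = np.array([angle0, 0.0])           # state: [angle, angular rate]
        self.P = np.eye(2) * 10.0                  # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])  # constant-velocity model
        self.Q = np.eye(2) * q                     # process noise
        self.r = r                                 # measurement noise
        self.gate = gate_deg

    def step(self, tdoa_estimates):
        # predict (step one)
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        pred = self.x[0]
        # gate (step three): keep estimates within 5 degrees of the prediction
        kept = [t for t in tdoa_estimates if abs(t - pred) <= self.gate]
        if not kept:                               # prediction deviates too much:
            z = float(np.mean(tdoa_estimates))     # fall back to the raw TDOA mean
        else:
            z = float(np.mean(kept))
        # update with the fused measurement
        H = np.array([[1.0, 0.0]])
        S = H @ self.P @ H.T + self.r
        K = (self.P @ H.T) / S
        self.x = self.x + K.flatten() * (z - pred)
        self.P = (np.eye(2) - K @ H) @ self.P
        return float(self.x[0])
```

Fed a steady direction of about 30° with one outlier pair, the tracker converges near 30° while the outlier is rejected by the 5° gate.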
Example 2
Fig. 6 is a flowchart of a method of image processing according to an embodiment of the present invention, as shown in fig. 6, the flowchart including the steps of:
step S602, acquiring, through a preset microphone array, a first depth value between a first microphone array and an image acquisition device of the display device and a second depth value between a second microphone array and the image acquisition device of the display device;
step S604, respectively calculating a first included angle, corresponding to the first depth value, between the first microphone array and the image acquisition device, and a second included angle, corresponding to the second depth value, between the second microphone array and the image acquisition device;
step S606, constructing a multi-dimensional space coordinate system according to the first depth value, the second depth value, the first included angle and the second included angle;
step S608, acquiring a position of the object to be detected, and determining the position of the object to be detected in the multidimensional space coordinate system according to the multidimensional space coordinate system.
Through the steps, the first depth value of the first microphone array and the image acquisition device of the display device and the second depth value of the second microphone array and the image acquisition device of the display device are obtained through the preset microphone array; respectively calculating a first included angle between a first microphone array and the image acquisition equipment, which corresponds to the first depth value, and calculating a second included angle between a second microphone array and the image acquisition equipment, which corresponds to the second depth value; constructing a multi-dimensional space coordinate system according to the first depth value, the second depth value, the first included angle and the second included angle; and acquiring the position of the object to be detected, and determining the position of the object to be detected in the multi-dimensional space coordinate system according to the multi-dimensional space coordinate system. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Optionally, the step S604 of respectively calculating a first type of included angle between the first microphone array and the image capturing device corresponding to the first depth value, and calculating a second type of included angle between the second microphone array and the image capturing device corresponding to the second depth value includes: and calculating a first type included angle and a second type included angle according to the preset conditions of the first depth, the second depth and the actual distance.
In summary, the image processing method provided in the embodiment of the present application is specifically as follows:
the system requires the relative positions of the microphone array, the depth camera, the image camera and the television to be fixed, as shown in fig. 7 below, and fig. 7 is a layout diagram of system equipment of the image processing method according to the embodiment of the invention.
In the system, the distance between MicA and MicB is known, the distance is generally 2-3 m, the width of a television can be measured, and the connecting line between the television and MicA and MicB is kept horizontal. The distance between the television and the connecting lines between MicA and MicB is unknown, and the television is placed according to the area of a conference room. When the system is installed for the first time, the video conference device controls the tv to play a pre-recorded voice, and estimates the positions (including directions and distances) of the tv relative to MicA and MicB through the above-mentioned joint multi-microphone-pair TDOA algorithm, as shown in fig. 8, fig. 8 is a schematic diagram of measuring tv distances by using a microphone array in the image processing method according to the embodiment of the present invention.
Because the microphone arrays have a distinctive shape and color, they can be identified in the image camera, and the corresponding depth information can be obtained from the depth camera. Assuming the depth of MicA is depth0 and the depth of MicB is depth1, the angles of the camera relative to MicA and MicB can be calculated using trigonometric functions.
Fig. 9 is a schematic diagram of the angle between the depth axis of the depth camera and the line connecting the microphone arrays, derived from depth information, in the image processing method according to the embodiment of the present invention. As shown in fig. 9, the microphone arrays locate the speaker in a direction and position relative to the arrays, while the depth camera locates the speaker in a direction and position relative to the camera; since the system needs both kinds of information to achieve accurate speaker localization, the coordinate systems must be converted. The microphone arrays used in this system can only resolve a two-dimensional position, corresponding to the lateral and depth axes of the depth camera, i.e., the x axis and the z axis. Assume that, in the two-dimensional space of the microphone arrays, the coordinates of MicA are (0, 0) and the coordinates of MicB are (length, 0), where length is the array spacing. After localization by the microphone arrays, the coordinates of the depth camera (at the same position as the television) are (x, y); its direction relative to MicA is θ0 and its direction relative to MicB is θ1. The depth of MicA in the depth camera is depth0 and the depth of MicB is depth1. According to the above information, although the depth information is not equal to the actual distance, the actual distance satisfies:
y=f(depth)
where y0 = f(depth0) and y1 = f(depth1) are the actual distances of MicA and MicB from the depth camera along the depth direction. From trigonometric functions (the sine rule in the triangle formed by the camera, MicA and MicB) we can obtain:

(y0/cosθ2)/sinθ1 = (y1/cosθ3)/sinθ0

namely:

y0·sinθ0·cosθ3 = y1·sinθ1·cosθ2

From the knowledge of triangle geometry, θ2 and θ3 satisfy:

θ2 + θ3 = θ0 + θ1

The final calculation gives:

θ3 = arctan[ (y0·sinθ0 − y1·sinθ1·cos(θ0 + θ1)) / (y1·sinθ1·sin(θ0 + θ1)) ]

θ2 = (θ0 + θ1) − θ3
note that only the case where θ 2 and θ 3 and θ 0 and θ 1 are both acute angles is analyzed here, and the other cases are similar. According to the method, the two-dimensional space position where the microphone array is positioned relative to the microphone array can be converted into the left-right and depth axis positions in the three-dimensional space of the depth camera.
Since the user may change the camera angle during use (the positions of the microphone arrays are fixed), the system must update its parameters automatically once the camera angle changes, repeating the above procedure to convert the two-dimensional positions of the microphone arrays into the lateral and depth axis positions in the three-dimensional space of the depth camera. Because the user may change the camera angle during a meeting, the preset recording cannot be played; instead, the system uses the far-end single-talk detection of the echo cancellation algorithm to determine time periods in which only the television is playing and no local speaker is talking, ensuring that the procedure is not disturbed and the calculation result is sufficiently accurate. With this method the speaker position estimated by the microphone arrays can be converted to a position in the depth/image camera, and the speaker's position in the depth/image camera can then be obtained by algorithms such as skin color detection and face detection.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 3
In this embodiment, an audio signal processing apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted for brevity. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 10 is a schematic structural diagram of an apparatus for audio signal processing according to an embodiment of the present invention, as shown in fig. 10, the apparatus including:
the first calculating module 1002 is configured to calculate according to a first preset algorithm based on audio signals acquired by multiple microphones to obtain a first predicted position of an object to be detected; the second calculating module 1004 is configured to calculate after filtering the historical position of the object to be detected according to a second preset algorithm, so as to obtain a second predicted position of the object to be detected; the correcting module 1006 is configured to correct the first predicted position and the second predicted position according to the temporal continuity of the audio signal, so as to obtain a current position of the object to be detected.
In the audio signal processing device according to the embodiment of the invention, the first predicted position of the object to be detected is obtained by calculating according to the audio signals collected by the microphones according to the first preset algorithm; filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected; and correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Example 4
Fig. 11 is a schematic structural diagram of an apparatus for image processing according to an embodiment of the present invention, as shown in fig. 11, the apparatus including:
the acquisition module 1102 is configured to acquire a first depth value of the first microphone array and an image acquisition device of the display device and a second depth value of the second microphone array and the image acquisition device of the display device through a preset microphone array; a calculating module 1104, configured to calculate a first type of included angle between the first microphone array and the image capturing device corresponding to the first depth value and calculate a second type of included angle between the second microphone array and the image capturing device corresponding to the second depth value, respectively; a coordinate space module 1106, configured to construct a multi-dimensional space coordinate system according to the first depth value, the second depth value, the first included angle and the second included angle; the obtaining module 1108 is configured to obtain a position of the object to be detected, and determine the position of the object to be detected in the multidimensional space coordinate system according to the multidimensional space coordinate system.
In the image processing apparatus according to the embodiment of the present invention, the multi-dimensional space coordinate system is constructed according to the first depth value, the second depth value, the first included angle, and the second included angle. Therefore, the problems that the position of the speaker cannot be displayed in time and the multimedia information of the speaker cannot be tracked and obtained in the remote video conference system due to the lack of the position tracking technology of the speaker can be solved, and the effects of timely obtaining the position of the speaker and tracking and obtaining the multimedia information of the speaker can be achieved.
Example 5
According to an embodiment of the present invention, there is provided an audio signal, image processing system including: the system comprises a video conference terminal, image acquisition equipment, depth image acquisition equipment, a sound acquisition module consisting of a plurality of microphone arrays and display equipment, wherein the sound acquisition module consisting of the plurality of microphone arrays is used for acquiring audio signals of an object to be detected; the image acquisition equipment is used for acquiring all video images in the meeting place; the depth image acquisition equipment is used for acquiring a depth image in the meeting place, and the depth image is used for acquiring position information between the participants and the depth image acquisition equipment; and the video conference terminal is used for tracking the positions of the participants, displaying the images of the participants during speaking and recording the conference.
In summary, with reference to embodiments 1 to 5, the audio signal and image processing method, apparatus and system provided in the embodiments of the present application are specifically as follows:
firstly, the system predicts and tracks the position of the speaker in real time according to a TDOA algorithm by combining multiple microphones, predicts and tracks the position of the speaker by using Kalman filtering, and carries out self-correction according to the continuity of voice signals in time to obtain accurate estimation of the position of the speaker.
In addition, a depth camera is fixedly arranged in the system, and the depth information of each indoor participant is acquired through the depth camera and is used as a constraint condition to adjust the estimation result of the microphone array on the position of the speaker.
Next, the system feeds back the obtained speaker position information to the system image camera to capture the speaker image.
Finally, the speaker voice is identified or enhanced according to the information, and the result is finally presented to the user, wherein the result can be in the form of dynamic subtitles or a conference record with the speaker image.
The hardware part of the system comprises: the system comprises a video conference terminal, an image camera, a depth camera, two microphone arrays A and B and a television.
The method and the system realize that the specific interested speaker can be automatically positioned and tracked according to the selection of the user in the video conference process, thereby further realizing the enhancement of specific voice and the audio signal processing to present dynamic subtitles or conference records for the user. The scheme has the advantages of real-time, simplicity, convenience and quickness, and has the characteristics of more accurate and real-time positioning and tracking.
The interested audio signal processing, enhancing and exhibiting are specifically as follows:
the method can calculate the position of the speaker through the microphone array, acquire the relative position information of the microphone array by combining the image and the depth camera, and finally associate the speaker with the depth/image camera and determine the position relation. The user can set the voice as the interesting voice when a speaker in the system speaks to extract the voice of the speaker; the speaker's voice can also be extracted by selecting a participant in the video image of the system, and then setting his voice as the voice of interest. In addition, the voice in the direction of the interested voice can be enhanced by utilizing a beam forming algorithm, and the voice in the direction of the non-interested voice can be inhibited. The face detection algorithm can also be utilized to obtain the head portrait of the speaker, and the head portrait of the speaker and the content information in the conference process are displayed to the user by combining the audio signal processing algorithm.
Fig. 12 is a schematic structural diagram of an audio signal and image processing system according to an embodiment of the present invention, and as shown in fig. 12, the audio signal processing method of interest: the user selects a participant in the video image of the system, and takes the participant as an interested voice speaker, and the steps are as follows:
step 1: detecting whether a local speaker speaks in real time during system operation, if the local speaker speaks, estimating the position of the speaker by using a microphone array, and converting the position of the speaker to the left and right positions and the depth axis position of a three-dimensional space of a depth camera;
step 2: the local or remote participant selects the region where the interested speaker is located in the video image by using a mouse or touch control, and the person in the region is used as the interested speaker;
step 3: the system determines the face characteristics of the speaker of interest, tracks the speaker with a face tracking algorithm, updates the position of the speaker of interest in real time and converts it into the speaker position estimated by the microphone array;
step 4: enhancing the voice in the direction of the voice of interest and suppressing voices in other directions using a beamforming algorithm.
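The beamforming step can be illustrated with a minimal delay-and-sum beamformer for a linear array. This is a stand-in under simplifying assumptions (integer-sample delays, far-field source); the patent does not specify which beamforming algorithm is used, and real systems would apply fractional delays or frequency-domain weights:

```python
import numpy as np

def delay_and_sum(channels, mic_positions, theta_deg, fs, c=343.0):
    """Steer a linear array toward the direction of interest by delaying
    each channel and averaging (integer-sample delays for simplicity)."""
    theta = np.radians(theta_deg)
    out = np.zeros_like(channels[0], dtype=float)
    for sig, pos in zip(channels, mic_positions):
        # samples by which this microphone leads the reference for angle theta
        delay = int(round(fs * pos * np.cos(theta) / c))
        out += np.roll(sig, -delay)     # advance the channel to align it
    return out / len(channels)
```

At broadside (90°) all delays are zero, so coherent signals add in phase while signals from other directions are attenuated by the misaligned summation.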
The interested voice display method specifically comprises the following steps:
after the interested voice is processed by the audio signal, the speaking content of the speaker can be obtained. If a user needs to use an interested audio signal processing and enhancing method to identify an interested speaker, a face region image is obtained directly in a selected region through a face detection and tracking algorithm, the voice of the interested speaker can be identified through the interested audio signal processing and enhancing method, so that the system can obtain the content (text mode) of the speaking of a certain interested speaker in a certain time period and the face image of the interested speaker, and finally a static image-text conference record or a real-time caption which is easy for the user to watch and trace can be presented to the user by utilizing the information.
Of course, the above records the voice content of a specific speaker; recording the voice content of everyone over the whole conference is a different process. First, during the conference the system performs real-time face recognition on the images collected by the image camera to determine the facial features of all participants in the field of view, detecting continuously so as to handle participants temporarily leaving or joining. Then, when participants speak, the positions of all of them relative to the microphone arrays are determined by the method above, so that the voices of the speakers (there may be several) are enhanced, recognized and stored as text; real-time subtitles or complete conference records are generated by combining the speaker head images extracted from the image camera, the records are stored with time as the reference, and corresponding editing operations such as filtering and screening are supported. As shown in fig. 13, fig. 13 is a schematic diagram of a text display method corresponding to the voice of interest.
Example 6
The embodiment of the invention also provides a storage medium. Alternatively, in the present embodiment, the storage medium may be configured to store program codes for performing the following steps:
s1, calculating according to a first preset algorithm and audio signals collected by a plurality of microphones to obtain a first predicted position of the object to be detected;
s2, filtering and calculating the historical position of the object to be detected according to a second preset algorithm to obtain a second predicted position of the object to be detected;
and S3, correcting according to the time continuity of the audio signal by combining the first predicted position and the second predicted position to obtain the current position of the object to be detected.
Optionally, the storage medium is further arranged to store program code for performing the steps of:
s1, calculating according to a first preset algorithm based on the audio signals collected by the microphones, and obtaining a first predicted position of the object to be detected includes: classifying a plurality of microphones into a first microphone array and a second microphone array; calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and calculating to obtain a first predicted position of the object to be detected through the first included angle and the second included angle according to a preset trigonometric function.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a first included angle between the object to be detected and the first microphone array according to a first preset algorithm comprises the following steps: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among the audio signals collected by the microphones in the first microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the first included angle to obtain an estimated value set of the first included angle; and calculating the mean value of the estimation value set of the first included angle, and determining the mean value as the first included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating a second included angle between the object to be detected and the second microphone array according to a first preset algorithm comprises: under the condition that the first preset algorithm is a time difference of arrival algorithm TDOA, calculating Euclidean distances among audio signals collected by all microphones in the second microphone array; calculating according to the relation between the Euclidean distance between the audio signals collected by each microphone and the second included angle to obtain an estimated value set of the second included angle; and calculating the mean value of the estimation value set of the second included angle, and determining the mean value as the second included angle.
Optionally, the storage medium is further arranged to store program code for performing the steps of: calculating after filtering the historical position of the object to be detected according to a second preset algorithm, and obtaining a second predicted position of the object to be detected comprises: respectively calculating a first estimation value set of a first prediction angle of a first microphone array and a second estimation value set of a second prediction angle of a second microphone array by a first preset algorithm; under the condition that the second preset algorithm is a Kalman filtering algorithm, respectively judging whether the first estimated value set and the second estimated value set meet preset conditions through the Kalman filtering algorithm; determining a first included angle and a second included angle according to the judgment result; and calculating through the first included angle and the second included angle according to a preset trigonometric function to obtain a second predicted position of the object to be detected.
Further, optionally, the storage medium is further configured to store program code for performing the steps of: and updating the Kalman filter parameters according to the current position of the object to be detected.
Further, optionally, the storage medium is further arranged to store program code for performing the steps of: after obtaining the current position of the object to be detected, the method further comprises: and enhancing the voice output of the object to be detected.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A method of audio signal processing, comprising:
calculating, according to a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by a plurality of microphones;
filtering a historical position of the object to be detected according to a second preset algorithm and then calculating a second predicted position of the object to be detected;
performing correction according to the temporal continuity of the audio signals by combining the first predicted position and the second predicted position, to obtain a current position of the object to be detected;
wherein the calculating, according to the first preset algorithm, the first predicted position of the object to be detected from the audio signals collected by the microphones comprises:
dividing the microphones into a first microphone array and a second microphone array;
calculating a first included angle between the object to be detected and the first microphone array according to the first preset algorithm, and calculating a second included angle between the object to be detected and the second microphone array according to the first preset algorithm; and
calculating, according to a preset trigonometric function, the first predicted position of the object to be detected from the first included angle and the second included angle;
wherein the filtering the historical position of the object to be detected according to the second preset algorithm and then calculating the second predicted position of the object to be detected comprises:
calculating, by the first preset algorithm, a first estimation value set of a first prediction angle of the first microphone array and a second estimation value set of a second prediction angle of the second microphone array, respectively;
in a case where the second preset algorithm is a Kalman filtering algorithm, determining, by the Kalman filtering algorithm, whether the first estimation value set and the second estimation value set satisfy preset conditions, respectively;
determining the first included angle and the second included angle according to a result of the determining; and
calculating, according to a preset trigonometric function, the second predicted position of the object to be detected from the first included angle and the second included angle.
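The trigonometric step in claim 1 amounts to intersecting two bearing lines, one from each microphone array. A minimal sketch, assuming a 2-D plane with known array positions and angles measured from the +x axis (the patent does not fix a coordinate system, so the geometry here is illustrative):

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing lines: array i at point pi = (x, y) observes
    the sound source at angle theta_i (radians from the +x axis).
    Returns the intersection point, i.e. the predicted source position."""
    # Line i: (x, y) = pi + t_i * (cos(theta_i), sin(theta_i))
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        raise ValueError("bearing lines are (nearly) parallel")
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])
```

With arrays at (0, 0) and (2, 0) observing the source at 45° and 135° respectively, the two bearings intersect at (1, 1).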
2. The method according to claim 1, wherein the calculating the first included angle between the object to be detected and the first microphone array according to the first preset algorithm comprises:
in a case where the first preset algorithm is a time difference of arrival (TDOA) algorithm, calculating Euclidean distances between the audio signals collected by the microphones in the first microphone array;
calculating an estimation value set of the first included angle according to the relation between the Euclidean distances and the first included angle; and
calculating the mean of the estimation value set of the first included angle, and determining the mean as the first included angle.
3. The method according to claim 1, wherein the calculating the second included angle between the object to be detected and the second microphone array according to the first preset algorithm comprises:
in a case where the first preset algorithm is a time difference of arrival (TDOA) algorithm, calculating Euclidean distances between the audio signals collected by the microphones in the second microphone array;
calculating an estimation value set of the second included angle according to the relation between the Euclidean distances and the second included angle; and
calculating the mean of the estimation value set of the second included angle, and determining the mean as the second included angle.
4. The method according to claim 1, wherein after obtaining the current position of the object to be detected, the method further comprises:
and updating the Kalman filter parameters according to the current position of the object to be detected.
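The parameter update of claim 4 can be pictured with a minimal one-dimensional Kalman filter; the scalar state and random-walk motion model are assumptions for illustration, not the patent's specification:

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter, a stand-in for the patent's unspecified
    filter over the object's position or angle."""

    def __init__(self, x0, p0, q, r):
        self.x, self.p = x0, p0  # state estimate and its variance
        self.q, self.r = q, r    # process and measurement noise variances

    def predict(self):
        # Random-walk motion model (assumption): state unchanged,
        # uncertainty grows by the process noise.
        self.p += self.q
        return self.x

    def update(self, z):
        # Claim 4: refresh the filter with the current position z.
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x
```

With equal prior and measurement variance, the update lands halfway between the prediction and the measurement, and the variance halves.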
5. The method according to any one of claims 1 to 4, wherein after obtaining the current location of the object to be detected, the method further comprises:
and enhancing the voice output of the object to be detected.
6. An apparatus for audio signal processing, comprising:
the first calculation module is configured to calculate, according to a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by a plurality of microphones;
the second calculation module is configured to filter a historical position of the object to be detected according to a second preset algorithm and then calculate a second predicted position of the object to be detected;
the correction module is configured to perform correction according to the temporal continuity of the audio signals by combining the first predicted position and the second predicted position, to obtain a current position of the object to be detected;
the first calculation module is further configured to divide the microphones into a first microphone array and a second microphone array, calculate a first included angle between the object to be detected and the first microphone array according to the first preset algorithm, calculate a second included angle between the object to be detected and the second microphone array according to the first preset algorithm, and calculate, according to a preset trigonometric function, the first predicted position of the object to be detected from the first included angle and the second included angle; and
the second calculation module is further configured to calculate, by the first preset algorithm, a first estimation value set of a first prediction angle of the first microphone array and a second estimation value set of a second prediction angle of the second microphone array, respectively, determine, by a Kalman filtering algorithm, whether the first estimation value set and the second estimation value set satisfy preset conditions, determine the first included angle and the second included angle according to a result of the determination, and calculate, according to a preset trigonometric function, the second predicted position of the object to be detected from the first included angle and the second included angle.
CN201610826122.5A 2016-09-14 2016-09-14 Audio signal, image processing method, device and system Active CN107820037B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610826122.5A CN107820037B (en) 2016-09-14 2016-09-14 Audio signal, image processing method, device and system
PCT/CN2017/097397 WO2018049957A1 (en) 2016-09-14 2017-08-14 Audio signal, image processing method, device, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610826122.5A CN107820037B (en) 2016-09-14 2016-09-14 Audio signal, image processing method, device and system

Publications (2)

Publication Number Publication Date
CN107820037A CN107820037A (en) 2018-03-20
CN107820037B true CN107820037B (en) 2021-03-26

Family

ID=61600778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610826122.5A Active CN107820037B (en) 2016-09-14 2016-09-14 Audio signal, image processing method, device and system

Country Status (2)

Country Link
CN (1) CN107820037B (en)
WO (1) WO2018049957A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312295B (en) * 2018-12-12 2022-06-28 深圳市冠旭电子股份有限公司 Holographic sound recording method and device and recording equipment
CN109683135A (en) * 2018-12-28 2019-04-26 科大讯飞股份有限公司 A kind of sound localization method and device, target capturing system
CN109547735B (en) * 2019-01-18 2024-04-16 海南科先电子科技有限公司 Conference integration system
CN110660102B (en) * 2019-06-17 2020-10-27 腾讯科技(深圳)有限公司 Speaker recognition method, device and system based on artificial intelligence
CN110632582B (en) * 2019-09-25 2022-03-29 苏州科达科技股份有限公司 Sound source positioning method, device and storage medium
CN110730378A (en) * 2019-11-01 2020-01-24 联想(北京)有限公司 Information processing method and system
CN112868061A (en) * 2019-11-29 2021-05-28 深圳市大疆创新科技有限公司 Environment detection method, electronic device and computer-readable storage medium
CN112198498A (en) * 2020-09-11 2021-01-08 海创半导体科技(深圳)有限公司 Method for measuring distance by using intelligent voice module

Citations (7)

Publication number Priority date Publication date Assignee Title
CN1460185A (en) * 2001-03-30 2003-12-03 皇家菲利浦电子有限公司 Method and apparatus for audio-image speaker detection and location
CN101030323A (en) * 2007-04-23 2007-09-05 凌子龙 Automatic evidence collecting device on crossroad for vehicle horning against traffic regulation
CN101377885A (en) * 2007-08-28 2009-03-04 凌子龙 Electronic workstation for obtaining evidence of vehicle peccancy whistle and method thereof
CN102256098A (en) * 2010-05-18 2011-11-23 宝利通公司 Videoconferencing endpoint having multiple voice-tracking cameras
CN204539315U (en) * 2015-04-02 2015-08-05 尹煜敏 A kind of video conference machine of auditory localization
CN105607042A (en) * 2014-11-19 2016-05-25 北京航天长峰科技工业集团有限公司 Method for locating sound source through microphone array time delay estimation
CN105657329A (en) * 2016-02-26 2016-06-08 苏州科达科技股份有限公司 Video conference system, processing device and video conference method

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US7039199B2 (en) * 2002-08-26 2006-05-02 Microsoft Corporation System and process for locating a speaker using 360 degree sound source localization
US7924655B2 (en) * 2007-01-16 2011-04-12 Microsoft Corp. Energy-based sound source localization and gain normalization
CN101201399B (en) * 2007-12-18 2012-01-11 北京中星微电子有限公司 Sound localization method and system
CN101656908A (en) * 2008-08-19 2010-02-24 深圳华为通信技术有限公司 Method for controlling sound focusing, communication device and communication system
CN102843543B (en) * 2012-09-17 2015-01-21 华为技术有限公司 Video conferencing reminding method, device and video conferencing system
CN103841357A (en) * 2012-11-21 2014-06-04 中兴通讯股份有限公司 Microphone array sound source positioning method, device and system based on video tracking
US9451360B2 (en) * 2014-01-14 2016-09-20 Cisco Technology, Inc. Muting a sound source with an array of microphones
CN105588543B (en) * 2014-10-22 2019-10-18 中兴通讯股份有限公司 A kind of method, apparatus and positioning system for realizing positioning based on camera


Also Published As

Publication number Publication date
CN107820037A (en) 2018-03-20
WO2018049957A1 (en) 2018-03-22


Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20180426

Address after: No. 55, Nanshan District science and technology road, Nanshan District, Shenzhen, Guangdong

Applicant after: ZTE Corporation

Address before: 210012 No. 68 Bauhinia Road, Yuhuatai District, Jiangsu, Nanjing

Applicant before: Nanjing Zhongxing New Software Co., Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant