CN107820037A - Methods, devices and systems for audio signal processing and image processing

Info

Publication number: CN107820037A
Application number: CN201610826122.5A
Authority: CN
Inventors: 任志平
Original assignee: Nanjing ZTE New Software Co Ltd
Current assignee: ZTE Corp
Priority date: 2016-09-14
Filing date: 2016-09-14
Publication date: 2018-03-20
Anticipated expiration: 2036-09-14
Also published as: CN107820037B; WO2018049957A1

Abstract

The invention provides methods, devices and systems for audio signal processing and image processing. According to the invention, a first predicted position of an object to be detected is calculated with a first preset algorithm from audio signals collected by multiple microphones; a second predicted position of the object to be detected is calculated with a second preset algorithm after filtering the object's historical positions; and the first and second predicted positions are combined and corrected according to the temporal continuity of the audio signal to obtain the current position of the object to be detected. This solves the problem that, for lack of a speaker position-tracking technique, a network conference system cannot promptly show the speaker's position or track and acquire the speaker's multimedia information, and achieves the effect of promptly obtaining the speaker's position and tracking and acquiring the speaker's multimedia information.

Description

Methods, devices and systems for audio signal processing and image processing

Technical field

The present invention relates to the application field of speech recognition technology, and in particular to methods, devices and systems for audio signal processing and image processing.

Background technology

With the rapid development of video communication technology, remote video conferencing services are becoming increasingly widespread. When a remote video conference system is in use, how to locate a speaker from his or her voice and show the speaker through the equipment has become a problem to be solved.

In the related art, for lack of a speaker position-tracking technique, a network conference system cannot promptly show the speaker's position or track and acquire the speaker's multimedia information. No effective solution to this problem has yet been proposed.

Summary of the invention

The embodiments of the invention provide methods, devices and systems for audio signal processing and image processing, to at least solve the problem in the related art that, for lack of a speaker position-tracking technique, a network conference system cannot promptly show the speaker's position or track and acquire the speaker's multimedia information.

According to one embodiment of the invention, a method of audio signal processing is provided, including: calculating, with a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by multiple microphones; calculating, with a second preset algorithm, a second predicted position of the object to be detected after filtering its historical positions; and combining the first and second predicted positions and correcting them according to the temporal continuity of the audio signal to obtain the current position of the object to be detected.

Optionally, calculating the first predicted position of the object to be detected with the first preset algorithm from the audio signals collected by multiple microphones includes: dividing the multiple microphones into a first microphone array and a second microphone array; calculating, with the first preset algorithm, a first angle between the object to be detected and the first microphone array and a second angle between the object to be detected and the second microphone array; and calculating the first predicted position of the object to be detected from the first angle and the second angle using a preset trigonometric function.

Further, optionally, calculating the first angle between the object to be detected and the first microphone array with the first preset algorithm includes: when the first preset algorithm is the time difference of arrival (TDOA) algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the first microphone array; calculating a set of estimates of the first angle from the relationship between that Euclidean distance and the first angle; and calculating the mean of the set of estimates and taking the mean as the first angle.

Optionally, calculating the second angle between the object to be detected and the second microphone array with the first preset algorithm includes: when the first preset algorithm is the TDOA algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the second microphone array; calculating a set of estimates of the second angle from the relationship between that Euclidean distance and the second angle; and calculating the mean of the set of estimates and taking the mean as the second angle.

Optionally, calculating the second predicted position of the object to be detected with the second preset algorithm after filtering its historical positions includes: calculating, with the first preset algorithm, a first set of estimates of a first predicted angle for the first microphone array and a second set of estimates of a second predicted angle for the second microphone array; when the second preset algorithm is the Kalman filtering algorithm, judging with the Kalman filtering algorithm whether the first set of estimates and the second set of estimates each satisfy a preset condition; determining the first angle and the second angle according to the judgment result; and calculating the second predicted position of the object to be detected from the first angle and the second angle using the preset trigonometric function.

Further, optionally, after the current position of the object to be detected is obtained, the method also includes: updating the Kalman filter parameters according to the current position of the object to be detected.

Further, optionally, after the current position of the object to be detected is obtained, the method also includes: enhancing the voice output of the object to be detected.

According to another embodiment of the invention, a method of image processing is provided, including: obtaining, through preset microphone arrays, a first depth value between a first microphone array and the image capture device of a display device, and a second depth value between a second microphone array and the image capture device of the display device; calculating a first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and a second-type angle between the second microphone array and the image capture device corresponding to the second depth value; building a multi-dimensional spatial coordinate system from the first depth value, the second depth value, the first-type angle and the second-type angle; and obtaining the position of the object to be detected and determining its position in the multi-dimensional spatial coordinate system.

Optionally, calculating the first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and the second-type angle between the second microphone array and the image capture device corresponding to the second depth value, includes: calculating the first-type angle and the second-type angle from the first depth, the second depth and a preset condition on the actual distance.

According to still another embodiment of the invention, a device for audio signal processing is provided, including: a first calculation module, for calculating, with a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by multiple microphones; a second calculation module, for calculating, with a second preset algorithm, a second predicted position of the object to be detected after filtering its historical positions; and a correction module, for combining the first and second predicted positions and correcting them according to the temporal continuity of the audio signal to obtain the current position of the object to be detected.

According to a further embodiment of the invention, a device for image processing is provided, including: a module for obtaining, through preset microphone arrays, a first depth value between a first microphone array and the image capture device of a display device, and a second depth value between a second microphone array and the image capture device of the display device; a calculation module, for calculating a first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and a second-type angle between the second microphone array and the image capture device corresponding to the second depth value; a coordinate space module, for building a multi-dimensional spatial coordinate system from the first depth value, the second depth value, the first-type angle and the second-type angle; and an acquisition module, for obtaining the position of the object to be detected and determining its position in the multi-dimensional spatial coordinate system.

According to one embodiment of the invention, a system for voice and image processing is provided, including: a video conference terminal, an image capture device, a depth image capture device, a sound acquisition module composed of multiple microphone arrays, and a display device. The sound acquisition module composed of multiple microphone arrays collects the audio signal of the object to be detected. The image capture device collects all video images of the conference site. The depth image capture device collects depth images of the conference site, which are used to obtain position information between the participants and the depth image capture device. The video conference terminal tracks the participants' positions, shows the image of a participant while he or she is speaking, and records the meeting minutes.

According to still another embodiment of the invention, a storage medium is also provided. The storage medium is arranged to store program code for performing the following steps: calculating, with a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by multiple microphones; calculating, with a second preset algorithm, a second predicted position of the object to be detected after filtering its historical positions; and combining the first and second predicted positions and correcting them according to the temporal continuity of the audio signal to obtain the current position of the object to be detected.

Optionally, the storage medium is also arranged to store program code for performing the following steps, where calculating the first predicted position with the first preset algorithm from the audio signals collected by multiple microphones includes: dividing the multiple microphones into a first microphone array and a second microphone array; calculating, with the first preset algorithm, a first angle between the object to be detected and the first microphone array and a second angle between the object to be detected and the second microphone array; and calculating the first predicted position of the object to be detected from the first angle and the second angle using a preset trigonometric function.

Further, optionally, the storage medium is also arranged to store program code for performing the following steps, where calculating the first angle between the object to be detected and the first microphone array with the first preset algorithm includes: when the first preset algorithm is the TDOA algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the first microphone array; calculating a set of estimates of the first angle from the relationship between that Euclidean distance and the first angle; and calculating the mean of the set of estimates and taking the mean as the first angle.

Optionally, the storage medium is also arranged to store program code for performing the following steps, where calculating the second angle between the object to be detected and the second microphone array with the first preset algorithm includes: when the first preset algorithm is the TDOA algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the second microphone array; calculating a set of estimates of the second angle from the relationship between that Euclidean distance and the second angle; and calculating the mean of the set of estimates and taking the mean as the second angle.

Optionally, the storage medium is also arranged to store program code for performing the following steps, where calculating the second predicted position with the second preset algorithm after filtering the historical positions of the object to be detected includes: calculating, with the first preset algorithm, a first set of estimates of a first predicted angle for the first microphone array and a second set of estimates of a second predicted angle for the second microphone array; when the second preset algorithm is the Kalman filtering algorithm, judging with the Kalman filtering algorithm whether the first set of estimates and the second set of estimates each satisfy a preset condition; determining the first angle and the second angle according to the judgment result; and calculating the second predicted position of the object to be detected from the first angle and the second angle using the preset trigonometric function.

Further, optionally, the storage medium is also arranged to store program code for performing the following step: after the current position of the object to be detected is obtained, updating the Kalman filter parameters according to that position.

Further, optionally, the storage medium is also arranged to store program code for performing the following step: after the current position of the object to be detected is obtained, enhancing the voice output of the object to be detected.

With the invention, a first predicted position of the object to be detected is calculated with a first preset algorithm from audio signals collected by multiple microphones; a second predicted position is calculated with a second preset algorithm after filtering the object's historical positions; and the two predicted positions are combined and corrected according to the temporal continuity of the audio signal to obtain the object's current position. This solves the problem that, for lack of a speaker position-tracking technique, a network conference system cannot promptly show the speaker's position or track and acquire the speaker's multimedia information, and achieves the effect of promptly obtaining the speaker's position and tracking and acquiring the speaker's multimedia information.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form part of this application. The schematic embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow chart of the audio signal processing method according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the positional relationship between two microphone arrays and the speaker in the audio signal processing method according to an embodiment of the invention;

Fig. 3 is a schematic diagram of calculating the speaker's position relative to the microphone arrays in the audio signal processing method according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the TDOA algorithm in the audio signal processing method according to an embodiment of the invention;

Fig. 5 is a schematic diagram of TDOA localization combining multiple microphone pairs in the audio signal processing method according to an embodiment of the invention;

Fig. 6 is a flow chart of the image processing method according to an embodiment of the invention;

Fig. 7 is a system device layout diagram for the image processing method according to an embodiment of the invention;

Fig. 8 is a schematic diagram of measuring the distance to the TV with the microphone arrays in the image processing method according to an embodiment of the invention;

Fig. 9 is a schematic diagram of calculating, from depth information, the angle between the depth axis of the depth camera and the line through the microphone arrays in the image processing method according to an embodiment of the invention;

Fig. 10 is a structural diagram of the audio signal processing device according to an embodiment of the invention;

Fig. 11 is a structural diagram of the image processing device according to an embodiment of the invention;

Fig. 12 is a structural diagram of the audio signal and image processing system according to an embodiment of the invention;

Fig. 13 is a schematic diagram of a method for displaying the text corresponding to speech of interest.

Detailed description of the embodiments

The invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. Note that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another.

Note that the terms "first", "second" and so on in the description, the claims and the above drawings are used to distinguish similar objects, not to describe a specific order or sequence.

Technical terms used in the invention:

TDOA: time difference of arrival algorithm (Time Difference of Arrival).

Embodiment 1

Fig. 1 is a flow chart of the audio signal processing method according to an embodiment of the invention. As shown in Fig. 1, the flow includes the following steps:

Step S102: calculate, with a first preset algorithm, a first predicted position of an object to be detected from audio signals collected by multiple microphones;

Step S104: calculate, with a second preset algorithm, a second predicted position of the object to be detected after filtering its historical positions;

Step S106: combine the first and second predicted positions and correct them according to the temporal continuity of the audio signal to obtain the current position of the object to be detected.

With the above steps, the current position of the object to be detected is obtained from a first predicted position calculated from the microphone audio signals, a second predicted position calculated after filtering the historical positions, and a correction based on the temporal continuity of the audio signal. This solves the problem that, for lack of a speaker position-tracking technique, a network conference system cannot promptly show the speaker's position or track and acquire the speaker's multimedia information, and achieves the effect of promptly obtaining the speaker's position and tracking and acquiring the speaker's multimedia information.

The audio signal processing method provided by this embodiment is applicable to sound source tracking and localization. Sound source localization has strong application prospects and practical value. For example, it can be used to detect the speaker's position and automatically focus the video image on the speaker, so that listeners can observe the speaker better and even notice subtle facial expressions. This gives listeners a stronger sense of presence and helps them better understand and experience what the speaker wants to express.

Optionally, calculating the first predicted position of the object to be detected in step S102 with the first preset algorithm from the audio signals collected by multiple microphones includes:

Step1: divide the multiple microphones into a first microphone array and a second microphone array;

Step2: calculate, with the first preset algorithm, a first angle between the object to be detected and the first microphone array and a second angle between the object to be detected and the second microphone array;

Step3: calculate the first predicted position of the object to be detected from the first angle and the second angle using a preset trigonometric function.

Further, optionally, calculating the first angle between the object to be detected and the first microphone array in Step2 with the first preset algorithm includes:

When the first preset algorithm is the TDOA algorithm, calculate the Euclidean distance between the audio signals collected by the microphones in the first microphone array; calculate a set of estimates of the first angle from the relationship between that Euclidean distance and the first angle; and calculate the mean of the set of estimates and take the mean as the first angle.

Optionally, calculating the second angle between the object to be detected and the second microphone array in Step2 with the first preset algorithm includes: when the first preset algorithm is the TDOA algorithm, calculate the Euclidean distance between the audio signals collected by the microphones in the second microphone array; calculate a set of estimates of the second angle from the relationship between that Euclidean distance and the second angle; and calculate the mean of the set of estimates and take the mean as the second angle.

Optionally, calculating the second predicted position of the object to be detected in step S104 with the second preset algorithm after filtering its historical positions includes:

Calculate, with the first preset algorithm, a first set of estimates of a first predicted angle for the first microphone array and a second set of estimates of a second predicted angle for the second microphone array; when the second preset algorithm is the Kalman filtering algorithm, judge with the Kalman filtering algorithm whether the first set of estimates and the second set of estimates each satisfy a preset condition; determine the first angle and the second angle according to the judgment result; and calculate the second predicted position of the object to be detected from the first angle and the second angle using the preset trigonometric function.

Further, optionally, after the current position of the object to be detected is obtained in step S106, the audio signal processing method provided by this embodiment also includes: updating the Kalman filter parameters according to the current position of the object to be detected.

Further, optionally, after the current position of the object to be detected is obtained in step S106, the audio signal processing method provided by this embodiment also includes: enhancing the voice output of the object to be detected.

In summary, the audio signal processing method provided by this embodiment works as follows:

Fig. 2 is a schematic diagram of the positional relationship between two microphone arrays and the speaker in the audio signal processing method according to an embodiment of the invention. As shown in Fig. 2, suppose a conference site has two microphone arrays, MicA and MicB, each with four microphones. Each array can be regarded as a set of microphones: MicA = {MicA0, MicA1, MicA2, MicA3} and MicB = {MicB0, MicB1, MicB2, MicB3}. In a typical video conference, only one person at a given site speaks during any given period, so we first assume that at time $t$ exactly one person at the site is speaking. The speaker's position relative to MicA and MicB is as shown in Fig. 2.

In this case the angles between the speaker and MicA and between the speaker and MicB are nonzero. Let the angle between the speaker and MicA be $\theta_0$ and the angle between the speaker and MicB be $\theta_1$. Because the distance between MicA and MicB is known, the speaker's position follows readily from triangle relations, as shown in Fig. 3, which is a schematic diagram of calculating the speaker's position relative to the microphone arrays in the audio signal processing method according to an embodiment of the invention.

The angles $\theta_0$ and $\theta_1$ between the speaker and MicA/MicB can be obtained with a time-delay estimation algorithm such as TDOA, as shown in Fig. 4, which is a schematic diagram of the TDOA algorithm in the audio signal processing method according to an embodiment of the invention.

Assume that the propagation speed of sound is fixed at $\gamma$, that the angle between the sound source and the line through MicA0 and MicA1 is $\theta_0$ (the line between MicA0 and MicA1 is parallel to the line between MicA and MicB), and that the spacing between MicA0 and MicA1 is $l_0$. Because the sound source is at different distances from MicA0 and MicA1, the sound reaches MicA1 and MicA0 at different times. The time difference $\Delta t$ is:

$$\Delta t = \frac{l_0 \cos\theta_0}{\gamma}$$
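As a rough numerical illustration of this relation, the values below are assumed for the example and do not come from the source: a microphone spacing of 0.1 m, a speed of sound of 343 m/s, a source angle of 60° and a sample rate $S$ of 16 kHz.

```latex
\Delta t = \frac{l_0 \cos\theta_0}{\gamma}
         = \frac{0.1 \times \cos 60^\circ}{343}
         \approx 1.46 \times 10^{-4}\ \text{s},
\qquad
S\,\Delta t \approx 16000 \times 1.46 \times 10^{-4} \approx 2.33\ \text{samples},
\qquad
\frac{S\,l_0}{\gamma} = \frac{16000 \times 0.1}{343} \approx 4.66\ \text{samples}.
```

Under these assumed values the delay spans only a few samples, so an integer-sample offset search resolves the angle fairly coarsely.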

At microphones MicA0 and MicA1, this difference appears as a time delay between the speech sequence sampled by MicA0 and the one sampled by MicA1. Assume that MicA0 and MicA1 both sample at rate $S$; the delay between them cannot exceed $l_0/\gamma$. Under this constraint, let $X = \{x_0, x_1, x_2, \ldots, x_n\}$ be the speech sequence sampled by MicA0 and $Y = \{y_0, y_1, y_2, \ldots, y_n\}$ the one sampled by MicA1. Shifting $X$ by an offset $\mu \in [-S l_0/\gamma,\ S l_0/\gamma]$ gives $X' = \{x_{0+\mu}, x_{1+\mu}, x_{2+\mu}, \ldots, x_{n+\mu}\}$, and the Euclidean distance between $X'$ and $Y$ is:

$$\delta(\mu) = \sqrt{\sum_{k=0}^{n} \left(x_{k+\mu} - y_k\right)^2}$$

Over $\mu \in [-S l_0/\gamma,\ S l_0/\gamma]$, $\delta$ has a minimum $\delta_{\min}$, attained at the offset $\mu|_{\delta_{\min}}$. This offset corresponds to the time difference $\Delta t = \mu|_{\delta_{\min}} / S$, so it gives the angle $\theta_0$ between the speaker and the line through MicA0 and MicA1:

$$\theta_0 = \arccos\left(\frac{\gamma \, \mu|_{\delta_{\min}}}{S \, l_0}\right)$$
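The offset search and the conversion to an angle can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the fixed comparison window (used so that every offset is scored over the same number of samples), the sign convention of the offset and the default speed of sound are all assumptions.

```python
import math


def estimate_delay(x, y, max_shift):
    """Return the integer offset mu in [-max_shift, max_shift] that minimises
    the Euclidean distance between x shifted by mu and y (hypothetical helper)."""
    n = len(x)
    if len(y) != n or n <= 2 * max_shift:
        raise ValueError("need equal-length sequences longer than 2 * max_shift")
    best_mu, best_dist = 0, math.inf
    for mu in range(-max_shift, max_shift + 1):
        # Assumption: score every offset over the same window of samples,
        # so that the distances of different offsets are comparable.
        dist = math.sqrt(sum((x[k + mu] - y[k]) ** 2
                             for k in range(max_shift, n - max_shift)))
        if dist < best_dist:
            best_mu, best_dist = mu, dist
    return best_mu


def angle_from_delay(mu, sample_rate, spacing, speed=343.0):
    """Convert a sample offset into an angle in degrees using
    cos(theta) = speed * mu / (sample_rate * spacing)."""
    cos_theta = speed * mu / (sample_rate * spacing)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))
```

A caller would typically set `max_shift` to the integer part of `sample_rate * spacing / speed`, which matches the constraint on the offset range stated above.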

MicA has four microphones {MicA0, MicA1, MicA2, MicA3}, which form 6 microphone pairs: {MicA0, MicA1}, {MicA1, MicA2}, {MicA2, MicA3}, {MicA0, MicA2}, {MicA1, MicA3} and {MicA0, MicA3}, as shown in Fig. 4. The 6 pairs give a set of estimates of the speaker direction, $\{\theta_{0,0}, \theta_{0,1}, \theta_{0,2}, \theta_{0,3}, \theta_{0,4}, \theta_{0,5}\}$, and their mean $\bar{\theta}_0 = \frac{1}{6}\sum_{j=0}^{5}\theta_{0,j}$ is taken as the prediction result for the speaker direction. Experimental verification shows that a deviation of 5° is allowed in this prediction result. Fig. 5 is a schematic diagram of TDOA localization combining multiple microphone pairs in the audio signal processing method according to an embodiment of the invention.
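The pair enumeration and averaging can be sketched as below. The function names are hypothetical. A plain arithmetic mean is assumed to be adequate because the angles produced by an arccosine lie between 0° and 180° and do not wrap around.

```python
from itertools import combinations
from statistics import fmean


def microphone_pairs(mic_names):
    """Enumerate every unordered pair of microphones in one array."""
    return list(combinations(mic_names, 2))


def mean_direction(pair_angles_deg):
    """Average the per-pair direction estimates (degrees) into one prediction."""
    return fmean(pair_angles_deg)


# Four microphones yield C(4, 2) = 6 pairs.
pairs = microphone_pairs(["MicA0", "MicA1", "MicA2", "MicA3"])
```

In a full pipeline each pair would contribute one TDOA angle estimate, and `mean_direction` would be applied to the resulting six values.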

Applying the same algorithm to MicB's four microphones {MicB0, MicB1, MicB2, MicB3} gives the prediction result for $\theta_1$. As Fig. 3 shows, the prediction result for the speaker's position then follows from a simple trigonometric calculation.
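One way to carry out this trigonometric calculation is sketched below. Since Fig. 3 is not reproduced here, the coordinate convention is an assumption: MicA sits at the origin, MicB at `(baseline, 0)`, and both angles are measured from the direction pointing from MicA towards MicB. The function name is hypothetical.

```python
import math


def triangulate(theta0_deg, theta1_deg, baseline):
    """Locate the speaker from the two array angles and the MicA-MicB distance.

    Assumed convention: MicA at (0, 0), MicB at (baseline, 0), both angles
    measured from the +x axis (the MicA -> MicB direction).
    """
    t0 = math.radians(theta0_deg)
    t1 = math.radians(theta1_deg)
    denom = math.sin(t1 - t0)
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; position is undefined")
    # Law of sines in the triangle MicA, MicB, speaker gives the range from MicA.
    r = baseline * math.sin(t1) / denom
    return r * math.cos(t0), r * math.sin(t0)
```

For example, with the arrays 2 m apart and angles of 45° and 135°, the speaker lies midway between the arrays and 1 m in front of the baseline.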

With the above algorithm, a series of prediction results for the speaker's position is obtained over a period of time. Because of interference such as noise, however, these prediction results are not accurate enough. We therefore designed speaker position tracking based on Kalman prediction and use it as a constraint on the prediction of the speaker's direction angle, which improves the accuracy of TDOA localization combining multiple microphone pairs.

Step 1: Predict the speaker's position at the current time with the Kalman filter, and convert it into two predicted angles, one per microphone array, relative to the line between MicA and MicB.

Step 2: For each microphone array, calculate the time delay of each of its microphone pairs with the TDOA algorithm and convert it into an angle relative to the line between MicA and MicB, giving a set of estimates of the speaker direction: $\{\theta_{i,0}, \theta_{i,1}, \theta_{i,2}, \theta_{i,3}, \theta_{i,4}, \theta_{i,5}\}$;

Step 3: If [condition formula missing], the Kalman prediction result is considered to deviate too much and must be discarded, and the mean of the estimates from Step 2 is used directly as the prediction result for the speaker direction at the current time. Otherwise, the Kalman prediction result is considered acceptable, and the estimates satisfying [condition formula missing] are excluded, leaving $U_\theta = \{\theta'_{i,0}, \theta'_{i,1}, \ldots, \theta'_{i,n-1}\}$, $1 \le n \le 6$, $1 \le j \le n$. The mean of the remaining estimates is then taken as the prediction result for the speaker direction at the current time.

Step 4: Perform Step 2 and Step 3 for both microphone arrays to obtain the prediction results for the speaker direction at the current time, obtain the speaker's position by a simple trigonometric calculation, and update the Kalman filter parameters.
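The accept-or-discard logic of Step 3 might look like the sketch below. The exact conditions are not recoverable from this text, so the rule used here is an assumption: a fixed angular gate is applied both to the Kalman prediction and to the individual estimates. The 5° default borrows the tolerance mentioned earlier but is likewise an assumption, as are the function and parameter names. The Kalman-predicted angle is taken as an input, and the filter itself is not shown.

```python
from statistics import fmean


def fuse_direction(kalman_angle_deg, estimates_deg, gate_deg=5.0):
    """Combine a Kalman-predicted angle with per-pair TDOA estimates.

    Assumed rule (not from the source): if the Kalman prediction is further
    than gate_deg from the mean of all estimates, discard it and return that
    mean. Otherwise drop estimates further than gate_deg from the Kalman
    prediction and return the mean of the rest.
    """
    mean_all = fmean(estimates_deg)
    if abs(kalman_angle_deg - mean_all) > gate_deg:
        return mean_all  # Kalman prediction rejected
    kept = [a for a in estimates_deg if abs(a - kalman_angle_deg) <= gate_deg]
    # Fall back to the plain mean if the gate removes every estimate.
    return fmean(kept) if kept else mean_all
```

With a Kalman prediction of 60° and estimates of 58°, 59°, 61°, 62°, 60° and 90°, the outlier at 90° is dropped and the fused direction is 60°. With a Kalman prediction of 10° against estimates clustered around 60°, the prediction is rejected and the plain mean is returned.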

Embodiment 2

Fig. 6 is a flow chart of the image processing method according to an embodiment of the invention. As shown in Fig. 6, the flow includes the following steps:

Step S602: obtain, through preset microphone arrays, a first depth value between a first microphone array and the image capture device of a display device, and a second depth value between a second microphone array and the image capture device of the display device;

Step S604: calculate a first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and a second-type angle between the second microphone array and the image capture device corresponding to the second depth value;

Step S606: build a multi-dimensional spatial coordinate system from the first depth value, the second depth value, the first-type angle and the second-type angle;

Step S608: obtain the position of the object to be detected and determine its position in the multi-dimensional spatial coordinate system.

Through the above steps, the first depth value between the first microphone array and the image capture device of the display device and the second depth value between the second microphone array and that image capture device are obtained by means of preset microphone arrays; the first-type angle and the second-type angle are calculated for the respective depth values; a multi-dimensional spatial coordinate system is constructed from the two depth values and the two angles; and the position of the object to be detected is obtained and located in that coordinate system. This solves the problem that, for lack of a technique for tracking the speaker's position, a video conference system cannot show the speaker's position in time or track and acquire the speaker's multimedia information, and achieves the effect of obtaining the speaker's position in time and tracking and acquiring the speaker's multimedia information.

Optionally, calculating the first-type angle and the second-type angle in step S604 includes: calculating the first-type angle and the second-type angle according to a preset condition relating the first depth and the second depth to the actual distance.

In summary, the image processing method provided by the embodiments of the present application is as follows:

The system requires that the relative positions of the microphone arrays, the depth camera, the image camera, and the TV be fixed, as shown in Fig. 7. Fig. 7 is a device layout diagram of the system for the image processing method according to an embodiment of the present invention.

The spacing between MicA and MicB is known in the system and is generally 2 to 3 meters. The width of the television can be measured, and the television is kept level with the line between MicA and MicB. The distance between the TV and the line between MicA and MicB is unknown and depends on how the equipment is placed for the size of the meeting room. When the system is installed for the first time, the video conference device has the television play a prerecorded segment of speech, and the position (direction and distance) of the TV relative to MicA and MicB is estimated with the joint multi-capsule TDOA algorithm described above, as shown in Fig. 8. Fig. 8 is a schematic diagram of measuring the TV distance with the microphone arrays in the image processing method according to an embodiment of the present invention.

Because the microphone arrays have a distinctive shape and color, they can be recognized in the picture of the image camera, and the corresponding depth information can then be read from the depth camera. Assuming that the depth of MicA is Depth0 and the depth of MicB is Depth1, the angles of the camera relative to MicA and MicB can be calculated using trigonometric functions.

Fig. 9 is a schematic diagram of calculating, from depth information, the angle between the depth axis of the depth camera and the line of the microphone arrays in the image processing method according to an embodiment of the present invention. As shown in Fig. 9, microphone-array localization gives a direction and position relative to the microphone arrays, whereas depth-camera localization gives a direction and position relative to the depth camera. To locate the speaker accurately using both kinds of information, the system must therefore convert between coordinate systems. The microphone arrays used in the system can only locate a position in two-dimensional space, which corresponds to the left-right and depth axes of the depth camera, i.e. the x-axis and the z-axis. Assume that, in the two-dimensional space of the microphone arrays, the coordinate of MicA is (0, 0) and the coordinate of MicB is (length, 0), where length is the spacing of the microphone arrays. After microphone-array localization, the coordinate of the depth camera (which is at the same position as the TV) is (x, y), its direction relative to MicA is θ₀, and its direction relative to MicB is θ₁. In the depth camera, the depth of MicA is depth0 and the depth of MicB is depth1. Although the depth information is not equal to the actual distance, the two satisfy:

y = f(depth)

where y₀ and y₁ are the actual distances of MicA and MicB from the depth camera along the depth direction. From trigonometric relations it follows that:

That is:

From triangle geometry, θ₂ and θ₃ also satisfy:

θ₂+θ₃=θ₀+θ₁

Finally it can be calculated：

θ₂=(θ₀+θ₁)-θ₃

Note that only the case in which θ₂, θ₃, θ₀, and θ₁ are all acute angles is analyzed here; the other cases are similar. With the above method, a two-dimensional position located by the microphone arrays, expressed relative to the microphone arrays, can be converted into a position on the left-right and depth axes of the three-dimensional space of the depth camera.
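By way of a hedged illustration, the conversion can also be sketched with vector projections rather than the angles θ₂ and θ₃, whose defining equations are not reproduced above. The sketch assumes that f is the identity (depths already in meters), that the camera position in microphone-array coordinates is known from the installation measurement, and a particular handedness for the camera's left-right axis. The function name `mic_to_camera` and its parameters are hypothetical.

```python
import math


def mic_to_camera(point, cam_pos, y0, y1, length):
    """Map a 2-D point from microphone-array coordinates to camera (x, z).

    point   : (u, v) in the array frame, MicA at (0, 0), MicB at (length, 0)
    cam_pos : camera position (x, y) in the same array frame
    y0, y1  : actual depth-direction distances of MicA and MicB (assumed f = id)
    """
    # Projecting MicB - MicA = (length, 0) on the depth axis gives y1 - y0,
    # so the u-component of the unit depth axis is (y1 - y0) / length.
    zu = (y1 - y0) / length
    if abs(zu) > 1.0:
        raise ValueError("depth difference exceeds microphone spacing")
    zv_mag = math.sqrt(1.0 - zu * zu)
    cx, cy = cam_pos
    # Pick the sign of the v-component so that MicA's depth matches y0.
    zv = min((zv_mag, -zv_mag), key=lambda s: abs(-(cx * zu + cy * s) - y0))
    # Left-right axis: perpendicular to the depth axis (handedness assumed).
    xu, xv = -zv, zu
    du, dv = point[0] - cx, point[1] - cy
    return (du * xu + dv * xv, du * zu + dv * zv)
```

When the camera angle changes, only `y0` and `y1` change, so the same function can be re-evaluated with the newly measured depths.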

During use, the user may change the camera angle (the positions of the microphone arrays are fixed). Once the camera angle changes, the system must be able to update its parameters automatically and repeat the above computation, so that a two-dimensional position from the microphone arrays is again converted into a position on the left-right and depth axes of the three-dimensional space of the depth camera. Because the user may change the camera angle during a conference, the preset recording cannot be played at that point. Instead, the system can use the far-end single-talk decision of the echo cancellation algorithm to find periods in which only the television is playing sound and no local speaker is talking, which ensures that the above computation is not disturbed and that its result is sufficiently accurate. With the above method, the speaker position estimated by the microphone arrays can be converted into a position in the depth/image camera, and algorithms such as skin-color detection or face detection can then be used to obtain the position of the speaker in the depth/image camera.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

Embodiment 3

This embodiment also provides an audio signal processing apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 10 is a structural diagram of the audio signal processing apparatus according to an embodiment of the present invention. As shown in Fig. 10, the apparatus includes:

a first computing module 1002, configured to perform a calculation according to a first preset algorithm on the audio signals collected by a plurality of microphones to obtain a first predicted position of an object to be detected; a second computing module 1004, configured to filter the historical positions of the object to be detected according to a second preset algorithm and then calculate a second predicted position of the object to be detected; and a correction module 1006, configured to combine the first predicted position and the second predicted position and correct them according to the continuity of the audio signal in time to obtain the current position of the object to be detected.

In the audio signal processing apparatus provided by this embodiment of the present invention, a first predicted position of the object to be detected is calculated according to the first preset algorithm from the audio signals collected by the plurality of microphones; a second predicted position is calculated after filtering the historical positions of the object according to the second preset algorithm; and the two predicted positions are combined and corrected according to the continuity of the audio signal in time to obtain the current position of the object. This solves the problem that, for lack of a technique for tracking the speaker's position, a video conference system cannot show the speaker's position in time or track and acquire the speaker's multimedia information, and achieves the effect of obtaining the speaker's position in time and tracking and acquiring the speaker's multimedia information.

Embodiment 4

Fig. 11 is a structural diagram of the image processing apparatus according to an embodiment of the present invention. As shown in Fig. 11, the apparatus includes:

an acquisition module 1102, configured to obtain, by means of preset microphone arrays, a first depth value between the first microphone array and the image capture device of the display device, and a second depth value between the second microphone array and the image capture device of the display device; a computing module 1104, configured to calculate a first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and a second-type angle between the second microphone array and the image capture device corresponding to the second depth value; a coordinate space module 1106, configured to construct a multi-dimensional spatial coordinate system from the first depth value, the second depth value, the first-type angle, and the second-type angle; and an acquisition module 1108, configured to obtain the position of the object to be detected and determine, according to the multi-dimensional spatial coordinate system, the position of the object to be detected in that coordinate system.

In the image processing apparatus provided by this embodiment of the present invention, a multi-dimensional spatial coordinate system is constructed from the first depth value, the second depth value, the first-type angle, and the second-type angle. This solves the problem that, for lack of a technique for tracking the speaker's position, a video conference system cannot show the speaker's position in time or track and acquire the speaker's multimedia information, and achieves the effect of obtaining the speaker's position in time and tracking and acquiring the speaker's multimedia information.

Embodiment 5

According to an embodiment of the present invention, a system for audio signal and image processing is provided, including: a video conference terminal, an image capture device, a depth image capture device, a sound collection module composed of a plurality of microphone arrays, and a display device. The sound collection module composed of the plurality of microphone arrays is configured to collect the audio signal of the object to be detected; the image capture device is configured to capture all video images of the meeting place; the depth image capture device is configured to capture depth images of the meeting place, the depth images being used to obtain position information between participants and the depth image capture device; and the video conference terminal is configured to track the positions of participants, display the image of a participant while speaking, and record meeting minutes.

In summary, with reference to Embodiments 1 to 5, the methods, apparatuses, and system for audio signal and image processing provided by the embodiments of the present application are as follows:

First, the system estimates and tracks the speaker position in real time with the joint multi-capsule TDOA algorithm, while also predicting and tracking the speaker position with Kalman filtering, and performs self-correction according to the continuity of the speech signal in time to obtain an accurate estimate of the speaker position.

In addition, a depth camera is placed at a fixed position in the system. The depth information of each participant in the room is obtained through the depth camera and used as a constraint to adjust the microphone arrays' estimate of the speaker position.

Next, the system feeds the obtained speaker position information back to the system's image camera, which captures an image of the speaker.

Finally, the speaker's speech is recognized or enhanced according to the above information, and the result is presented to the user, for example as dynamic subtitles or as meeting minutes with the speaker's image.

The hardware includes: a video conference terminal, an image camera, a depth camera, two microphone arrays A and B, and a TV.

During a video conference, the method and system can automatically locate and track a specific speaker of interest according to the user's selection, enhance that speaker's voice, and process the audio signal, so as to present dynamic subtitles or meeting minutes to the user. The scheme is simple and fast, and its localization and tracking are more accurate and timely.

The processing, enhancement, and display of the audio signal of interest are as follows:

The method described above can already obtain the speaker position calculated by the microphone arrays and, by combining the image and depth cameras to obtain the relative position of the microphone arrays, finally associate the speaker with the depth/image camera and determine their positional relationship. A user can mark a speaker's voice as the voice of interest while that speaker is talking, in order to extract that speaker's voice. A user can also select a participant in the system's video image and then mark that participant's voice as the voice of interest, in order to extract it. In addition, a beamforming algorithm can be used to enhance the speech from the direction of the voice of interest and suppress the speech from other directions. A face detection algorithm can also be used to obtain the speaker's head image which, combined with the audio signal processing algorithm, allows the head image and speech content of each speaker to be shown to the user during the meeting.

Fig. 12 is a structural diagram of the system for audio signal and image processing according to an embodiment of the present invention. As shown in Fig. 12, in the method for processing the sound signal of interest, the user selects a participant in the system's video image, and that participant is taken as the speaker of the voice of interest. The steps are as follows:

Step 1: While the system is running, detect in real time whether anyone is speaking locally. If someone is speaking, estimate the speaker position with the microphone arrays and convert it into a position on the left-right and depth axes of the three-dimensional space of the depth camera;

Step 2: A local or remote participant selects, by mouse or touch, the region of the video image containing the speaker of interest; the person in that region is taken as the speaker of interest;

Step 3: The system determines the facial features of the speaker of interest, tracks the speaker of interest with a face tracking algorithm, updates the position of the speaker of the voice of interest in real time, and converts it into a speaker position as estimated by the microphone arrays;

Step 4: Using a beamforming algorithm, enhance the speech from the direction of the voice of interest and suppress the speech from the other directions.
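Step 4 does not name a particular beamformer. As one hedged possibility, a delay-and-sum beamformer with integer sample delays can be sketched as follows. The function name `delay_and_sum` and the assumption that the per-channel arrival delays for the direction of interest are already known (for example from the array geometry and the tracked speaker position) are illustrative only.

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming with integer sample delays.

    channels : list of equal-length lists of samples, one per microphone
    delays   : per-channel arrival delay (in samples, >= 0) of the wavefront
               coming from the direction of interest
    Each channel is advanced by its delay so that sound from the chosen
    direction lines up across channels and adds coherently, while sound from
    other directions is misaligned and partly averaged out.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for t in range(n - d):
            out[t] += signal[t + d]
    return [v / len(channels) for v in out]
```

A pulse reaching the second microphone two samples later than the first is restored to full amplitude with `delays=[0, 2]`, and is smeared to half amplitude when steered to the wrong direction with `delays=[0, 0]`.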

The method for displaying the voice of interest is as follows:

After audio signal processing of the voice of interest, the content of the speaker's speech can be obtained. If the user needs to use the processing and enhancement method for the audio signal of interest, the speaker of interest is identified by applying face detection and tracking algorithms directly in the selected region to obtain the face-region image. The speech of the speaker of interest can then be recognized by the audio signal processing and enhancement method above, so that the system obtains what a given speaker of interest said during a given period (as text) together with that speaker's face image. Using this information, the system can finally present the user with illustrated meeting minutes, or real-time subtitles, that are easy to view and review.

Of course, the above records the speech content of a specific speaker. If the speech content of everyone during the whole conference is to be recorded, the recording flow is different. First, during the meeting, the system performs real-time face recognition on the images captured by the image camera to determine the facial features of all participants in the field of view; detection is performed in real time to cope with participants temporarily leaving or joining during the meeting. Next, when participants speak, the positions of all participants relative to the microphone arrays are determined by the above method, the speech of each speaker is enhanced (there may be several speakers), and the speech is recognized and stored as text. Combined with the speakers' head images extracted from the image camera, this produces real-time subtitles or complete meeting minutes. The minutes are stored in time order and also support editing operations such as filtering and screening. Fig. 13 is a schematic diagram of the method for displaying the text corresponding to the voice of interest.
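As a non-limiting sketch, time-ordered minutes with per-speaker filtering could be held in a structure such as the following. The class and field names (`MinutesEntry`, `Minutes`, `start_s`, `speaker_id`, `face_image`) are hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class MinutesEntry:
    start_s: float                                  # utterance start time, seconds
    speaker_id: str = field(compare=False)          # label of the speaker
    text: str = field(compare=False)                # recognized speech as text
    face_image: Optional[str] = field(default=None, compare=False)  # image path


class Minutes:
    """Meeting minutes kept sorted by utterance start time."""

    def __init__(self) -> None:
        self._entries: List[MinutesEntry] = []

    def add(self, entry: MinutesEntry) -> None:
        self._entries.append(entry)
        self._entries.sort()  # ordering compares start_s only

    def all(self) -> List[MinutesEntry]:
        return list(self._entries)

    def by_speaker(self, speaker_id: str) -> List[MinutesEntry]:
        # Filtering for a single speaker of interest.
        return [e for e in self._entries if e.speaker_id == speaker_id]
```

Recording everyone corresponds to calling `add` for every recognized utterance, while the speaker-of-interest case corresponds to reading back `by_speaker` for the selected participant.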

Embodiment 6

An embodiment of the present invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps:

S1: perform a calculation according to a first preset algorithm on the audio signals collected by a plurality of microphones to obtain a first predicted position of an object to be detected;

S2: filter the historical positions of the object to be detected according to a second preset algorithm and then calculate a second predicted position of the object to be detected;

S3: combine the first predicted position and the second predicted position and correct them according to the continuity of the audio signal in time to obtain the current position of the object to be detected.

Optionally, the storage medium is further configured to store program code for performing the following steps:

S1: performing a calculation according to the first preset algorithm on the audio signals collected by the plurality of microphones to obtain the first predicted position of the object to be detected includes: classifying the plurality of microphones into a first microphone array and a second microphone array; calculating, according to the first preset algorithm, a first angle between the object to be detected and the first microphone array and a second angle between the object to be detected and the second microphone array; and calculating the first predicted position of the object to be detected from the first angle and the second angle according to a preset trigonometric function.
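The preset trigonometric function is not spelled out at this point. As a hedged sketch, if the first and second angles are taken to be the interior angles at the two arrays, measured from the line joining them, the position follows from the law of sines. The function name `triangulate`, the placement of the first array at (0, 0) and the second at (length, 0), and that angle convention are assumptions for illustration.

```python
import math


def triangulate(theta_a_deg, theta_b_deg, length):
    """Locate a source from two bearing angles.

    theta_a_deg : interior angle at the first array, between the line to the
                  second array and the ray to the source (assumed convention)
    theta_b_deg : interior angle at the second array, defined symmetrically
    length      : spacing between the two arrays
    Returns (x, y) with the first array at (0, 0), the second at (length, 0).
    """
    a = math.radians(theta_a_deg)
    b = math.radians(theta_b_deg)
    s = math.sin(a + b)
    if abs(s) < 1e-9:
        raise ValueError("rays are parallel; the source cannot be located")
    # Law of sines: the side from the first array to the source is opposite
    # the angle at the second array.
    r = length * math.sin(b) / s
    return (r * math.cos(a), r * math.sin(a))
```

For a 2-meter spacing and two 45-degree angles, the source lies 1 meter in front of the midpoint between the arrays.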

Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.

Further, optionally, the storage medium is also configured to store program code for performing the following steps: calculating the first angle between the object to be detected and the first microphone array according to the first preset algorithm includes: when the first preset algorithm is the time difference of arrival (TDOA) algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the first microphone array; performing a calculation according to the relation between that Euclidean distance and the first angle to obtain a set of estimates of the first angle; and calculating the mean of the set of estimates of the first angle and taking the mean as the first angle.

Optionally, the storage medium is also configured to store program code for performing the following steps: calculating the second angle between the object to be detected and the second microphone array according to the first preset algorithm includes: when the first preset algorithm is the time difference of arrival (TDOA) algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the second microphone array; performing a calculation according to the relation between that Euclidean distance and the second angle to obtain a set of estimates of the second angle; and calculating the mean of the set of estimates of the second angle and taking the mean as the second angle.
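The relation between the distance and the angle is not given explicitly here. One common reading, offered only as an assumption, is the far-field model in which the path-length difference between two microphones equals the speed of sound times their arrival-time difference, and its ratio to the microphone spacing is the cosine of the arrival angle. The names `delay_to_angle`, `mean_angle`, and the value of `SPEED_OF_SOUND` are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at about 20 degrees Celsius (assumed value)


def delay_to_angle(delay_s, spacing_m, c=SPEED_OF_SOUND):
    """Far-field arrival angle, in degrees, for one microphone pair.

    delay_s   : arrival-time difference between the two microphones, seconds
    spacing_m : distance between the two microphones, meters
    The angle is measured from the axis through the pair.
    """
    ratio = c * delay_s / spacing_m          # path difference / spacing
    ratio = max(-1.0, min(1.0, ratio))       # guard against noisy delays
    return math.degrees(math.acos(ratio))


def mean_angle(estimates_deg):
    """Average a set of angle estimates that share one reference line."""
    if not estimates_deg:
        raise ValueError("empty estimate set")
    return sum(estimates_deg) / len(estimates_deg)
```

Here a zero delay corresponds to a source broadside to the pair (90 degrees), and a delay equal to spacing divided by the speed of sound corresponds to a source on the pair's axis (0 degrees).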

Optionally, the storage medium is also configured to store program code for performing the following steps: filtering the historical positions of the object to be detected according to the second preset algorithm and then calculating the second predicted position of the object to be detected includes: calculating, by the first preset algorithm, a first set of estimates of a first predicted angle of the first microphone array and a second set of estimates of a second predicted angle of the second microphone array; when the second preset algorithm is the Kalman filtering algorithm, judging by the Kalman filtering algorithm whether the first set of estimates and the second set of estimates each meet a preset condition; determining the first angle and the second angle according to the judgment result; and calculating the second predicted position of the object to be detected from the first angle and the second angle according to a preset trigonometric function.

Further, optionally, the storage medium is also configured to store program code for performing the following step: updating the Kalman filter parameters according to the current position of the object to be detected.

Further, optionally, the storage medium is also configured to store program code for performing the following step: after the current position of the object to be detected is obtained, enhancing the voice output of the object to be detected.

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
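The state model of the Kalman filter is not specified in this passage. Purely as an assumption-laden sketch, a scalar filter with a random-walk model for one tracked quantity (for example one coordinate or one angle) shows what updating the filter parameters from the newly obtained position involves. The class name `ScalarKalman` and the parameters `p0`, `q`, and `r` are hypothetical.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model (assumed)."""

    def __init__(self, x0, p0=1.0, q=0.01, r=1.0):
        self.x = x0   # state estimate
        self.p = p0   # estimate variance
        self.q = q    # process-noise variance
        self.r = r    # measurement-noise variance

    def predict(self):
        # Random walk: the state is expected to stay put, uncertainty grows.
        self.p += self.q
        return self.x

    def update(self, z):
        # Blend the prediction with the measurement z using the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1.0 - k)
        return self.x
```

In each frame, `predict` supplies the filtered prediction from the history, and `update` is called with the position finally accepted for that frame.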

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from that given here, or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method of audio signal processing, characterized by comprising:

performing a calculation according to a first preset algorithm on the audio signals collected by a plurality of microphones to obtain a first predicted position of an object to be detected;

filtering the historical positions of the object to be detected according to a second preset algorithm and then calculating a second predicted position of the object to be detected;

combining the first predicted position and the second predicted position and correcting them according to the continuity of the audio signal in time to obtain the current position of the object to be detected.

2. The method according to claim 1, characterized in that performing a calculation according to the first preset algorithm on the audio signals collected by the plurality of microphones to obtain the first predicted position of the object to be detected comprises:

classifying the plurality of microphones into a first microphone array and a second microphone array;

calculating, according to the first preset algorithm, a first angle between the object to be detected and the first microphone array, and calculating, according to the first preset algorithm, a second angle between the object to be detected and the second microphone array;

calculating the first predicted position of the object to be detected from the first angle and the second angle according to a preset trigonometric function.

3. The method according to claim 2, characterized in that calculating the first angle between the object to be detected and the first microphone array according to the first preset algorithm comprises:

when the first preset algorithm is the time difference of arrival (TDOA) algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the first microphone array;

performing a calculation according to the relation between the Euclidean distance between the audio signals collected by the microphones and the first angle to obtain a set of estimates of the first angle;

calculating the mean of the set of estimates of the first angle, and taking the mean as the first angle.

4. The method according to claim 2, characterized in that calculating the second angle between the object to be detected and the second microphone array according to the first preset algorithm comprises:

when the first preset algorithm is the time difference of arrival (TDOA) algorithm, calculating the Euclidean distance between the audio signals collected by the microphones in the second microphone array;

performing a calculation according to the relation between the Euclidean distance between the audio signals collected by the microphones and the second angle to obtain a set of estimates of the second angle;

calculating the mean of the set of estimates of the second angle, and taking the mean as the second angle.

5. The method according to claim 2, characterized in that filtering the historical positions of the object to be detected according to the second preset algorithm and then calculating the second predicted position of the object to be detected comprises:

calculating, by the first preset algorithm, a first set of estimates of a first predicted angle of the first microphone array and a second set of estimates of a second predicted angle of the second microphone array;

when the second preset algorithm is the Kalman filtering algorithm, judging by the Kalman filtering algorithm whether the first set of estimates and the second set of estimates each meet a preset condition;

determining the first angle and the second angle according to the judgment result;

calculating the second predicted position of the object to be detected from the first angle and the second angle according to a preset trigonometric function.

6. The method according to claim 5, characterized in that, after the current position of the object to be detected is obtained, the method further comprises:

updating the Kalman filter parameters according to the current position of the object to be detected.

7. The method according to any one of claims 1 to 6, characterized in that, after the current position of the object to be detected is obtained, the method further comprises:

enhancing the voice output of the object to be detected.

8. A method of image processing, characterized by comprising:

obtaining, by means of preset microphone arrays, a first depth value between a first microphone array and an image capture device of a display device, and a second depth value between a second microphone array and the image capture device of the display device;

calculating a first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and calculating a second-type angle between the second microphone array and the image capture device corresponding to the second depth value;

constructing a multi-dimensional spatial coordinate system from the first depth value, the second depth value, the first-type angle, and the second-type angle;

obtaining the position of an object to be detected, and determining, according to the multi-dimensional spatial coordinate system, the position of the object to be detected in the multi-dimensional spatial coordinate system.

9. The method according to claim 8, characterized in that calculating the first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and calculating the second-type angle between the second microphone array and the image capture device corresponding to the second depth value, comprises:

calculating the first-type angle and the second-type angle according to a preset condition relating the first depth and the second depth to the actual distance.

10. An apparatus for audio signal processing, characterized by comprising:

a first computing module, configured to perform a calculation according to a first preset algorithm on the audio signals collected by a plurality of microphones to obtain a first predicted position of an object to be detected;

a second computing module, configured to filter the historical positions of the object to be detected according to a second preset algorithm and then calculate a second predicted position of the object to be detected;

a correction module, configured to combine the first predicted position and the second predicted position and correct them according to the continuity of the audio signal in time to obtain the current position of the object to be detected.

11. An apparatus for image processing, characterized by comprising:

an acquisition module, configured to obtain, by means of preset microphone arrays, a first depth value between a first microphone array and an image capture device of a display device, and a second depth value between a second microphone array and the image capture device of the display device;

a computing module, configured to calculate a first-type angle between the first microphone array and the image capture device corresponding to the first depth value, and to calculate a second-type angle between the second microphone array and the image capture device corresponding to the second depth value;

a coordinate space module, configured to construct a multi-dimensional spatial coordinate system from the first depth value, the second depth value, the first-type angle, and the second-type angle;

an acquisition module, configured to obtain the position of an object to be detected and to determine, according to the multi-dimensional spatial coordinate system, the position of the object to be detected in the multi-dimensional spatial coordinate system.

12. A system for audio signal and image processing, characterized by comprising: a video conference terminal, an image capture device, a depth image capture device, a sound collection module composed of a plurality of microphone arrays, and a display device, wherein

the sound collection module composed of the plurality of microphone arrays is configured to collect the audio signal of an object to be detected;

the image capture device is configured to capture all video images of a meeting place;

the depth image capture device is configured to capture depth images of the meeting place, the depth images being used to obtain position information between participants and the depth image capture device;

the video conference terminal is configured to track the positions of the participants, display the image of a participant while speaking, and record meeting minutes.