CN113539288A - Voice signal denoising method and device - Google Patents

Voice signal denoising method and device

Info

Publication number
CN113539288A
CN113539288A (application CN202110829968.5A)
Authority
CN
China
Prior art keywords
azimuth
sound source
target
determining
azimuth angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110829968.5A
Other languages
Chinese (zh)
Inventor
郝昊
李骊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Huajie Imi Technology Co ltd
Beijing HJIMI Technology Co Ltd
Original Assignee
Nanjing Huajie Imi Technology Co ltd
Beijing HJIMI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Huajie Imi Technology Co ltd, Beijing HJIMI Technology Co Ltd filed Critical Nanjing Huajie Imi Technology Co ltd
Priority to CN202110829968.5A priority Critical patent/CN113539288A/en
Publication of CN113539288A publication Critical patent/CN113539288A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Abstract

The invention discloses a voice signal denoising method and device. The method obtains multiple channels of voice signals, containing noise, collected by a microphone array; performs sound source localization on the signals to determine a sound source azimuth; determines the voice signal located at a target azimuth as noise, the angle deviation between the target azimuth and the sound source azimuth being not less than a preset angle deviation threshold; determines a corresponding target subband filter coefficient based on that angle deviation; and denoises the voice signal located at the target azimuth based on the target subband filter coefficient. The invention can avoid affecting the voice signal output by the target sound source, effectively denoise the collected multi-channel voice signals, and effectively improve the audio denoising capability.

Description

Voice signal denoising method and device
Technical Field
The present invention relates to the field of signal processing technologies, and in particular, to a method and an apparatus for denoising a speech signal.
Background
As signal processing technology improves, voice signal denoising technology continues to advance.
Currently, a voice collecting device can be used in a specific place to collect, denoise, and record the voice signal output by a target sound source.
However, the prior art cannot effectively denoise the collected voice signal.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for denoising a speech signal that overcome, or at least partially solve, the above problems. The technical solution is as follows:
a method for denoising a speech signal, comprising:
obtaining multiple channels of voice signals collected by a microphone array, wherein the voice signals include noise;
performing sound source localization on the multi-channel voice signals and determining a sound source azimuth;
determining the voice signal located at a target azimuth as noise, wherein the angle deviation between the target azimuth and the sound source azimuth is not less than a preset angle deviation threshold;
determining a corresponding target subband filter coefficient based on the angle deviation between the target azimuth and the sound source azimuth; and
denoising the voice signal located at the target azimuth based on the target subband filter coefficient.
Optionally, the performing sound source localization on the multiple paths of voice signals and determining a sound source azimuth angle includes:
and carrying out sound source positioning on the multi-path voice signals by combining a microphone array sound source positioning technology and an image recognition technology, and determining the azimuth angle of the sound source.
Optionally, the combining with the microphone array sound source localization technology and the image recognition technology, performing sound source localization on the multiple paths of voice signals, and determining the sound source azimuth angle includes:
acquiring a target image which is shot by a camera and contains a speaker, identifying key points of the speaker in the target image by utilizing an image identification technology, determining the coordinates of the key points of the speaker, and determining a first azimuth angle of the speaker based on the coordinates of the key points of the speaker;
carrying out sound source localization on the multi-path voice signals by using a microphone array sound source localization technology, and determining a second azimuth angle of a target sound source;
and judging whether the first azimuth angle is matched with the second azimuth angle, if so, determining that the speaker is the target sound source, and determining the sound source azimuth angle based on the first azimuth angle and the second azimuth angle.
Optionally, the obtaining a target image shot by a camera and containing a speaker, performing human key point recognition on the speaker in the target image by using an image recognition technology, determining a human key point coordinate of the speaker, and determining a first azimuth of the speaker based on the human key point coordinate includes:
obtaining the target image which is shot by a depth camera and contains the speaker, and determining the depth distance from the speaker to the depth camera in the target image;
recognizing key points of the human body of the speaker in the target image by utilizing a human body posture estimation technology, and determining the head coordinate of the speaker;
determining the first azimuth based on the depth distance and the head coordinates.
Optionally, the determining whether the first azimuth is matched with the second azimuth includes:
performing coordinate system transformation on the first azimuth to obtain a third azimuth corresponding to the first azimuth in the microphone array coordinate system;
comparing whether the difference value of the third azimuth angle and the second azimuth angle is not larger than a preset azimuth angle deviation threshold value, and if so, determining that the first azimuth angle is matched with the second azimuth angle;
said determining said sound source azimuth based on said first azimuth and said second azimuth comprises:
determining the sound source azimuth based on the third azimuth and the second azimuth.
Optionally, the determining a corresponding target subband filter coefficient based on the angle deviation value between the target azimuth and the sound source azimuth includes:
inputting the target azimuth angle and the sound source azimuth angle into a sub-band filter coefficient calculation model, and determining a sub-band filter coefficient output by the sub-band filter coefficient calculation model as the target sub-band filter coefficient; wherein:
the subband filter coefficient calculation model is as follows:
h_voice(f_i) = (0.5 + 0.5·cos(θ(f_i) - θ_voice))^10
wherein i is the index of a spectral line in the frequency domain of the multi-channel voice signal, f_i is the frequency corresponding to the spectral line with index i, h_voice(f_i) is the subband filter coefficient corresponding to frequency f_i, θ(f_i) is the target azimuth corresponding to frequency f_i, and θ_voice is the sound source azimuth.
Optionally, the denoising processing, performed on the speech signal located at the target azimuth angle based on the target subband filter coefficient, includes:
inputting the target sub-band filter coefficient and the voice signal positioned on the target azimuth angle into a denoising calculation model to obtain a denoised voice signal output by the denoising calculation model; wherein, the denoising calculation model is as follows:
y(f_i) = x(f_i) · h_voice(f_i);
wherein y(f_i) is the denoised speech signal and x(f_i) is the speech signal located at the target azimuth.
Optionally, the head coordinates include a head abscissa, and the determining the first azimuth angle based on the depth distance and the head coordinates includes:
inputting the depth distance and the head abscissa into an azimuth calculation model, and determining an azimuth output by the azimuth calculation model as the first azimuth; wherein the azimuth calculation model is:
(The azimuth calculation model is given as a formula image in the original publication and is not reproduced here.)
wherein the model takes the head abscissa x_1 and the depth distance d as inputs and outputs the first azimuth θ_1.
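Since the formula image is not reproduced here, the following is an illustrative sketch only: it assumes the common pinhole-geometry form θ_1 = arctan(x_1 / d), which is an assumption and not necessarily the patent's exact model.

```python
import math

def first_azimuth(head_x: float, depth: float) -> float:
    """Illustrative azimuth model (assumed arctan form, not the patent's
    exact formula, which appears only as an image): the angle between the
    camera's optical axis and the ray towards the speaker's head.

    head_x: head abscissa relative to the optical axis, in metres.
    depth:  depth distance d from the speaker to the depth camera, metres.
    Returns the first azimuth in degrees.
    """
    return math.degrees(math.atan2(head_x, depth))

# A speaker 0.5 m to the side at 2 m depth sits at roughly 14 degrees.
print(round(first_azimuth(0.5, 2.0), 1))
```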
A speech signal denoising apparatus, comprising: the device comprises a first obtaining unit, a first positioning unit, a first determining unit, a second determining unit, a third determining unit and a denoising processing unit, wherein:
the first obtaining unit is configured to perform: obtaining a plurality of paths of voice signals collected by a microphone array, wherein the plurality of paths of voice signals comprise noise;
the first positioning unit configured to perform: carrying out sound source positioning on the multi-channel voice signals;
the first determination unit is configured to perform: determining a sound source azimuth angle;
the second determination unit configured to perform: determining the voice signal positioned on a target azimuth as noise, wherein the angle deviation value of the target azimuth and the sound source azimuth is not less than a preset angle deviation threshold;
the third determination unit is configured to perform: determining a corresponding target sub-band filter coefficient based on the angle deviation value of the target azimuth and the sound source azimuth;
the denoising processing unit is configured to perform: and denoising the voice signal positioned on the target azimuth angle based on the target subband filter coefficient.
Optionally, the first positioning unit is configured to perform:
and carrying out sound source positioning on the multi-path voice signals by combining a microphone array sound source positioning technology and an image recognition technology.
Optionally, the first positioning unit includes: a second obtaining unit, a first identifying unit, a fourth determining unit, a fifth determining unit, a second positioning unit, a sixth determining unit, a judging unit and a seventh determining unit, wherein:
the second obtaining unit is configured to perform: obtaining a target image containing a speaker shot by a camera;
the first identification unit is configured to perform: carrying out human body key point identification on the speaker in the target image by utilizing an image identification technology;
the fourth determination unit configured to perform: determining the human body key point coordinates of the speaker;
the fifth determination unit configured to perform: determining a first azimuth angle of the speaker based on the human body key point coordinates;
the second positioning unit is configured to perform: carrying out sound source positioning on the multi-path voice signals by using a microphone array sound source positioning technology;
the sixth determining unit configured to perform: determining a second azimuth angle of the target sound source;
the judging unit is configured to execute: judging whether the first azimuth angle is matched with the second azimuth angle, and if so, triggering the seventh determining unit;
the seventh determining unit configured to perform: determining that the speaker is the target sound source;
the first determination unit is configured to perform: determining the sound source azimuth based on the first azimuth and the second azimuth.
Optionally, the second obtaining unit is configured to perform: obtaining the target image which is shot by a depth camera and contains the speaker;
the first recognition unit includes: an eighth determining unit and a second identifying unit;
the eighth determining unit configured to perform: determining a depth distance from the speaker to the depth camera in the target image;
the second identification unit is configured to perform: identifying human key points of the human body region of the speaker in the target image by utilizing a human body posture estimation technology;
the fourth determination unit configured to perform: determining head coordinates of the speaker;
the fifth determination unit configured to perform: determining the first azimuth based on the depth distance and the head coordinates.
Optionally, the first azimuth angle is obtained in a camera coordinate system, and the second azimuth angle is obtained in a microphone array coordinate system;
the judging unit includes: the device comprises a transformation unit, a third obtaining unit, a comparison unit and a ninth determining unit;
the transformation unit configured to perform: transforming the coordinate system of the first azimuth;
the third obtaining unit is configured to perform: obtaining a third azimuth corresponding to the first azimuth in the microphone array coordinate system;
the comparison unit configured to perform: comparing whether the difference value of the third azimuth angle and the second azimuth angle is not larger than a preset azimuth angle deviation threshold value, and if so, triggering the ninth determining unit;
the ninth determining unit configured to perform: determining that the first azimuth matches the second azimuth;
the first determination unit is configured to perform: determining the sound source azimuth based on the third azimuth and the second azimuth.
Optionally, the third determining unit includes: a first input unit and a coefficient determination unit;
the first input unit configured to perform: inputting the target azimuth angle and the sound source azimuth angle into a sub-band filter coefficient calculation model;
the coefficient determination unit is configured to perform: determining the subband filter coefficient output by the subband filter coefficient calculation model as the target subband filter coefficient; wherein:
the subband filter coefficient calculation model is as follows:
h_voice(f_i) = (0.5 + 0.5·cos(θ(f_i) - θ_voice))^10
wherein i is the index of a spectral line in the frequency domain of the multi-channel voice signal, f_i is the frequency corresponding to the spectral line with index i, h_voice(f_i) is the subband filter coefficient corresponding to frequency f_i, θ(f_i) is the target azimuth corresponding to frequency f_i, and θ_voice is the sound source azimuth.
Optionally, the denoising processing unit includes: a second input unit and a fourth obtaining unit;
the second input unit configured to perform: inputting the target sub-band filter coefficient and the voice signal positioned on the target azimuth angle into a denoising calculation model;
the fourth obtaining unit is configured to perform: obtaining a denoised voice signal output by the denoising calculation model; wherein, the denoising calculation model is as follows:
y(f_i) = x(f_i) · h_voice(f_i);
wherein y(f_i) is the denoised speech signal and x(f_i) is the speech signal located at the target azimuth.
Optionally, the head coordinates comprise a head abscissa;
the fifth determination unit includes: a third input unit and an azimuth angle determination unit;
the third input unit configured to perform: inputting the depth distance and the head abscissa into an azimuth calculation model;
the azimuth determination unit is configured to perform: determining an azimuth angle output by the azimuth angle calculation model as the first azimuth angle; wherein the azimuth calculation model is:
(The azimuth calculation model is given as a formula image in the original publication and is not reproduced here.)
wherein the model takes the head abscissa x_1 and the depth distance d as inputs and outputs the first azimuth θ_1.
The voice signal denoising method and device provided by this embodiment can obtain multiple channels of voice signals, containing noise, collected by a microphone array; perform sound source localization on them to determine a sound source azimuth; determine the voice signal located at a target azimuth as noise, the angle deviation between the target azimuth and the sound source azimuth being not less than a preset angle deviation threshold; determine a corresponding target subband filter coefficient based on that angle deviation; and denoise the voice signal located at the target azimuth based on the target subband filter coefficient. The invention can avoid affecting the voice signal output by the target sound source, effectively denoise the collected multi-channel voice signals, and effectively improve the audio denoising capability.
The foregoing description is only an overview of the technical solutions of the present invention, and the following detailed description of the present invention is provided to enable the technical means of the present invention to be more clearly understood, and to enable the above and other objects, features, and advantages of the present invention to be more clearly understood.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart illustrating a first speech signal denoising method according to an embodiment of the present invention;
FIG. 2 illustrates a uniform circular microphone array provided by embodiments of the present invention;
FIG. 3 is a schematic diagram showing the variation of subband filter coefficients with respect to azimuth;
FIG. 4 is a flowchart illustrating a second speech signal denoising method according to an embodiment of the present invention;
fig. 5 shows a schematic structural diagram of a first speech signal denoising apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, the present embodiment provides a first speech signal denoising method, which may include the following steps:
s101, obtaining a plurality of paths of voice signals collected by a microphone array, wherein the plurality of paths of voice signals comprise noise;
wherein the microphone array may be constituted by a plurality of microphones. Each microphone in the microphone array can acquire a path of voice signal.
It should be noted that, while collecting the voice signal output by a target sound source (e.g., a speaker), the microphone array may simultaneously collect voice signals output by non-target sound sources (e.g., another speaker, an air conditioner, or a television). The voice signals output by non-target sound sources and collected by the microphone array constitute noise; the invention can denoise the collected signals as a whole and improve the voice signal collection quality.
It is understood that the above-mentioned multi-path speech signal may be a speech signal collected by a microphone array and output by a target sound source.
S102, carrying out sound source positioning on the multi-channel voice signals, and determining a sound source azimuth angle;
the azimuth of the sound source may be the azimuth of the target sound source.
Alternatively, the sound source azimuth may be an azimuth of the target sound source in a first spatial coordinate system established with the center point of the microphone array as an origin.
Specifically, after the multi-channel voice signals are collected, the sound source of the collected multi-channel voice signals is positioned in the first space coordinate system by using the microphone array sound source positioning technology, and the azimuth angle of the sound source is determined. Then, the invention can identify the noise in the multi-path voice signals by utilizing the azimuth angle of the sound source.
When the microphone array sound source positioning technology is used for sound source positioning, the invention can specifically perform sound source positioning based on a beam forming mode. Specifically, in the process of sound source localization based on the beam forming mode, the invention can firstly carry out time delay compensation on the collected multi-path voice signals, carry out fixed beam forming in each direction of the space, and determine the direction with the maximum beam output power as the sound source direction, thereby determining the azimuth angle of the sound source.
To better illustrate the beamforming-based sound source localization technique, it is introduced here taking a uniform circular array as an example. As shown in fig. 2, the microphone array may be a uniform circular microphone array formed by N microphones uniformly distributed on a circle of radius r, where θ is the angle between the incident speech signal and the x-axis, i.e., the azimuth of the sound source (the sound source being a speaker), and the pitch angle of the speaker is the angle between the speech signal and the positive z-axis (the symbol for the pitch angle appears only as an image in the original publication). The coordinates of each microphone can be recorded as (x_n, y_n) (n = 1, 2, …, N).
The relative time delay τ_n of the nth microphone is given by a formula shown as an image in the original publication (not reproduced here), in which c may be the speed of sound in air.
Specifically, the present invention may calculate the beam output power at each angle based on the relative time delay, and determine the angle corresponding to the maximum beam output power as the azimuth of the sound source.
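The scan described above can be sketched in Python. The relative-delay expression used here, τ_n = (r/c)·sin(φ)·cos(θ - 2πn/N), is the standard far-field formula for a uniform circular array and is an assumption, since the patent's own formula appears only as an image; all function names are illustrative.

```python
import cmath
import math

C = 343.0  # speed of sound in air, m/s

def mic_delays(theta, phi, r, n_mics):
    """Relative time delay of each microphone of a uniform circular array
    for a far-field source at azimuth theta and pitch phi (radians).
    Standard far-field form, assumed here."""
    return [r / C * math.sin(phi) * math.cos(theta - 2.0 * math.pi * n / n_mics)
            for n in range(n_mics)]

def localize(freq, theta_true, phi, r=0.05, n_mics=6, step_deg=1):
    """Delay-and-sum scan: steer a narrowband beam over candidate azimuths
    and return the azimuth (degrees) whose beam output power is maximal."""
    # Phasors observed at each mic for a unit-amplitude tone from theta_true.
    obs = [cmath.exp(-2j * math.pi * freq * t)
           for t in mic_delays(theta_true, phi, r, n_mics)]
    best_deg, best_power = 0, -1.0
    for deg in range(0, 360, step_deg):
        taus = mic_delays(math.radians(deg), phi, r, n_mics)
        # Time-delay compensation: advance each channel by its steering delay.
        beam = sum(o * cmath.exp(2j * math.pi * freq * t)
                   for o, t in zip(obs, taus))
        if abs(beam) ** 2 > best_power:
            best_power, best_deg = abs(beam) ** 2, deg
    return best_deg

# A 1 kHz tone arriving from 60 degrees azimuth is located at 60 degrees.
print(localize(1000.0, math.radians(60), math.radians(80)))
```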
Optionally, after the multi-channel voice signals are collected, the collected multi-channel voice signals are preprocessed in advance. Specifically, in the preprocessing process, the high-pass filtering processing may be performed on the collected signals in advance to remove the dc component in the multiple voice signals, and then the corresponding sampling rate may be determined according to the frequency bandwidth of the multiple voice signals to perform the down-sampling processing on the multiple voice signals.
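A minimal sketch of this preprocessing stage, assuming a simple one-pole DC-blocking high-pass filter and plain decimation (the patent does not specify the filter design or the sampling rates):

```python
def remove_dc(signal, alpha=0.995):
    """One-pole DC-blocking high-pass filter:
    y[n] = x[n] - x[n-1] + alpha * y[n-1]."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in signal:
        y = x - prev_x + alpha * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

def downsample(signal, factor):
    """Keep every `factor`-th sample (a full implementation would low-pass
    filter first to avoid aliasing)."""
    return signal[::factor]

# A constant (pure DC) input decays towards zero after the DC blocker.
filtered = remove_dc([1.0] * 1000)
print(abs(filtered[-1]) < 0.05)
print(len(downsample(filtered, 3)))
```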
Optionally, the invention may perform subband analysis on the preprocessed multiple voice signals to obtain corresponding frequency domain signals. Then, the invention can perform echo cancellation on the frequency domain signal, for example, the adaptive filter can be used to simulate the echo signal received by the microphone, and the frequency domain signal after echo cancellation is obtained by adopting spectral subtraction.
Optionally, after obtaining the frequency domain signal after performing echo cancellation, the present invention may perform active speech detection on the frequency domain signal by calculating a signal-to-noise ratio and a log spectrum deviation thereof, so as to determine whether the multi-channel speech signal actually belongs to noise (i.e., an invalid speech signal) or an valid speech signal.
Optionally, the method may perform the subsequent steps when determining that the multiple voice signals actually belong to valid voice signals; when the invention determines that the multi-channel voice signal actually belongs to the noise, the execution of the subsequent steps can be forbidden, and the resource consumption is reduced.
Optionally, when determining that the multiple voice signals actually belong to effective voice signals, the invention may perform sound source localization on the collected multiple voice signals by using a microphone array sound source localization technology.
S103, determining the voice signal on the target azimuth as noise, wherein the angle deviation value of the target azimuth and the sound source azimuth is not less than a preset angle deviation threshold;
the angle deviation value may be an absolute value of a value obtained by subtracting the azimuth of the sound source from the target azimuth.
The target azimuth may be an azimuth having an angle deviation value from the azimuth of the sound source not less than the angle deviation threshold.
It should be noted that the angle deviation threshold may be set by a technician according to an actual working condition, and the present invention is not limited thereto. For example, the angle deviation threshold may be set to 5 degrees or 10 degrees. Of course, the angle deviation threshold may be 0 degrees.
Specifically, among the multi-channel voice signals, the present invention may determine the voice signal at a target azimuth as a voice signal output by a non-target sound source, that is, as noise. It should be noted that, when the angle deviation threshold is 0 degrees, the present invention may determine all voice signals not at the sound source azimuth as noise.
It is understood that when the sound source azimuth and the angle deviation threshold are determined, the target azimuth may include a plurality of azimuths, and the voice signals at different target azimuths may be different noises. For example, when the azimuth of the sound source is 60 degrees and the angle deviation threshold is 5 degrees, the target azimuth may be 50 degrees or 70 degrees, and in the multi-path speech signal, the speech signal with the azimuth of 50 degrees may be determined as the first noise, and the speech signal with the azimuth of 70 degrees may be determined as the second noise.
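The classification rule above can be sketched as follows; the wrap-around handling at the 0/360-degree boundary is an added assumption.

```python
def is_noise_azimuth(azimuth, source_azimuth, threshold):
    """A signal's azimuth is treated as noise when its angular deviation
    from the sound source azimuth is not less than the threshold (degrees)."""
    deviation = abs(azimuth - source_azimuth) % 360.0
    deviation = min(deviation, 360.0 - deviation)  # shortest angular distance
    return deviation >= threshold

# With a source at 60 degrees and a 5-degree threshold, signals at 50 and
# 70 degrees are both classified as noise, while 62 degrees is not.
print(is_noise_azimuth(50, 60, 5), is_noise_azimuth(70, 60, 5),
      is_noise_azimuth(62, 60, 5))
```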
S104, determining corresponding target sub-band filter coefficients based on the angle deviation value of the target azimuth and the sound source azimuth;
wherein the target subband filter coefficient may be a subband filter coefficient corresponding to the target azimuth.
Specifically, the present invention may determine the target subband filter coefficient corresponding to the noise located at the target azimuth based on the angle deviation between the target azimuth and the sound source azimuth, design a corresponding subband filter based on the target subband filter coefficient, and denoise the noise located at the target azimuth.
Optionally, step S104 may include:
inputting the target azimuth angle and the sound source azimuth angle into a sub-band filter coefficient calculation model, and determining a sub-band filter coefficient output by the sub-band filter coefficient calculation model as a target sub-band filter coefficient; wherein:
the subband filter coefficient calculation model is as follows:
h_voice(f_i) = (0.5 + 0.5·cos(θ(f_i) - θ_voice))^10
wherein i is the index of a spectral line in the frequency domain of the multi-channel voice signal, f_i is the frequency corresponding to the spectral line with index i, h_voice(f_i) is the subband filter coefficient corresponding to frequency f_i, θ(f_i) is the target azimuth corresponding to frequency f_i, and θ_voice is the sound source azimuth.
It is understood that the subband filter coefficient calculating model may output a corresponding subband filter coefficient based on an angle deviation value of the target azimuth angle from the azimuth angle of the sound source.
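The coefficient model can be computed directly; the conversion of degree-valued azimuths to radians before taking the cosine is an assumption the text leaves implicit.

```python
import math

def subband_coeff(target_azimuth_deg, source_azimuth_deg):
    """h_voice = (0.5 + 0.5*cos(theta - theta_voice))**10, with the angle
    difference converted to radians (assumed convention)."""
    diff = math.radians(target_azimuth_deg - source_azimuth_deg)
    return (0.5 + 0.5 * math.cos(diff)) ** 10

print(subband_coeff(60, 60))            # zero deviation: coefficient 1.0
print(round(subband_coeff(90, 60), 4))  # 30-degree deviation: heavily attenuated
```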
And S105, denoising the voice signal positioned on the target azimuth angle based on the target subband filter coefficient.
Specifically, the present invention may construct a corresponding subband filter based on a target subband filter coefficient corresponding to the target azimuth, and then perform directional denoising processing on the noise located on the target azimuth by using the constructed subband filter.
Optionally, when determining the target subband filter coefficient by using the subband filter coefficient calculation model, step S105 may include:
inputting the target sub-band filter coefficient and the voice signal positioned on the target azimuth angle into a denoising calculation model to obtain a denoised voice signal output by the denoising calculation model; the denoising calculation model is as follows:
y(f_i) = x(f_i)·h_voice(f_i)
where y(f_i) is the denoised speech signal and x(f_i) is the speech signal located on the target azimuth.
It is understood that the denoising calculation model may be a subband filter corresponding to the target subband filter coefficient.
Specifically, the present invention may construct the corresponding subband filter, i.e. the denoising calculation model, after determining the target subband filter coefficient. The speech signal on the target azimuth among the multi-channel speech signals is then input into the constructed subband filter, which attenuates it in proportion to the coefficient: the signal spectrum at each frequency is multiplied by the target subband filter coefficient, and the filter outputs the denoised speech signal. The relationship between the target subband filter coefficient and the target azimuth may be as shown in fig. 3, where the horizontal axis is the azimuth, the vertical axis is the noise attenuation coefficient (representing the target subband filter coefficient), and the sound source azimuth is 60 degrees. The larger the angle deviation value between the target azimuth and the sound source azimuth, the smaller the noise attenuation coefficient for that azimuth, i.e. the smaller the target subband filter coefficient and the stronger the denoising applied to the noise.
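The per-bin multiplication y(f_i) = x(f_i)·h_voice(f_i) can be sketched as follows (a toy example applied to a magnitude spectrum; names are illustrative):

```python
import numpy as np

def denoise_spectrum(x_f, theta_f_deg, theta_voice_deg):
    # Multiply each spectral bin x(f_i) by its subband coefficient
    # h_voice(f_i): bins whose estimated azimuth lies far from the
    # sound source azimuth are strongly attenuated.
    h = (0.5 + 0.5 * np.cos(np.radians(
        np.asarray(theta_f_deg, dtype=float) - theta_voice_deg))) ** 10
    return np.asarray(x_f, dtype=float) * h
```

With the source at 60 degrees, a bin localized at 60 degrees passes through unchanged while a bin localized at 150 degrees is reduced by roughly three orders of magnitude, matching the attenuation curve of fig. 3.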
It should be further noted that, by performing directional denoising on the speech signal (i.e. the noise) on the target azimuth, the invention can effectively denoise the collected multi-channel speech signals without affecting the speech signal output by the target sound source, thereby effectively improving the audio denoising capability and the speech signal collection and processing quality.
The speech signal denoising method provided in this embodiment may obtain multi-channel speech signals collected by a microphone array, where the multi-channel speech signals include noise; perform sound source localization on the multi-channel speech signals and determine the sound source azimuth; determine the speech signal located on a target azimuth, whose angle deviation value from the sound source azimuth is not less than a preset angle deviation threshold, as noise; determine the corresponding target subband filter coefficient based on that angle deviation value; and perform denoising processing on the speech signal located on the target azimuth based on the target subband filter coefficient. The invention can thus effectively denoise the collected multi-channel speech signals without affecting the speech signal output by the target sound source, effectively improving the audio denoising capability and the speech signal collection and processing quality.
Based on the steps shown in fig. 1, as shown in fig. 4, the present embodiment proposes a second speech signal denoising method. In this method, step S102 may be embodied as step S201. Wherein:
s201, sound source positioning is carried out on the multi-path voice signals by combining a microphone array sound source positioning technology and an image recognition technology, and a sound source azimuth angle is determined.
It should be noted that, when the sound source is a speaker, the present invention can perform sound source localization by combining the speaker localization technology in speech recognition (such as the microphone array sound source localization technology) and the human key point recognition technology in image recognition.
The invention can use the human key point identification technology in image identification as the auxiliary positioning mode of the microphone array sound source positioning technology to position the sound source of the multi-channel voice signals collected by the microphone array.
Optionally, the azimuth angle obtained by the microphone array sound source positioning technology can be compared and fitted with the azimuth angle obtained by image recognition, so that the accuracy of the azimuth angle of the sound source is improved.
Optionally, step S201 may include steps S301, S302, S303, S304, S305, S306, S307, and S308. Wherein:
s301, obtaining a target image which is shot by a camera and contains a speaker;
s302, recognizing key points of a human body of a speaker in a target image by utilizing an image recognition technology;
s303, determining the coordinates of the key points of the human body of the speaker;
s304, determining a first azimuth angle of the speaker based on the coordinates of the key points of the human body;
the first azimuth may be an azimuth obtained by positioning a speaker through an image recognition technique.
Optionally, steps S301, S302, S303, and S304 may include:
obtaining a target image which is shot by a depth camera and contains a speaker, and determining the depth distance from the speaker to the depth camera in the target image;
identifying key points of a human body of the speaker in the target image by using a human body posture estimation technology, and determining the head coordinate of the speaker;
based on the depth distance and the head coordinates, a first azimuth is determined.
Specifically, after the target image is obtained, foreground segmentation can be performed on the target image by background subtraction to extract the speaker's human body target region from the target image. Human body posture estimation can then be performed on that region, and the skeleton model shown in fig. 5 can be used to display the joint points of the human skeleton, so that the coordinates of all human body key points, including the speaker's head coordinates, can be determined.
Optionally, in the process of performing foreground segmentation on the target image by background subtraction, the speaker's motion region may be detected from the difference between the current frame image and the background image according to the following formula (1).
X_0 = { x : |I_c(x) - I_b(x)| > T }    (1)
where I_b(x) is the current background image, I_c(x) is the current frame image, T is the threshold, and the segmented foreground X_0 is the initial position of the speaker.
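Formula (1) amounts to thresholding the per-pixel difference between the current frame and the background image; a minimal sketch (array and function names are illustrative):

```python
import numpy as np

def foreground_mask(frame, background, threshold):
    # Mark pixels where |I_c(x) - I_b(x)| > T as moving foreground;
    # the True region of the mask is the speaker's motion region X_0.
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff > threshold
```

In practice the background image would be updated over time and the raw mask cleaned with morphological filtering, but those refinements are outside formula (1).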
Specifically, after determining the initial position X_0 of the speaker, the invention can adopt a particle-filter-based target tracking method and complete the tracking estimation of the speaker through the steps of particle set initialization, importance sampling, resampling, updating, and iteration.
Optionally, the method of the invention may determine the coordinates of the speaker's human body key points by applying human body posture estimation to the human body image block obtained from foreground segmentation and particle filtering. Specifically, the method can divide the whole human body image block into image blocks of different regions, such as the head, the trunk, and the four limbs, according to parameters such as the contour, edges, and texture of the image; then extract human body joint points from each divided image block according to a human skeleton model, to serve as initial parameters of the human body posture at the current moment; then, starting from the initial parameters, search the feasible space of each joint parameter using optimization methods, to find the joint positions that best match the observed image data and obtain better-matched human body posture parameters. Finally, the optimal joint positions obtained by matching can be written back into the initial parameters, and subsequent parameter estimation can take these parameters as reference.
It should be noted that, in order to further improve the accuracy of the posture estimation, the prior knowledge of the human motion can be used to constrain the human posture estimation.
Specifically, the center of the depth camera can be used as an origin, a second space coordinate system is established, and the azimuth of the speaker, namely the first azimuth, is determined in the second space coordinate system based on the depth distance and the head coordinate of the speaker in the target image.
The present invention may calculate the first azimuth using the following formula (2). At this time, the determining the first azimuth angle based on the depth distance and the head coordinate may include:
inputting the depth distance and the head abscissa into an azimuth calculation model, and determining an azimuth output by the azimuth calculation model as the first azimuth; wherein the azimuth calculation model is:
θ_1 = arctan(x_1 / d)    (2)
where θ_1 may be the first azimuth, x_1 may be the head abscissa of the speaker, and d may be the depth distance of the speaker in the target image.
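Assuming the head abscissa x_1 is expressed in the same metric units as the depth distance d, the azimuth model reduces to a single arctangent. The exact form of formula (2) is reconstructed from the listed variables, so treat this sketch as an assumption rather than the patent's definitive formula:

```python
import math

def first_azimuth_deg(head_x, depth_d):
    # theta_1 = arctan(x_1 / d): the lateral head offset over the depth
    # distance gives the speaker's bearing in the camera (second)
    # coordinate system; atan2 keeps the sign for speakers left of center
    return math.degrees(math.atan2(head_x, depth_d))
```

A speaker directly in front of the camera (head_x = 0) yields 0 degrees; equal lateral offset and depth yields 45 degrees.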
It should be noted that the camera and the microphone array may be combined into an integrated device. When the installation position of the integrated device is determined, the positions of the camera and the microphone array can be determined accordingly.
Specifically, the origin of the second spatial coordinate system, established with the camera as origin, may differ from that of the first spatial coordinate system, established with the center of the microphone array as origin, while the directions of the axes, i.e. the x-axis, y-axis, and z-axis, may be the same.
S305, carrying out sound source positioning on the multi-channel voice signals by using a microphone array sound source positioning technology, and determining a second azimuth angle of a target sound source;
the second azimuth may be an azimuth obtained by performing sound source localization on the multiple voice signals through a microphone array sound source localization technique.
S306, judging whether the first azimuth angle is matched with the second azimuth angle, and if so, executing the step S307; otherwise, step S308 is executed.
S307, determining that the speaker is the target sound source, and determining the sound source azimuth angle based on the first azimuth angle and the second azimuth angle.
Specifically, when the first azimuth angle is determined to be matched with the second azimuth angle, the speaker can be determined to be the target sound source, and the azimuth angle of the sound source can be further determined according to the first azimuth angle and the second azimuth angle.
S308, determining that the speaker is a non-target sound source, and forbidding determining the sound source azimuth angle based on the first azimuth angle and the second azimuth angle to avoid unnecessary resource consumption.
Specifically, when the first azimuth angle and the second azimuth angle are not matched, the method and the device can determine that the speaker is a non-target sound source, and forbid the determination of the sound source azimuth angle according to the first azimuth angle and the second azimuth angle, so that unnecessary resource consumption is avoided.
When sound source localization is performed in a location where multiple persons are present, the microphone array in the location may collect the multi-channel speech signals while the camera photographs each person, obtaining multiple images (each containing one person). Each photographed image may then be used in turn as the target image (the person in each target image being assumed to be a speaker), and sound source localization may be performed on the multi-channel speech signals collected by the microphone array by combining the microphone array sound source localization technique and the image recognition technique, i.e. by executing steps S301, S302, S303, S304, S305, S306, S307, and S308.
For example, when two persons are present in site A, the invention can collect multi-channel speech signals with the microphone array in site A and photograph the two persons separately with the camera, obtaining a first image containing the first person and a second image containing the second person. The first image can then be used as the target image, and steps S301 to S308 executed to localize the sound source in the multi-channel speech signals collected by the microphone array; afterwards, the second image can be used as the target image and steps S301 to S308 executed in the same way.
It can be understood that, if it is determined that the speaker in a certain target image is a non-target sound source, the present invention may continue to perform sound source localization by using the next target image until the target image including the target sound source is determined, so as to calculate the sound source azimuth angle by using the target image including the target sound source, thereby improving the sound source localization accuracy.
Optionally, when a certain image shot by the camera contains a plurality of persons, the invention can extract the human body areas of the persons respectively from the images in advance, and respectively use the human body areas of the persons as a target image to perform sound source positioning.
Alternatively, the first azimuth angle is obtained in the camera coordinate system (i.e. the second spatial coordinate system), and the second azimuth angle is obtained in the microphone array coordinate system (i.e. the first spatial coordinate system), and step S306 may include:
carrying out coordinate system transformation on the first azimuth to obtain a third azimuth corresponding to the first azimuth in a microphone array coordinate system;
comparing whether the difference value of the third azimuth angle and the second azimuth angle is not larger than a preset azimuth angle deviation threshold value, and if so, determining that the first azimuth angle is matched with the second azimuth angle; otherwise, the first azimuth angle is determined not to match the second azimuth angle.
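Once both angles are expressed in the microphone array coordinate system, the matching test of step S306 is a simple threshold comparison (a sketch; the deviation threshold value is an illustrative parameter, since the patent leaves it to the practitioner):

```python
def azimuths_match(third_az_deg, second_az_deg, deviation_threshold_deg=10.0):
    # The image-derived azimuth (after transform to the array coordinate
    # system, i.e. the third azimuth) matches the microphone-derived
    # second azimuth when their absolute difference is within threshold.
    return abs(third_az_deg - second_az_deg) <= deviation_threshold_deg
```

A match confirms the photographed speaker as the target sound source; a mismatch means the speaker is a non-target sound source and the next target image should be tried.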
The azimuth deviation threshold may be determined by a skilled person according to actual conditions, which is not limited by the present invention.
Optionally, step S307 may include:
determining that the speaker is the target sound source, and determining the sound source azimuth based on the third azimuth and the second azimuth.
Optionally, when determining that the first azimuth matches the second azimuth, the present invention may determine the azimuth of the sound source based on the third azimuth and the second azimuth. In the process of determining the azimuth angle of the sound source, the invention can carry out weighted average on the third azimuth angle and the second azimuth angle, and the value obtained by the weighted average is determined as the azimuth angle of the sound source;
alternatively, the present invention may directly determine the third azimuth or the second azimuth as the sound source azimuth.
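The two fusion options described above, weighted averaging or directly taking one of the angles, might be sketched as follows (the weight is an illustrative parameter, not specified by the patent):

```python
def fuse_azimuths(third_az_deg, second_az_deg, w_image=0.5):
    # Weighted average of the image-derived (third) and microphone-derived
    # (second) azimuths; w_image = 1.0 or 0.0 degenerates to directly
    # taking the third or the second azimuth as the sound source azimuth.
    return w_image * third_az_deg + (1.0 - w_image) * second_az_deg
```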
It should be noted that, by combining the speaker localization technique in speech recognition with the human body key point recognition technique in image recognition, the present sound source localization method can improve the localization precision for the speaker, so as to further improve the denoising capability for the speech signal output by the target sound source.
The speech signal denoising method provided by this embodiment can perform sound source localization by combining the speaker localization technique in speech recognition with the human body key point recognition technique in image recognition, improving the sound source localization precision for the speaker and thereby further improving the denoising capability for the speech signal output by the target sound source.
Corresponding to the steps shown in fig. 1, as shown in fig. 5, the present embodiment provides a first speech signal denoising apparatus, which may include: a first obtaining unit 101, a first positioning unit 102, a first determining unit 103, a second determining unit 104, a third determining unit 105, and a denoising processing unit 106, wherein:
a first obtaining unit 101 configured to perform: obtaining a plurality of paths of voice signals collected by a microphone array, wherein the plurality of paths of voice signals comprise noise;
wherein the microphone array may be constituted by a plurality of microphones. Each microphone in the microphone array can acquire a path of voice signal.
It should be noted that, while collecting the speech signal output by the target sound source (e.g. a speaker), the microphone array may simultaneously collect speech signals output by non-target sound sources (e.g. another speaker, an air conditioner, or a television). The speech signals output by non-target sound sources and collected by the microphone array constitute noise, and the invention can perform denoising processing on the whole collected speech signal and improve the speech signal collection quality.
It is understood that the above-mentioned multi-path speech signal may be a speech signal collected by a microphone array and output by a target sound source.
A first positioning unit 102 configured to perform: carrying out sound source positioning on the multi-channel voice signals;
a first determination unit 103 configured to perform: determining a sound source azimuth angle;
the azimuth of the sound source may be the azimuth of the target sound source.
Alternatively, the sound source azimuth may be an azimuth of the target sound source in a first spatial coordinate system established with the center point of the microphone array as an origin.
Specifically, after the multi-channel voice signals are collected, the sound source of the collected multi-channel voice signals is positioned in the first space coordinate system by using the microphone array sound source positioning technology, and the azimuth angle of the sound source is determined. Then, the invention can identify the noise in the multi-path voice signals by utilizing the azimuth angle of the sound source.
When performing sound source localization with the microphone array sound source localization technology, the invention may specifically localize based on beamforming. In the beamforming-based localization process, the invention can first perform time-delay compensation on the collected multi-channel speech signals, then perform fixed beamforming in each spatial direction, and determine the direction with the maximum beam output power as the sound source direction, thereby determining the sound source azimuth.
Specifically, the present invention may calculate the beam output power at each angle based on the relative time delay, and determine the angle corresponding to the maximum beam output power as the azimuth of the sound source.
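The delay-compensate-and-scan procedure can be sketched for a far-field linear array with integer-sample delays (a simplification of the beamforming localization described above; the array geometry, function names, and parameters are illustrative assumptions):

```python
import numpy as np

def steered_power_argmax(signals, mic_positions, angles_deg, fs, c=343.0):
    # Delay-and-sum scan over candidate azimuths: for each candidate
    # direction, compensate each channel's propagation delay, sum the
    # channels, and measure the output power. The angle with the maximum
    # beam output power is taken as the sound source direction.
    powers = []
    for ang in angles_deg:
        u = np.sin(np.radians(ang))              # direction cosine along array axis
        summed = np.zeros(signals.shape[1])
        for m, pos in enumerate(mic_positions):
            delay = int(round(pos * u / c * fs))  # per-mic delay in samples
            summed += np.roll(signals[m], -delay)
        powers.append(np.mean(summed ** 2))
    return int(np.argmax(powers))                # index of the best candidate angle
```

Coherent summation boosts the power in the true direction, while wrong candidate directions misalign the channels and their power stays near the incoherent sum.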
A second determining unit 104 configured to perform: determining the voice signal positioned on the target azimuth as noise, wherein the angle deviation value of the target azimuth and the sound source azimuth is not less than a preset angle deviation threshold;
the angle deviation value may be an absolute value of a value obtained by subtracting the azimuth of the sound source from the target azimuth.
The target azimuth may be an azimuth having an angle deviation value from the azimuth of the sound source not less than the angle deviation threshold.
It should be noted that the angle deviation threshold may be set by a technician according to an actual working condition, and the present invention is not limited thereto.
A third determining unit 105 configured to perform: determining a corresponding target sub-band filter coefficient based on the angle deviation value of the target azimuth and the sound source azimuth;
wherein the target subband filter coefficient may be a subband filter coefficient corresponding to the target azimuth.
Specifically, the present invention may determine a target subband filter coefficient corresponding to the noise located on the target azimuth based on the angle deviation value between the target azimuth and the sound source azimuth, design a corresponding subband filter based on the target subband filter coefficient, and perform denoising processing on the noise located on the target azimuth.
Optionally, the third determining unit 105 includes: a first input unit and a coefficient determination unit;
a first input unit configured to perform: inputting a target azimuth angle and a sound source azimuth angle into a sub-band filter coefficient calculation model;
a coefficient determination unit configured to perform: determining the sub-band filter coefficient output by the sub-band filter coefficient calculation model as a target sub-band filter coefficient; wherein:
the subband filter coefficient calculation model is as follows:
h_voice(f_i) = (0.5 + 0.5·cos(θ(f_i) - θ_voice))^10
where i is the index of the spectral line in the frequency domain of the multi-channel speech signal, f_i is the frequency corresponding to the spectral line with index i, h_voice(f_i) is the subband filter coefficient corresponding to frequency f_i, θ(f_i) is the target azimuth corresponding to frequency f_i, and θ_voice is the sound source azimuth.
It is understood that the subband filter coefficient calculating model may output a corresponding subband filter coefficient based on an angle deviation value of the target azimuth angle from the azimuth angle of the sound source.
A denoising processing unit 106 configured to perform: and denoising the voice signal positioned on the target azimuth angle based on the target subband filter coefficient.
Specifically, the present invention may construct a corresponding subband filter based on a target subband filter coefficient corresponding to the target azimuth, and then perform directional denoising processing on the noise located on the target azimuth by using the constructed subband filter.
Optionally, when determining the target subband filter coefficient by using the subband filter coefficient calculation model, the denoising processing unit 106 includes: a second input unit and a fourth obtaining unit;
a second input unit configured to perform: inputting the target sub-band filter coefficient and the voice signal positioned on the target azimuth angle into a denoising calculation model;
a fourth obtaining unit configured to perform: obtaining a denoised voice signal output by a denoising calculation model; the denoising calculation model is as follows:
y(f_i) = x(f_i)·h_voice(f_i)
where y(f_i) is the denoised speech signal and x(f_i) is the speech signal located on the target azimuth.
It is understood that the denoising calculation model may be a subband filter corresponding to the target subband filter coefficient.
Specifically, the present invention may construct the corresponding subband filter, i.e. the denoising calculation model, after determining the target subband filter coefficient. The speech signal on the target azimuth among the multi-channel speech signals is then input into the constructed subband filter, which attenuates it in proportion to the coefficient: the signal spectrum at each frequency is multiplied by the target subband filter coefficient, and the filter outputs the denoised speech signal.
The voice signal denoising device provided by the embodiment can effectively denoise collected multi-channel voice signals while avoiding influencing the voice signals output by a target sound source, effectively improves the audio denoising capability, and improves the voice signal collection quality and the processing quality.
Based on fig. 5, the present embodiment proposes a second speech signal denoising device. In the apparatus, a first positioning unit 102 configured to perform:
and carrying out sound source positioning on the multi-path voice signals by combining a microphone array sound source positioning technology and an image recognition technology.
It should be noted that, when the sound source is a speaker, the present invention can perform sound source localization by combining the speaker localization technology in speech recognition (such as the microphone array sound source localization technology) and the human key point recognition technology in image recognition.
The invention can use the human key point identification technology in image identification as the auxiliary positioning mode of the microphone array sound source positioning technology to position the sound source of the multi-channel voice signals collected by the microphone array.
Optionally, the azimuth angle obtained by the microphone array sound source positioning technology can be compared and fitted with the azimuth angle obtained by image recognition, so that the accuracy of the azimuth angle of the sound source is improved.
Optionally, the first positioning unit 102 includes: a second obtaining unit, a first identifying unit, a fourth determining unit, a fifth determining unit, a second positioning unit, a sixth determining unit, a judging unit and a seventh determining unit, wherein:
a second obtaining unit configured to perform: obtaining a target image containing a speaker shot by a camera;
a first identification unit configured to perform: recognizing key points of a human body of a speaker in a target image by utilizing an image recognition technology;
a fourth determination unit configured to perform: determining the coordinates of human key points of a speaker;
a fifth determination unit configured to perform: determining a first azimuth angle of the speaker based on the coordinates of the key points of the human body;
a second positioning unit configured to perform: carrying out sound source positioning on the multi-path voice signals by using a microphone array sound source positioning technology;
a sixth determination unit configured to perform: determining a second azimuth angle of the target sound source;
a determination unit configured to perform: judging whether the first azimuth angle is matched with the second azimuth angle, and if so, triggering a seventh determining unit;
a seventh determining unit configured to perform: determining that the speaker is a target sound source;
a first determination unit 103 configured to perform: a sound source azimuth is determined based on the first azimuth and the second azimuth.
The first azimuth may be an azimuth obtained by positioning a speaker through an image recognition technique.
Optionally, the second obtaining unit is configured to perform: obtaining a target image which is shot by a depth camera and contains a speaker;
a first identification unit comprising: an eighth determining unit and a second identifying unit;
an eighth determination unit configured to perform: determining the depth distance from the speaker to a depth camera in the target image;
a second recognition unit configured to perform: identifying key points of a human body of a speaker in a target image by using a human body posture estimation technology;
a fourth determination unit configured to perform: determining the head coordinates of the speaker;
a fifth determination unit configured to perform: based on the depth distance and the head coordinates, a first azimuth is determined.
It should be noted that, in order to further improve the accuracy of the posture estimation, the prior knowledge of the human motion can be used to constrain the human posture estimation.
Specifically, the center of the depth camera can be used as an origin, a second space coordinate system is established, and the azimuth of the speaker, namely the first azimuth, is determined in the second space coordinate system based on the depth distance and the head coordinate of the speaker in the target image.
The present invention can calculate the first azimuth using the following equation (3). At this time, the head coordinates include a head abscissa;
a fifth determination unit including: a third input unit and an azimuth angle determination unit;
a third input unit configured to perform: inputting the depth distance and the head abscissa into an azimuth angle calculation model;
an azimuth determination unit configured to perform: determining the azimuth angle output by the azimuth angle calculation model as a first azimuth angle; wherein, the azimuth calculation model is as follows:
θ_1 = arctan(x_1 / d)    (3)
where θ_1 is the first azimuth, x_1 is the head abscissa, and d is the depth distance.
It should be noted that the camera and the microphone array may be combined into an integrated device. When the installation position of the integrated device is determined, the positions of the camera and the microphone array can be determined accordingly.
Specifically, the origin of the second spatial coordinate system, established with the camera as origin, may differ from that of the first spatial coordinate system, established with the center of the microphone array as origin, while the directions of the axes, i.e. the x-axis, y-axis, and z-axis, may be the same.
The second azimuth may be an azimuth obtained by performing sound source localization on the multiple voice signals through a microphone array sound source localization technique.
Specifically, when the first azimuth angle is determined to be matched with the second azimuth angle, the speaker can be determined to be the target sound source, and the azimuth angle of the sound source can be further determined according to the first azimuth angle and the second azimuth angle.
Specifically, when the first azimuth angle and the second azimuth angle are not matched, the method and the device can determine that the speaker is a non-target sound source, and forbid the determination of the sound source azimuth angle according to the first azimuth angle and the second azimuth angle, so that unnecessary resource consumption is avoided.
Optionally, the first azimuth angle is obtained in a camera coordinate system, and the second azimuth angle is obtained in a microphone array coordinate system;
a determination unit including: the device comprises a transformation unit, a third obtaining unit, a comparison unit and a ninth determining unit;
a transformation unit configured to perform: transforming a coordinate system of the first azimuth;
a third obtaining unit configured to perform: obtaining a third azimuth corresponding to the first azimuth in a microphone array coordinate system;
a comparison unit configured to perform: comparing whether the difference value of the third azimuth angle and the second azimuth angle is not larger than a preset azimuth angle deviation threshold value, and if so, triggering a ninth determining unit;
a ninth determining unit configured to perform: determining that the first azimuth angle matches the second azimuth angle;
a first determination unit 103 configured to perform: and determining the azimuth angle of the sound source based on the third azimuth angle and the second azimuth angle.
The azimuth deviation threshold may be determined by a skilled person according to actual conditions, which is not limited by the present invention.
Optionally, when determining that the first azimuth angle matches the second azimuth angle, the present invention may determine the sound source azimuth angle based on the third azimuth angle and the second azimuth angle. In this process, the present invention may compute a weighted average of the third azimuth angle and the second azimuth angle, and determine the weighted-average value as the sound source azimuth angle;
alternatively, the present invention may directly determine the third azimuth angle or the second azimuth angle as the sound source azimuth angle.
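The matching-and-fusion procedure above can be sketched as follows. The threshold value, the weights, and the fixed-offset coordinate transform are illustrative assumptions, not values from this application; a real camera-to-array transform depends on the device geometry:

```python
# Illustrative sketch of the matching-and-fusion step described above.
# AZIMUTH_DEV_THRESHOLD, the weights, and the fixed-offset transform
# are assumptions for illustration only.

AZIMUTH_DEV_THRESHOLD = 5.0  # preset azimuth deviation threshold, in degrees

def transform_to_mic_frame(first_azimuth, mount_offset=0.0):
    # A real camera-to-array transform depends on the rigid mounting
    # between the devices; here it is modeled as a fixed angular offset.
    return first_azimuth + mount_offset

def fuse_azimuths(first_azimuth, second_azimuth, w_image=0.5, w_mic=0.5):
    """Return the fused sound source azimuth, or None when unmatched."""
    third = transform_to_mic_frame(first_azimuth)
    if abs(third - second_azimuth) > AZIMUTH_DEV_THRESHOLD:
        return None  # non-target sound source: skip azimuth determination
    # Weighted average of the third and second azimuth angles
    # (one of the two options named in the text).
    return (w_image * third + w_mic * second_azimuth) / (w_image + w_mic)

fused = fuse_azimuths(31.0, 29.0)      # matched: deviation 2 <= 5
rejected = fuse_azimuths(60.0, 29.0)   # unmatched: deviation 31 > 5
```

Directly returning `third` or `second_azimuth` would implement the alternative option of using one angle as-is.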
It should be noted that, by performing sound source localization with both the speaker localization technique of speech recognition and the human body key point recognition technique of image recognition, the present invention can improve the sound source localization precision for the speaker, thereby further improving the capability of denoising the voice signal output by the target sound source.
The voice signal denoising device provided by this embodiment can perform sound source localization by using a speaker localization technique in speech recognition together with a human body key point recognition technique in image recognition, which improves the sound source localization precision for the speaker and thereby further improves the capability of denoising the voice signal output by the target sound source.
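For concreteness, the overall denoising flow described in this application (localize the source, treat signals at deviating azimuths as noise, attenuate them subband by subband) can be sketched as follows. Thresholds, azimuths, and spectrum values are illustrative, and angles are assumed to be in radians:

```python
import math

# End-to-end sketch: spectral lines whose azimuth deviates from the
# localized sound source by at least the threshold are treated as noise
# and attenuated with the subband coefficient model from the claims.

def subband_coefficient(theta_target, theta_voice):
    # h_voice(f_i) = (0.5 + 0.5*cos(theta(f_i) - theta_voice)) ** 10
    return (0.5 + 0.5 * math.cos(theta_target - theta_voice)) ** 10

def denoise(spectrum, line_azimuths, theta_voice, dev_threshold):
    """Attenuate spectral lines whose azimuth deviates from the source."""
    out = []
    for x, theta in zip(spectrum, line_azimuths):
        if abs(theta - theta_voice) >= dev_threshold:  # treated as noise
            x *= subband_coefficient(theta, theta_voice)
        out.append(x)
    return out

spec = [1.0, 1.0, 1.0]
azims = [0.0, 0.1, math.pi]  # last spectral line lies far off-source
clean = denoise(spec, azims, theta_voice=0.0, dev_threshold=math.pi / 6)
```

Lines within the threshold pass through unchanged; the off-source line is driven toward zero by the sharply direction-selective coefficient.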
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for denoising a speech signal, comprising:
obtaining a plurality of paths of voice signals collected by a microphone array, wherein the plurality of paths of voice signals comprise noise;
carrying out sound source positioning on the multi-channel voice signals, and determining a sound source azimuth angle;
determining the voice signal positioned on a target azimuth as noise, wherein the angle deviation value of the target azimuth and the sound source azimuth is not less than a preset angle deviation threshold;
determining a corresponding target sub-band filter coefficient based on the angle deviation value of the target azimuth and the sound source azimuth;
and denoising the voice signal positioned on the target azimuth angle based on the target subband filter coefficient.
2. The method of claim 1, wherein said performing sound source localization on said multiple voice signals and determining a sound source azimuth angle comprises:
and carrying out sound source positioning on the multi-path voice signals by combining a microphone array sound source positioning technology and an image recognition technology, and determining the azimuth angle of the sound source.
3. The method of claim 2, wherein the combining a microphone array sound source localization technique and an image recognition technique to perform sound source localization on the multi-path speech signal and determine the sound source azimuth comprises:
acquiring a target image which is shot by a camera and contains a speaker, identifying key points of the speaker in the target image by utilizing an image identification technology, determining the coordinates of the key points of the speaker, and determining a first azimuth angle of the speaker based on the coordinates of the key points of the speaker;
carrying out sound source localization on the multi-path voice signals by using a microphone array sound source localization technology, and determining a second azimuth angle of a target sound source;
and judging whether the first azimuth angle is matched with the second azimuth angle, if so, determining that the speaker is the target sound source, and determining the sound source azimuth angle based on the first azimuth angle and the second azimuth angle.
4. The method according to claim 3, wherein the obtaining a target image including a speaker captured by a camera, performing a human keypoint recognition on the speaker in the target image by using an image recognition technique, determining a human keypoint coordinate of the speaker, and determining a first azimuth of the speaker based on the human keypoint coordinate comprises:
obtaining the target image which is shot by a depth camera and contains the speaker, and determining the depth distance from the speaker to the depth camera in the target image;
recognizing key points of the human body of the speaker in the target image by utilizing a human body posture estimation technology, and determining the head coordinate of the speaker;
determining the first azimuth based on the depth distance and the head coordinates.
5. The method of claim 3 or 4, wherein the first azimuth angle is obtained in a camera coordinate system, wherein the second azimuth angle is obtained in a microphone array coordinate system, and wherein the determining whether the first azimuth angle and the second azimuth angle match comprises:
performing coordinate system transformation on the first azimuth to obtain a third azimuth corresponding to the first azimuth in the microphone array coordinate system;
comparing whether the difference value of the third azimuth angle and the second azimuth angle is not larger than a preset azimuth angle deviation threshold value, and if so, determining that the first azimuth angle is matched with the second azimuth angle;
said determining said sound source azimuth based on said first azimuth and said second azimuth comprises:
determining the sound source azimuth based on the third azimuth and the second azimuth.
6. The method of claim 1, wherein determining the corresponding target subband filter coefficients based on the angular deviation values of the target azimuth angle and the sound source azimuth angle comprises:
inputting the target azimuth angle and the sound source azimuth angle into a sub-band filter coefficient calculation model, and determining a sub-band filter coefficient output by the sub-band filter coefficient calculation model as the target sub-band filter coefficient; wherein:
the subband filter coefficient calculation model is as follows:
h_voice(f_i) = (0.5 + 0.5·cos(θ(f_i) - θ_voice))^10
wherein i is the sequence number of a spectral line in the frequency domain of the multi-channel voice signals, f_i is the frequency corresponding to the spectral line with sequence number i, h_voice(f_i) is the subband filter coefficient corresponding to frequency f_i, θ(f_i) is the target azimuth angle corresponding to frequency f_i, and θ_voice is the sound source azimuth angle.
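A minimal sketch of this coefficient model, assuming the azimuth angles are expressed in radians (the claim does not state the unit):

```python
import math

# Subband filter coefficient model from claim 6:
# h_voice(f_i) = (0.5 + 0.5*cos(theta(f_i) - theta_voice)) ** 10

def subband_coefficient(theta_target, theta_voice):
    # Near 1 when the target azimuth aligns with the sound source;
    # the tenth power narrows the passband sharply around the source.
    return (0.5 + 0.5 * math.cos(theta_target - theta_voice)) ** 10

on_axis = subband_coefficient(0.3, 0.3)             # aligned with source
off_axis = subband_coefficient(0.3 + math.pi, 0.3)  # opposite direction
```

The raised-cosine base keeps the coefficient in [0, 1], and the exponent controls how aggressively off-axis subbands are suppressed.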
7. The method of claim 6, wherein denoising the speech signal at a target azimuth based on the target subband filter coefficients comprises:
inputting the target sub-band filter coefficient and the voice signal positioned on the target azimuth angle into a denoising calculation model to obtain a denoised voice signal output by the denoising calculation model; wherein, the denoising calculation model is as follows:
y(f_i) = x(f_i) · h_voice(f_i);
wherein y(f_i) is the denoised voice signal and x(f_i) is the voice signal located at the target azimuth angle.
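A sketch of applying this denoising model across spectral lines; the spectrum values and per-line azimuths are illustrative, and angles are assumed to be in radians:

```python
import math

# Apply y(f_i) = x(f_i) * h_voice(f_i) line by line, using the
# coefficient model from claim 6.

def subband_coefficient(theta_target, theta_voice):
    return (0.5 + 0.5 * math.cos(theta_target - theta_voice)) ** 10

def apply_denoising(spectrum, line_azimuths, theta_voice):
    """Scale each spectral line x(f_i) by its coefficient h_voice(f_i)."""
    return [x * subband_coefficient(theta, theta_voice)
            for x, theta in zip(spectrum, line_azimuths)]

x = [1.0, 1.0]            # two spectral-line magnitudes
thetas = [0.0, math.pi]   # an on-source line, then an opposite-side line
y = apply_denoising(x, thetas, theta_voice=0.0)
```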
8. The method of claim 4, wherein the head coordinates comprise a head abscissa, and wherein determining the first azimuth angle based on the depth distance and the head coordinates comprises:
inputting the depth distance and the head abscissa into an azimuth calculation model, and determining an azimuth output by the azimuth calculation model as the first azimuth; wherein the azimuth calculation model is:
Figure FDA0003175107120000031
wherein θ_1 is the first azimuth angle, x_1 is the head abscissa, and d is the depth distance.
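The azimuth calculation model itself appears only as an image placeholder in this text, so the formula below is a guessed pinhole-style geometric model, not the application's own equation: with the head abscissa x_1 expressed in the camera frame in the same length unit as the depth distance d, θ_1 = arctan(x_1 / d).

```python
import math

# Guessed geometric model for the first azimuth (NOT the application's
# equation, which is only available as an image in the original):
#   theta_1 = arctan(x_1 / d)

def first_azimuth(head_x, depth):
    # atan2 keeps the sign of the lateral offset and handles depth == 0.
    return math.atan2(head_x, depth)

theta = first_azimuth(1.0, 1.0)  # head offset equals depth: 45 degrees
```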
9. A speech signal denoising apparatus, comprising: the device comprises a first obtaining unit, a sound source positioning unit, a first determining unit, a second determining unit, a third determining unit and a denoising processing unit, wherein:
the first obtaining unit is configured to perform: obtaining a plurality of paths of voice signals collected by a microphone array, wherein the plurality of paths of voice signals comprise noise;
the sound source localization unit is configured to perform: carrying out sound source positioning on the multi-channel voice signals;
the first determination unit is configured to perform: determining a sound source azimuth angle;
the second determination unit is configured to perform: determining the voice signal located on a target azimuth angle as noise, wherein the angle deviation value between the target azimuth angle and the sound source azimuth angle is not less than a preset angle deviation threshold;
the third determination unit is configured to perform: determining a corresponding target sub-band filter coefficient based on the angle deviation value of the target azimuth and the sound source azimuth;
the denoising processing unit is configured to perform: and denoising the voice signal positioned on the target azimuth angle based on the target subband filter coefficient.
10. The apparatus according to claim 9, wherein the sound source localization unit is configured to perform:
and carrying out sound source positioning on the multi-path voice signals by combining a microphone array sound source positioning technology and an image recognition technology.
CN202110829968.5A 2021-07-22 2021-07-22 Voice signal denoising method and device Pending CN113539288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110829968.5A CN113539288A (en) 2021-07-22 2021-07-22 Voice signal denoising method and device


Publications (1)

Publication Number Publication Date
CN113539288A true CN113539288A (en) 2021-10-22

Family

ID=78120421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110829968.5A Pending CN113539288A (en) 2021-07-22 2021-07-22 Voice signal denoising method and device

Country Status (1)

Country Link
CN (1) CN113539288A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1953059A (en) * 2006-11-24 2007-04-25 北京中星微电子有限公司 A method and device for noise elimination
CN102831898A (en) * 2012-08-31 2012-12-19 厦门大学 Microphone array voice enhancement device with sound source direction tracking function and method thereof
JP2016050872A (en) * 2014-09-01 2016-04-11 株式会社国際電気通信基礎技術研究所 Sound source position estimation device, sound source position estimation method, and sound source position estimation program
CN106710603A (en) * 2016-12-23 2017-05-24 上海语知义信息技术有限公司 Speech recognition method and system based on linear microphone array
CN112614508A (en) * 2020-12-11 2021-04-06 北京华捷艾米科技有限公司 Audio and video combined positioning method and device, electronic equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JEYASINGH P: "Real-time Multi Source Speech Enhancement based on Sound Source Separation using Microphone Array", IEEE CONFERENCE ON EMERGING DEVICES AND SMART SYSTEMS (ICEDSS 2018), pages 183 - 187 *
MIN XINYU: "Speech enhancement algorithm based on microphone array", Computer Engineering and Design, vol. 41, no. 4, pages 1074 - 1079 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562730A (en) * 2020-11-24 2021-03-26 北京华捷艾米科技有限公司 Sound source analysis method and system
CN114363770A (en) * 2021-12-17 2022-04-15 北京小米移动软件有限公司 Filtering method and device in pass-through mode, earphone and readable storage medium
CN114363770B (en) * 2021-12-17 2024-03-26 北京小米移动软件有限公司 Filtering method and device in pass-through mode, earphone and readable storage medium

Similar Documents

Publication Publication Date Title
CN106328156B (en) Audio and video information fusion microphone array voice enhancement system and method
CN107534725B (en) Voice signal processing method and device
CN107221336B (en) Device and method for enhancing target voice
US7215786B2 (en) Robot acoustic device and robot acoustic system
JP4986433B2 (en) Apparatus and method for recognizing and tracking objects
US7613310B2 (en) Audio input system
CN111044973B (en) MVDR target sound source directional pickup method for microphone matrix
CN113539288A (en) Voice signal denoising method and device
Liu et al. Continuous sound source localization based on microphone array for mobile robots
CN108109617A (en) A kind of remote pickup method
EP3227704B1 (en) Method for tracking a target acoustic source
Naqvi et al. Multimodal (audio–visual) source separation exploiting multi-speaker tracking, robust beamforming and time–frequency masking
JP2008236077A (en) Target sound extracting apparatus, target sound extracting program
CN112614508B (en) Audio and video combined positioning method and device, electronic equipment and storage medium
CN112951257A (en) Audio image acquisition equipment and speaker positioning and voice separation method
CN109685730A (en) A kind of Wavelet noise-eliminating method based on adaptive non-local mean value
CN113093106A (en) Sound source positioning method and system
CN113903353A (en) Directional noise elimination method and device based on spatial discrimination detection
Hosseini et al. Time difference of arrival estimation of sound source using cross correlation and modified maximum likelihood weighting function
CN113707136B (en) Audio and video mixed voice front-end processing method for voice interaction of service robot
CN110992971A (en) Method for determining voice enhancement direction, electronic equipment and storage medium
CN114167356A (en) Sound source positioning method and system based on polyhedral microphone array
CN113740803A (en) Speaker positioning and tracking method and device based on audio and video characteristics
JP4894638B2 (en) Acoustic input device
Naqvi et al. Multimodal blind source separation for moving sources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination