CN113450769B - Speech extraction method, device, equipment and storage medium - Google Patents

Speech extraction method, device, equipment and storage medium

Info

Publication number
CN113450769B
CN113450769B (application CN202010158648.7A)
Authority
CN
China
Prior art keywords
sound source, target sound, image, determining, DOA
Prior art date
Legal status
Active
Application number
CN202010158648.7A
Other languages
Chinese (zh)
Other versions
CN113450769A
Inventor
童仁杰
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010158648.7A
Publication of CN113450769A
Application granted
Publication of CN113450769B

Abstract

The invention provides a voice extraction method, apparatus, device and storage medium. The method comprises: acquiring an image at a target sound source; determining the DOA of the target sound source according to the pixel position of the target sound source in the image; and extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are beams with different directivities preset with the microphone array as reference, and N ≥ 2. According to the embodiments of the invention, when the signal-to-noise ratio of the voice signal is low, and in particular in a long-distance whisper scenario, determining the DOA of the target sound source from the information in the image at the target sound source improves the accuracy of DOA estimation and thereby the quality of the extracted voice signal.

Description

Speech extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of audio signal processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting speech.
Background
At present, remote sound pickup is widely required. For example, covert pickup is needed in some monitoring scenarios. However, current remote pickup technology cannot achieve the effect of close-range pickup.
In the related art, fixed beams directed at multiple azimuth angles are designed using microphone array techniques, and the minimum energy value within each beam is tracked. The minimum-value tracking results of the beam energies are combined to detect the target beam in which the sound source is located, and a beamforming algorithm then suppresses the environmental noise to extract the voice output signal. However, in long-distance, low signal-to-noise-ratio scenarios, estimating the target beam only from the minimum of each beam's energy is error-prone, and the quality of the extracted voice output signal is low.
Disclosure of Invention
The invention provides a voice extraction method, apparatus, device and storage medium to improve voice extraction quality.
In a first aspect, the present invention provides a voice extraction method, including:
acquiring an image at a target sound source;
determining the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image;
extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are beams with different directivities preset with the microphone array as reference, and N ≥ 2.
In a second aspect, the present invention provides a voice extraction apparatus, comprising:
an acquisition module, configured to acquire an image at the target sound source;
a determining module, configured to determine the DOA of the target sound source according to the pixel position of the target sound source in the image;
a processing module, configured to extract a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are beams with different directivities preset with the microphone array as reference, and N ≥ 2.
In a third aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the first aspects.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
A processor; and
A memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of the first aspects via execution of the executable instructions.
The voice extraction method, apparatus, device and storage medium provided by the embodiments of the invention acquire an image at the target sound source, determine the DOA of the target sound source according to the pixel position of the target sound source in the image, and extract the voice output signal of the target sound source according to the DOA and the output signals of N preset beams with different directivities (N ≥ 2) referenced to the microphone array. When the signal-to-noise ratio of the voice signal is low, in particular in a long-distance whisper scenario, determining the DOA of the target sound source from the information in the image at the target sound source improves the accuracy of DOA estimation; extracting the voice output signal of the target sound source according to that DOA then improves the quality of the extracted voice signal.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of an implementation of the principles provided by an embodiment of the present invention;
FIG. 2 is a flowchart of an embodiment of a method for extracting speech according to the present invention;
FIG. 3 is a schematic diagram of beam forming according to an embodiment of the method provided by the present invention;
FIG. 4 is a schematic diagram of an imaging principle of an embodiment of a method provided by the present invention;
FIG. 5 is a schematic flow chart diagram of another embodiment of the method provided by the present invention;
FIG. 6 is a schematic diagram of a voice extraction device according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided by the present invention.
Specific embodiments of the present disclosure have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terms "comprising" and "having" and any variations thereof in the description and claims of the invention and in the drawings are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
First, the terms and application scenarios involved in the invention are introduced:
Microphone array: a plurality of microphones arranged in a geometric pattern. Typically each microphone is omnidirectional, and the frequency responses of the microphones are closely matched.
Direction of arrival (DOA): the direction from which a plane wave arrives at the microphone array. The position of a radiation source is estimated by measuring the direction of arrival of the radiated signal.
Beamforming: weighting and summing the audio signals output by the plurality of microphones to obtain an enhanced speech signal.
Diffuse noise: a noise field with equal power in all directions.
Voice activity detection (VAD) algorithm: detects whether a piece of audio contains human voice activity.
The method provided by the embodiment of the invention is applied to an intelligent monitoring system, for example, the voice is monitored, especially in a long-distance whisper scene, so as to improve the voice extraction quality. The monitoring system can comprise an image acquisition component, a sound acquisition component and a processor chip, wherein the image acquisition component, the sound acquisition component and the processor chip can be integrated on one device or a plurality of devices.
The image acquisition assembly includes, for example, a lens and an image sensor; the sound collection assembly may be, for example, a microphone array comprising at least two microphones. The arrangement of the microphone array may be set as required, e.g., a ring, a polygon, a spiral, etc.
As shown in FIG. 1, the image acquisition device is, for example, a camera 1; the microphone array comprises four microphones 2 arranged in a circular array; and the camera and the microphone array are fixed by a fixing member 3.
In the method, the current scene mode is determined from the collected image data. If the scene is in whisper mode, the DOA is estimated from the image data with high accuracy, and the voice output signal of the target sound source is then extracted, according to the estimated DOA, from the output signals of beams with different directivities.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 2 is a flow chart of an embodiment of the voice extraction method according to the present invention. As shown in FIG. 2, the method of this embodiment includes:
Step 101, acquiring an image at a target sound source.
From the image information collected by the image acquisition assembly, it is judged whether someone is whispering; whispering is generally accompanied by distinctive body-posture cues, such as two heads leaning close together, mouth to ear. For example, if two adjacent faces detected in an image are very close to each other, the two people are likely to be whispering.
The image acquisition component can acquire a plurality of images at different positions and different angles in the current scene, and a target sound source is positioned according to the acquired images.
Suppose the face detection algorithm detects N faces in total, and the pixel positions of the center points of the face regions are (x_i, y_i), i = 1, 2, ..., N. From the pixel distance between adjacent faces in an image, it can be judged whether two people are whispering to each other.
In one implementation, a pixel distance between adjacent faces in the image may be determined;
If the pixel distance is less than the preset threshold, the operation of step 102 is performed.
The pixel distance between adjacent faces may be determined by:
determining pixel positions of center points of adjacent faces in the image;
and determining the pixel distance between the adjacent faces according to the pixel positions of the central points of the adjacent faces in the image.
Specifically, the pixel distance between the adjacent i-th and j-th faces may be expressed as d_ij = sqrt((x_i − x_j)^2 + (y_i − y_j)^2). In other embodiments, the pixel distance may be calculated in other ways, which the present application does not limit.
When the pixel distance between the center points of adjacent face regions is lower than a preset threshold ε, it can be judged that the two heads are very close and the two people are most likely communicating in a whisper. In this case, voice extraction can proceed in whisper mode. The decision can be expressed as an indicator: I_ij = 1 if d_ij < ε, and I_ij = 0 otherwise.
An indicator value of 1 triggers the whisper mode; a value of 0 triggers the normal mode.
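As a rough sketch, this decision can be written as follows; the threshold value and the face-center coordinates are hypothetical inputs that a face detection algorithm would supply:

```python
import math

def pixel_distance(p_i, p_j):
    """Euclidean pixel distance between the center points of two face regions."""
    return math.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])

def mode_indicator(p_i, p_j, eps):
    """Return 1 (whisper mode) when the face centers are closer than the
    threshold eps, otherwise 0 (normal mode)."""
    return 1 if pixel_distance(p_i, p_j) < eps else 0
```

In whisper mode the method proceeds to step 102; in normal mode a conventional sound source localization algorithm can be used instead.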
Step 102, determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
Specifically, assuming that the adjacent i-th and j-th faces are detected to be close together and the whisper mode is triggered, the pixel position of the target sound source in the image may be determined as follows:
Determining the pixel position of a target sound source in the image according to the pixel position of the center point of each face in the adjacent faces of the image; the distance between adjacent faces is less than a preset threshold.
In one implementation, the center positions of two center points of adjacent faces of an image may be determined as pixel positions of a target sound source in the image.
Or in other ways, the pixel position of the center point of any one of the adjacent faces may also be used as the pixel position of the target sound source in the image. Or, other pixel positions between two center points of the adjacent face are taken as the pixel positions of the target sound source in the image, which is not limited in the present application.
For example, the pixel position (x_s, y_s) of the target sound source in the image can be calculated by:
(x_s, y_s) = ((x_i + x_j)/2, (y_i + y_j)/2)
The DOA of the target sound source is then determined from its pixel position in the image, for example by using the imaging principle and the geometric relationship between the pixel position of the target sound source in the image and its spatial position.
In summary, in whisper mode the voice signal-to-noise ratio is low, and a conventional sound source localization algorithm would estimate the DOA inaccurately; in the embodiment of the application, the DOA is instead determined from the image corresponding to the target sound source, which is more accurate.
Step 103, extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are beams with different directivities preset with the microphone array as reference, and N ≥ 2.
Specifically, after determining the DOA, noise suppression may be achieved through beamforming based on the DOA result, and a speech output signal may be extracted.
As shown in FIG. 3, the space can be divided by azimuth into N regions, the central azimuth of the l-th region being φ_l, l = 1, ..., N (N = 6 in this example). The microphone array in FIG. 3 comprises four microphones in a circular array. The target beam corresponding to the target sound source is determined by computing the weights corresponding to the N beams, and the voice output signal is then extracted.
The method of this embodiment acquires an image at the target sound source, determines the DOA of the target sound source according to its pixel position in the image, and extracts the voice output signal of the target sound source according to the DOA and the output signals of N preset beams with different directivities (N ≥ 2) referenced to the microphone array. When the signal-to-noise ratio of the voice signal is low, in particular in a long-distance whisper scenario, determining the DOA from the information in the image at the target sound source improves the accuracy of DOA estimation; extracting the voice output signal according to that DOA then improves the quality of the extracted voice signal.
Based on the above embodiment, in another embodiment, step 102 of determining the DOA of the target sound source may be implemented as follows:
And determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between the lens and the image sensor in the image acquisition assembly, the pixel position of the center point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
As shown in fig. 4, let the distance between the lens and the image sensor be f1, the pixel position corresponding to the center point of the lens be (x_0, y_0), and the distance between adjacent photosensitive elements be Δd. The pitch angle of the target sound source can then be calculated, for example, as:
θ_s = arctan((y_s − y_0)·Δd / f1)
Other variations of this formula may be used in other embodiments. The azimuth of the target sound source can be obtained similarly, for example as φ_s = arctan((x_s − x_0)·Δd / f1), thereby obtaining the DOA.
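A minimal sketch of this pixel-to-angle mapping, assuming the pinhole-model formulas above; the calibration constants f1, (x0, y0) and delta_d are hypothetical camera parameters:

```python
import math

def pixel_to_doa(xs, ys, x0, y0, f1, delta_d):
    """Map the target-source pixel (xs, ys) to (pitch, azimuth) in radians.
    The pixel offset from the lens center, scaled by the photosite pitch
    delta_d, is divided by the lens-to-sensor distance f1."""
    pitch = math.atan((ys - y0) * delta_d / f1)
    azimuth = math.atan((xs - x0) * delta_d / f1)
    return pitch, azimuth
```

A source imaged exactly at the lens-center pixel maps to zero pitch and zero azimuth; offsets in either axis rotate the corresponding angle.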
As shown in fig. 5, audio-based DOA estimation is susceptible to factors such as the signal-to-noise ratio, whereas the video signal is not affected by the speech signal-to-noise ratio. Therefore, if the normal mode is triggered (the pixel distance between adjacent faces in the image is greater than the preset threshold), the signal-to-noise ratio of the voice signal is relatively high, and the DOA of the target sound source can be obtained by a conventional sound source localization algorithm (such as SRP or MUSIC). If the whisper mode is triggered (the pixel distance between adjacent faces is smaller than the preset threshold), the signal-to-noise ratio is low, and the DOA of the sound source can be estimated as in this embodiment, that is, the DOA of the target sound source in real space is obtained from the pixel information of the image.
In the normal mode, the signal-to-noise ratio of the voice signal is high, and VAD is used to judge whether voice activity exists. If no voice exists, the original waveform acquired by the microphone array is output; otherwise, the DOA is estimated by a sound source localization algorithm, a beamformer is constructed to suppress the environmental noise, and the voice output signal is extracted.
Further, the weights of the N beams are obtained according to the noise distribution and DOA information, and the voice output signal of the target sound source is extracted.
In one embodiment, step 103 may be implemented as follows:
according to the DOA, determining the weights corresponding to the N beams;
determining the output signals of the N beams according to the weights corresponding to the N beams and the voice signals received by the microphone array;
and acquiring the voice output signal of the target sound source according to the output signals of the N beams.
Specifically, the weights of the N beams are computed from the DOA values obtained in the foregoing embodiments, and the target speech is extracted. A possible implementation is described below, taking diffuse noise as an example.
The specific process is detailed as follows:
Diffuse noise is uniformly distributed in space: with the microphone array as reference point, the noise power is equal in all directions. Assuming M microphones, for a diffuse noise field the correlation coefficient of channels i and j at frequency f can be calculated as:
Ω_ij(f) = sinc(2πf·l_ij / c), where sinc(x) = sin(x)/x
Here l_ij denotes the straight-line distance between channels i and j, c denotes the speed of sound, and Ω_ij(f) denotes the element in the i-th row and j-th column of the covariance matrix Ω(f). Channel i is the channel corresponding to the i-th microphone; channel j is the channel corresponding to the j-th microphone.
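Under the sinc coherence model for a diffuse field, the covariance matrix can be built as in the sketch below; the microphone coordinates are hypothetical:

```python
import numpy as np

def diffuse_covariance(mic_xyz, f, c=343.0):
    """Spatial coherence matrix of a diffuse noise field:
    Omega_ij(f) = sinc(2*pi*f*l_ij/c) with sinc(x) = sin(x)/x."""
    mic_xyz = np.asarray(mic_xyz, dtype=float)
    # Pairwise straight-line distances l_ij between microphones i and j.
    l = np.linalg.norm(mic_xyz[:, None, :] - mic_xyz[None, :, :], axis=-1)
    # np.sinc(x) computes sin(pi*x)/(pi*x), so pass it 2*f*l/c.
    return np.sinc(2.0 * f * l / c)
```

The diagonal entries are 1 (each channel is fully coherent with itself), and the off-diagonal coherence decays with both microphone spacing and frequency.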
In an embodiment, weights corresponding to the N beams are determined according to the covariance matrix and steering vectors corresponding to the N beams.
The steering vectors of the N beams can be determined from the pitch angle included in the DOA and the central azimuths of the spatial regions corresponding to the N beams.
Assume the DOA corresponding to a certain beam is (θ, φ_l), where θ is the pitch angle from the DOA and φ_l is the central azimuth of the l-th region. From (θ, φ_l) and the array geometry, the delay of each microphone in the microphone array relative to the reference microphone is calculated.
The steering vector of the l-th beam may then be d_l(f) = [e^(−j2πf·τ_1), e^(−j2πf·τ_2), ..., e^(−j2πf·τ_M)]^T, where τ_i denotes the delay of the i-th microphone relative to the reference microphone and is uniquely determined by the sound source direction and the array shape. The reference microphone is one of the M microphones, for example the microphone that receives the voice signal first.
The weight w_l(f) can be calculated, for example, as:
w_l(f) = Ω^(−1)(f)·d_l(f) / (d_l^H(f)·Ω^(−1)(f)·d_l(f))
Applying w_l(f) to the input multi-channel audio data (i.e., the vector of voice signals received by the M microphones) enhances the target voice and suppresses the environmental noise.
Assume that at a certain time-frequency point (t, f) the vector of voice signals received by the microphone array is x(t, f). The output signals of the N beams are then y_l(t, f) = w_l^H(f)·x(t, f), l = 1, 2, ..., N. The voice output signal of the target sound source is then obtained from the N beam outputs: for example, the target beam among the N beams is determined from the DOA of the target sound source, and the output signal of that target beam is taken. Further, the output signal of the target beam may be enhanced, e.g., multiplied by a gain, which may be a fixed preset value or calculated by other means.
Here x(t, f) may be the signal vector obtained after framing and frequency-domain transformation, for example by a short-time Fourier transform (STFT).
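The weight formula and beam output above can be sketched as follows; this is a minimum-diffuse-noise (MVDR-style) beamformer under the assumptions of the formulas above, a sketch rather than the patent's exact implementation:

```python
import numpy as np

def steering_vector(f, taus):
    """d(f) = [exp(-j*2*pi*f*tau_1), ..., exp(-j*2*pi*f*tau_M)]^T, where
    taus are per-microphone delays relative to the reference microphone."""
    return np.exp(-2j * np.pi * f * np.asarray(taus, dtype=float))

def beam_weight(omega, d):
    """w(f) = Omega^{-1} d / (d^H Omega^{-1} d): unit gain toward the beam's
    steering direction, minimal diffuse-noise power elsewhere."""
    oid = np.linalg.solve(omega, d)
    return oid / (d.conj() @ oid)

def beam_output(w, x):
    """One beam's output y(t, f) = w^H x(t, f) at a single time-frequency point."""
    return w.conj() @ x
```

By construction the weight satisfies the distortionless constraint w^H·d = 1, so a signal arriving exactly from the steered direction passes through unchanged.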
In the method of this embodiment, thanks to the accuracy of the DOA estimation, the weights of the N beams are determined according to the DOA; the output signals of the N beams are determined from those weights and the voice signals received by the microphone array; and the voice output signal of the target sound source is obtained from the output signals of the N beams, which improves the quality of the extracted voice signal.
The above solution has low complexity and is easy to implement, but places high demands on the accuracy of the DOA estimate. In other embodiments, to improve the stability of the algorithm, the weights may be determined as follows:
determining the weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams, where the covariance matrix is the covariance matrix of diffuse noise at frequency f with respect to the microphone array.
For example, the loaded matrix can be determined by:
Ω_ε(f) = Ω(f) + ε·I
The diagonal loading factor ε controls the white-noise gain and the beam width of the beamformer. Considering DOA error and microphone mismatch, ε should be chosen so that the beam has good white-noise gain and an appropriate beam width; ε may be set according to actual requirements.
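Diagonal loading itself is a one-line operation; the sketch below also illustrates why it helps: the loaded matrix is better conditioned, which stabilizes the matrix inverse in the weight formula (the loading value used in the usage note is a hypothetical choice):

```python
import numpy as np

def load_diagonal(omega, eps):
    """Omega_eps(f) = Omega(f) + eps * I. Larger eps improves white-noise
    gain and widens the beam, trading off diffuse-noise suppression."""
    return omega + eps * np.eye(omega.shape[0])
```

For a nearly singular coherence matrix (closely spaced microphones at low frequency), even a small eps such as 0.01 reduces the condition number by orders of magnitude.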
By azimuth, the space can be divided into N_b regions, the central azimuth of the i-th region being φ_i, corresponding to N_b beamformers w_i(f), i = 1, ..., N_b. Assuming that at some time-frequency point (t, f) the received signal vector of the microphone array is x(t, f), the outputs of the N_b beams can be represented as y_i(t, f) = w_i^H(f)·x(t, f), i = 1, 2, ..., N_b.
In one embodiment, to improve the quality of the extracted target speech, the following manner may be adopted:
Determining the probability of the target beam having the voice of the target sound source according to the output signals of the target beam corresponding to the target sound source and the output signals of the N beams; the target beam is one of N beams;
determining a third post-processing gain according to the first post-processing gain of the target beam with voice, the second post-processing gain of the target beam without voice and the probability;
And determining a voice output signal of the target sound source according to the third post-processing gain.
Specifically, let the second post-processing gain of the target beam when voice is absent be a preset fixed value G_min, and the first post-processing gain when voice is present be G_s, where G_s may be obtained by a classical noise-reduction algorithm. The total, third post-processing gain can then be calculated as G = p·G_s + (1 − p)·G_min, where p represents the probability that the voice of the target sound source is present in the target beam.
In general, if the energy of a certain beam is large, the voice of the target sound source is likely to lie in that beam; that is, the voice presence probability p(t, f) is highly correlated with the energy of each beam. Let the DOA of the target sound source be (θ_s, φ_s), and let the corresponding target beam have output y_s(t, f). Then at time-frequency point (t, f), the probability that voice is present in the target beam can be calculated, for example, as:
p(t, f) = |y_s(t, f)|² / Σ_l |y_l(t, f)|²
The final voice output signal may then be y_o(t, f) = G·y_s(t, f).
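The post-processing step can be sketched as follows; the energy-ratio probability mirrors the formula above, and the gain values G_s and G_min used in the test are hypothetical:

```python
import numpy as np

def speech_presence_prob(beam_outputs, target_idx):
    """Estimate p(t, f): the target beam's energy over the total energy of
    all beam outputs at one time-frequency point."""
    energies = np.abs(np.asarray(beam_outputs)) ** 2
    return energies[target_idx] / energies.sum()

def post_gain(p, g_s, g_min):
    """Total post-processing gain G = p * G_s + (1 - p) * G_min."""
    return p * g_s + (1.0 - p) * g_min

def extract_output(y_target, p, g_s, g_min):
    """Final voice output y_o(t, f) = G * y_s(t, f)."""
    return post_gain(p, g_s, g_min) * y_target
```

When the target beam dominates the total energy, p approaches 1 and the full noise-reduction gain G_s applies; when it does not, the output is attenuated toward the floor G_min.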
In the method of this embodiment, the probability that the voice of the target sound source is present in the target beam is determined from the output signal of the target beam corresponding to the target sound source and the output signals of the N beams, the target beam being one of the N beams; a third post-processing gain is determined from the first post-processing gain when voice is present in the target beam, the second post-processing gain when voice is absent, and the probability; and the voice output signal of the target sound source is determined from the third post-processing gain. This further improves the quality of the extracted voice signal.
Fig. 6 is a block diagram of an embodiment of a voice extraction device according to the present invention, as shown in fig. 6, where the voice extraction device of the present embodiment includes:
an acquisition module 601, configured to acquire an image at a target sound source;
a determining module 602, configured to determine a direction of arrival DOA of the target sound source according to a pixel position of the target sound source in the image;
a processing module 603, configured to extract a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are beams with different directivities preset with the microphone array as reference, and N ≥ 2.
In one possible implementation manner, the determining module 602 is specifically configured to:
determining a pixel distance between adjacent faces in the image;
and if the pixel distance is smaller than a preset threshold, performing the operation of determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
In one possible implementation manner, the determining module 602 is specifically configured to:
determining pixel positions of center points of adjacent faces in the image;
and determining the pixel distance between the adjacent faces according to the pixel positions of the central points of the adjacent faces in the image.
In one possible implementation, the determining module 602 is further configured to:
determining the pixel position of the target sound source in the image according to the pixel position of the center point of each face in the adjacent faces of the image; the distance between the adjacent faces is smaller than a preset threshold.
In one possible implementation manner, the determining module 602 is specifically configured to:
And determining the central positions of two central points of adjacent faces of the image as the pixel positions of the target sound source in the image.
In one possible implementation manner, the determining module 602 is specifically configured to:
And determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens in the image acquisition assembly and an image sensor, the pixel position of a center point of the lens in the image, and the distance between adjacent photosensitive elements in the image sensor.
In one possible implementation manner, the processing module 603 is specifically configured to:
according to the DOA, determining the weights corresponding to the N beams;
determining the output signals of the N beams according to the weights corresponding to the N beams and the voice signals received by the microphone array;
and acquiring the voice output signal of the target sound source according to the output signals of the N beams.
In one possible implementation manner, the processing module 603 is specifically configured to:
Determining the probability of the target beam having the voice of the target sound source according to the output signals of the target beam corresponding to the target sound source and the output signals of the N beams; the target beam is one beam of the N beams;
Determining a third post-processing gain according to the first post-processing gain of the target beam with voice, the second post-processing gain of the target beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
In one possible implementation manner, the processing module 603 is specifically configured to:
determining the steering vectors corresponding to the N beams according to the pitch angle included in the DOA and the central azimuths of the spatial regions corresponding to the N beams;
determining the weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix is the covariance matrix of diffuse noise at frequency f with respect to the microphone array.
In one possible implementation, the processing module 603 is configured to:
determining the target beam corresponding to the target sound source according to the azimuth angle included in the DOA and the central azimuth angle of the spatial region corresponding to each of the N beams;
and determining the output signal of the target beam corresponding to the target sound source according to the weight corresponding to the target beam and the voice signal received by the microphone array.
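A rough sketch of the beam-selection step, assuming (hypothetically) that the target beam is simply the one whose central azimuth is closest to the DOA azimuth; azimuth wrap-around at 360° is ignored for brevity:

```python
def select_target_beam(azimuth, beam_centers):
    # Return the index of the beam whose spatial region's central
    # azimuth is closest to the DOA azimuth of the target sound source.
    return min(range(len(beam_centers)),
               key=lambda n: abs(beam_centers[n] - azimuth))
```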
The apparatus of this embodiment may be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not described here again.
Fig. 7 is a block diagram of an embodiment of an electronic device according to the present invention, as shown in fig. 7, where the electronic device includes:
a processor 701, a microphone array 702, and an image acquisition component 703; optionally, a memory storing instructions executable by the processor 701 may also be included.
The image acquisition component 703 is used to acquire images. The microphone array 702 is used to collect voice signals.
These components may communicate with one another via one or more buses.
The processor 701 is configured to execute the methods of the foregoing method embodiments by executing the executable instructions; for the specific implementation process, reference may be made to the foregoing method embodiments, which is not repeated here.
An embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the methods of the foregoing method embodiments. The specific implementation process, implementation principles, and technical effects are similar to those of the foregoing method embodiments and are not repeated here.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method of speech extraction, comprising:
acquiring an image at a target sound source;
determining a pixel distance between adjacent faces in the image;
if the pixel distance is smaller than a preset threshold, determining a direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and
extracting a voice output signal of the target sound source according to the DOA and output signals of N preset beams; the N beams are beams with different preset directions relative to the microphone array, and N ≥ 2;
wherein the determining a direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image comprises:
determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens in the image acquisition assembly and an image sensor, the pixel position of the center point of the lens in the image, and the distance between adjacent photosensitive elements in the image sensor.
2. The method of claim 1, wherein the determining a pixel distance between adjacent faces in the image comprises:
determining pixel positions of center points of adjacent faces in the image;
and determining the pixel distance between the adjacent faces according to the pixel positions of the central points of the adjacent faces in the image.
3. The method according to claim 1 or 2, further comprising, before the determining a direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image:
determining the pixel position of the target sound source in the image according to the pixel positions of the center points of the adjacent faces in the image, wherein the pixel distance between the adjacent faces is smaller than the preset threshold.
4. The method according to claim 3, wherein the determining the pixel position of the target sound source in the image according to the pixel positions of the center points of the adjacent faces in the image comprises:
determining the midpoint between the center points of the two adjacent faces in the image as the pixel position of the target sound source in the image.
5. The method according to claim 1 or 2, wherein the extracting the voice output signal of the target sound source according to the DOA and the output signals of the N preset beams comprises:
determining, according to the DOA, weights corresponding to the N beams;
determining output signals of the N beams according to the weights corresponding to the N beams and the voice signals received by the microphone array;
and acquiring the voice output signal of the target sound source according to the output signals of the N beams.
6. The method of claim 5, wherein the acquiring the voice output signal of the target sound source according to the output signals of the N beams comprises:
determining the probability that speech of the target sound source is present in a target beam according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams, the target beam being one of the N beams;
determining a third post-processing gain according to a first post-processing gain for the case where speech is present in the target beam, a second post-processing gain for the case where speech is absent from the target beam, and the probability;
and determining the voice output signal of the target sound source according to the third post-processing gain.
7. The method of claim 5, wherein the DOA includes a pitch angle and an azimuth angle of the target sound source, and wherein the determining, according to the DOA, weights corresponding to the N beams comprises:
determining steering vectors corresponding to the N beams according to the pitch angle included in the DOA and the central azimuth angle of the spatial region corresponding to each of the N beams;
determining weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix is the covariance matrix of the diffuse noise at frequency bin f for the microphone array.
8. The method of claim 6, further comprising, before the determining the probability that speech of the target sound source is present in the target beam according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams:
determining the target beam corresponding to the target sound source according to the azimuth angle included in the DOA and the central azimuth angle of the spatial region corresponding to each of the N beams;
and determining the output signal of the target beam corresponding to the target sound source according to the weight corresponding to the target beam and the voice signal received by the microphone array.
9. A speech extraction apparatus, comprising:
The acquisition module is used for acquiring an image at the target sound source;
a determining module, configured to:
determine a pixel distance between adjacent faces in the image;
and if the pixel distance is smaller than a preset threshold, determine the DOA of the target sound source according to the pixel position of the target sound source in the image;
the processing module is configured to extract a voice output signal of the target sound source according to the DOA and output signals of N preset beams; the N beams are beams with different preset directions relative to the microphone array, and N ≥ 2;
wherein the determining module is specifically configured to determine the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens in the image acquisition assembly and an image sensor, the pixel position of the center point of the lens in the image, and the distance between adjacent photosensitive elements in the image sensor.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-8.
11. An electronic device, comprising:
A processor, a microphone array, and an image acquisition assembly;
the image acquisition component is used for acquiring images;
The microphone array is used for receiving voice signals;
The processor is configured to perform the method of any of claims 1-8.
CN202010158648.7A 2020-03-09 Speech extraction method, device, equipment and storage medium Active CN113450769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010158648.7A CN113450769B (en) 2020-03-09 Speech extraction method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113450769A CN113450769A (en) 2021-09-28
CN113450769B true CN113450769B (en) 2024-06-25


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107534725A (en) * 2015-05-19 2018-01-02 华为技术有限公司 A kind of audio signal processing method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant