CN113450769A - Voice extraction method, device, equipment and storage medium - Google Patents

Voice extraction method, device, equipment and storage medium

Info

Publication number
CN113450769A
CN113450769A (application CN202010158648.7A; granted as CN113450769B)
Authority
CN
China
Prior art keywords
sound source
target sound
image
determining
doa
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010158648.7A
Other languages
Chinese (zh)
Other versions
CN113450769B (en)
Inventor
童仁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010158648.7A priority Critical patent/CN113450769B/en
Publication of CN113450769A publication Critical patent/CN113450769A/en
Application granted granted Critical
Publication of CN113450769B publication Critical patent/CN113450769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/24 - Speech recognition using non-acoustical features
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention provides a voice extraction method, apparatus, device and storage medium. The method comprises the following steps: acquiring an image at a target sound source; determining the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are preset beams pointing in different directions with respect to the microphone array and N ≥ 2. According to the embodiments of the invention, when the signal-to-noise ratio of the voice signal is low, and especially in long-distance whispered-conversation scenarios, determining the DOA of the target sound source from the information of the image at the target sound source improves the accuracy of DOA estimation and thus the quality of the extracted voice signal.

Description

Voice extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of audio signal processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting speech.
Background
At present, long-distance sound pickup is in wide demand. For example, covert pickup is required in some surveillance scenarios. However, current long-distance pickup technology cannot achieve the effect of short-distance pickup.
In the related art, microphone array technology is used to design fixed beams pointing at multiple azimuth angles and to track the energy minimum within each beam. The minimum-tracking results of all beams are combined to detect the target beam in which the sound source lies, and a beamforming algorithm is then used to suppress environmental noise and extract the voice output signal. However, in long-distance, low signal-to-noise-ratio scenarios, estimating the target beam from the beam energy minima alone is error-prone, and the quality of the extracted voice output signal is poor.
Disclosure of Invention
The invention provides a voice extraction method, apparatus, device and storage medium for improving the quality of voice extraction.
In a first aspect, the present invention provides a speech extraction method, including:
acquiring an image at a target sound source;
determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
In a second aspect, the present invention provides a speech extraction apparatus, comprising:
the acquisition module is used for acquiring an image at a target sound source;
the determining module is used for determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
the processing module is used for extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method of any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of the first aspects via execution of the executable instructions.
The voice extraction method, apparatus, device and storage medium provided by the embodiments of the invention acquire an image at a target sound source; determine the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and extract a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are preset beams pointing in different directions with respect to the microphone array and N ≥ 2. When the signal-to-noise ratio of the voice signal is low, and especially in long-distance whispered-conversation scenarios, determining the DOA of the target sound source from the image information improves the accuracy of DOA estimation, and extracting the voice output signal of the target sound source according to this DOA improves the quality of the extracted voice signal.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a principle implementation provided by an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech extraction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of beam forming according to an embodiment of the method provided by the present invention;
FIG. 4 is a schematic diagram of the imaging principle of an embodiment of the method provided by the present invention;
FIG. 5 is a schematic flow chart of another embodiment of the method provided by the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a speech extraction apparatus provided in the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "comprising" and "having," and any variations thereof, in the description and claims of this invention and the drawings described herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
First, the terms and application scenarios involved in the present invention are introduced:
microphone array: a plurality of microphones arranged in a geometric shape. Each microphone is generally non-directional and there is good uniformity in frequency response between microphones.
Direction Of Arrival (DOA): the direction of the plane wave reaching the microphone array. The position of the radiation source is estimated by measuring the direction of arrival of the radiation signal.
Beam forming: weighted summation of the audio signals output by a plurality of microphones to obtain an enhanced voice signal.
Scattering (diffuse) noise: a noise field whose power is equal in every direction.
Voice Activity Detection (VAD) algorithm: detects whether a given piece of audio contains human voice activity.
The method provided by the embodiments of the invention is applied to an intelligent monitoring system, for example for audio surveillance, and improves the quality of voice extraction especially in long-distance whispered-conversation scenarios. The monitoring system may comprise an image acquisition component, a sound acquisition component and a processor chip, which may be integrated in one device or distributed across several devices.
The image acquisition component includes, for example, a lens and an image sensor; the sound acquisition component may be a microphone array including at least two microphones. The arrangement of the microphone array may be set as required, e.g. circular, polygonal or spiral.
As shown in fig. 1, the image acquisition component is, for example, a camera 1; the microphone array includes four microphones 2 arranged in a circular array; and the camera and the microphone array are held together by a fixing component 3.
In the method, the current scene mode is determined from the captured image data. If the scene is in the whisper mode, the DOA is estimated from the image data with high accuracy, and the voice output signal of the target sound source is then extracted from the output signals of beams pointing in different directions according to the estimated DOA.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a flowchart illustrating a speech extraction method according to an embodiment of the present invention. As shown in fig. 2, the method provided by this embodiment includes:
Step 101, acquiring an image at a target sound source.
Whether someone is whispering is judged from the image information captured by the image acquisition component. Whispering is generally accompanied by distinctive body-language cues, such as two heads leaning close together. For example, if two adjacent faces are detected very close to each other in the image, it is highly likely that the two people are whispering.
The image acquisition component can capture multiple images at different positions and angles in the current scene, and the target sound source is localized according to the captured images.
Suppose that the face detection algorithm detects N faces in total, with the pixel positions of the center points of the face regions denoted (x_i, y_i), i = 1, 2, …, N. From the pixel distance between adjacent faces in one image, it can be judged whether two people are whispering to each other.
In one implementation, a pixel distance between adjacent faces in the image may be determined;
if the pixel distance is smaller than the preset threshold, the operation of step 102 is executed.
The pixel distance between adjacent faces can be determined as follows:
determining the pixel position of the central point of the adjacent face in the image;
and determining the pixel distance between the adjacent human faces according to the pixel positions of the central points of the adjacent human faces in the image.
Specifically, the pixel distance between the adjacent i-th and j-th faces may be expressed as

d_ij = sqrt((x_i − x_j)² + (y_i − y_j)²)
In other embodiments, the pixel distance may be calculated by other methods, which is not limited in this application.
When the pixel distance between the center points of the regions of two adjacent faces is below a preset threshold ε, it can be judged that the two heads are very close and the two people are likely having a whispered conversation. In this case voice extraction is performed in the whisper mode. The decision process can be expressed as:
I(i, j) = 1 if d_ij < ε, and I(i, j) = 0 otherwise

An indicator value of 1 triggers the whisper mode; a value of 0 triggers the normal mode.
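To make the decision concrete, here is a minimal Python sketch that computes the pairwise pixel distances of detected face centers and flags whisper pairs. The function name, the array layout and the threshold value are illustrative assumptions, not part of the patent.

```python
import numpy as np

def detect_whisper_pairs(face_centers, eps=40.0):
    """Return index pairs (i, j) of faces whose center distance d_ij < eps.

    face_centers: sequence of (x_i, y_i) pixel coordinates of face centers.
    eps: preset pixel-distance threshold epsilon (40.0 is illustrative).
    """
    centers = np.asarray(face_centers, dtype=float)
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d_ij = np.hypot(*(centers[i] - centers[j]))  # Euclidean distance
            if d_ij < eps:  # indicator I(i, j) = 1 -> whisper mode
                pairs.append((i, j))
    return pairs

# Faces 0 and 1 are close enough to trigger the whisper mode; face 2 is not.
print(detect_whisper_pairs([(100, 120), (130, 125), (400, 118)]))  # [(0, 1)]
```

For each detected pair, the midpoint of the two face centers then serves as the pixel position of the target sound source, as described in step 102 below.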
Step 102, determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
Specifically, if it is detected that the ith and jth adjacent faces are close to each other and the whisper mode is triggered, the pixel position of the target sound source in the image may be determined as follows:
determining the pixel position of a target sound source in an image according to the pixel position of the central point of each face in adjacent faces of the image; the distance between the adjacent faces is smaller than a preset threshold value.
In one implementation, the midpoint of the two center points of the adjacent faces of the image may be determined as the pixel position of the target sound source in the image.
Alternatively, the pixel position of the center point of either of the adjacent faces may be used as the pixel position of the target sound source in the image, or some other pixel position between the two center points may be used; the present application does not limit this.
The pixel position (x_s, y_s) of the target sound source in the image can be calculated, for example, by the following formula:

(x_s, y_s) = ((x_i + x_j)/2, (y_i + y_j)/2)
The DOA of the target sound source is then determined according to its pixel position in the image, for example using the imaging principle, from the geometric relationship between the pixel position of the target sound source in the image and its spatial position.
In short, in the whisper mode the speech signal-to-noise ratio is low, and a DOA estimated with a conventional sound source localization algorithm would be inaccurate.
Step 103, extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
Specifically, after the DOA is determined, noise suppression may be achieved through beamforming based on the DOA result, and a speech output signal may be extracted.
As shown in FIG. 3, the space may be divided into N regions by azimuth, the central azimuth of the l-th region being φ_l, l = 1, …, N, with N being, for example, 6. The microphone array in FIG. 3 includes four microphones in a circular array. The target beam corresponding to the target sound source is determined by computing the weights corresponding to the N beams, and the voice output signal is extracted.
The method of this embodiment acquires an image at a target sound source; determines the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and extracts a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams point in different directions with respect to the microphone array and N ≥ 2. When the signal-to-noise ratio of the voice signal is low, and especially in long-distance whispered-conversation scenarios, determining the DOA of the target sound source from the image information improves the accuracy of DOA estimation, and extracting the voice output signal according to this DOA improves the quality of the extracted voice signal.
On the basis of the above embodiment, in another embodiment, the step 102 of determining the DOA of the target sound source may be implemented by:
and determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens in the image acquisition assembly and the image sensor, the pixel position of the central point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
As shown in fig. 4, let f1 denote the distance between the lens and the image sensor, (x_0, y_0) the pixel position corresponding to the center point of the lens, and Δd the distance between adjacent photosensitive elements. The pitch angle of the target sound source can then be calculated by

θ_s = arctan((y_s − y_0)·Δd / f1)

In other embodiments, variants of this formula can be used. The azimuth angle φ_s of the target sound source is obtained by a similar method, for example

φ_s = arctan((x_s − x_0)·Δd / f1)

thus giving the DOA (θ_s, φ_s).
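Under the pinhole model just described, the pixel-to-DOA mapping can be sketched as follows; the function name and the numeric sensor parameters are illustrative assumptions.

```python
import numpy as np

def pixel_to_doa(xs, ys, x0, y0, f1, delta_d):
    """DOA (pitch, azimuth) in radians from a target pixel position.

    (xs, ys): target pixel, (x0, y0): pixel position of the lens center,
    f1: lens-to-sensor distance, delta_d: photosensitive-element spacing.
    """
    pitch = np.arctan((ys - y0) * delta_d / f1)
    azimuth = np.arctan((xs - x0) * delta_d / f1)
    return pitch, azimuth

# Illustrative parameters: f1 = 4 mm, 3 um pixel pitch, 1920x1080 sensor.
theta, phi = pixel_to_doa(xs=980, ys=620, x0=960, y0=540, f1=4e-3, delta_d=3e-6)
print(np.degrees(theta), np.degrees(phi))
```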
As shown in fig. 5, audio-based DOA estimation is susceptible to the signal-to-noise ratio and other factors, whereas the video signal is unaffected by the signal-to-noise ratio of the speech. Therefore, if the normal mode is triggered (the pixel distance between adjacent faces in the image is greater than the preset threshold), indicating a high speech signal-to-noise ratio, the DOA of the target sound source can be obtained with a conventional sound source localization algorithm (e.g. SRP or MUSIC). If the whisper mode is triggered (the pixel distance between adjacent faces in the image is smaller than the preset threshold), indicating a low signal-to-noise ratio, the DOA of the sound source can be estimated by the method of the above embodiment, i.e. derived from the pixel information of the image.
In the normal mode, the speech signal-to-noise ratio is high, and VAD is used to judge whether voice activity is present. If there is no voice, the raw waveform captured by the microphone array is output; otherwise the DOA is estimated with a sound source localization algorithm and a beamformer is constructed, suppressing environmental noise and extracting the voice output signal.
Further, the weights of the N beams are obtained according to the noise distribution and the DOA information, and the voice output signal of the target sound source is extracted.
In one embodiment, step 103 may be implemented as follows:
determining weights corresponding to the N wave beams according to the DOA;
determining output signals of the N wave beams according to weights corresponding to the N wave beams and voice signals received by the microphone array;
and acquiring a voice output signal of the target sound source according to the output signals of the N wave beams.
Specifically, the weights of the N beams are calculated from the DOA obtained in the foregoing embodiment, and the target voice is extracted. A possible implementation of extracting the speech signal is described below, taking diffuse noise as an example.
The specific process is detailed as follows:
the scattering noise is uniformly distributed in space, which means that: and noise power in all directions is equal by taking the microphone array as a reference point. Assuming that the number of microphones is M, for a diffuse noise field, the correlation coefficient for channel i and channel j at frequency f can be calculated as:
Figure BDA0002404965860000071
lijdenotes the linear distance of the channels i and j, c denotes the speed of sound, Ωij(f) Represents the corresponding elements of the covariance matrix omega (f) in the ith row and jth column. Wherein, the channel i is a channel corresponding to the ith microphone; the channel j is a channel corresponding to the jth microphone.
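A short sketch of building Ω(f) for a diffuse field follows; the circular four-microphone geometry mirrors FIG. 1, but the radius is an illustrative assumption. Note that numpy's sinc is the normalized sinc, sin(πx)/(πx), so the argument omits the factor π.

```python
import numpy as np

def diffuse_noise_covariance(mic_positions, f, c=343.0):
    """Omega(f) with Omega_ij = sin(2*pi*f*l_ij/c) / (2*pi*f*l_ij/c)."""
    pos = np.asarray(mic_positions, dtype=float)
    l = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)  # l_ij
    return np.sinc(2.0 * f * l / c)  # np.sinc(x) = sin(pi*x)/(pi*x)

# Four microphones on a 3 cm-radius circle, evaluated at f = 1 kHz.
r, ang = 0.03, np.deg2rad([0.0, 90.0, 180.0, 270.0])
mics = np.stack([r * np.cos(ang), r * np.sin(ang), np.zeros(4)], axis=1)
print(diffuse_noise_covariance(mics, f=1000.0).round(3))
```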
In one embodiment, weights corresponding to the N beams are determined according to the covariance matrix and steering vectors corresponding to the N beams.
The steering vectors corresponding to the N beams may be determined according to a pitch angle included in the DOA and a central azimuth angle of a spatial region corresponding to each of the N beams.
Suppose a certain beam corresponds to a DOA (θ_s, φ_l), where θ_s is the pitch angle included in the DOA and φ_l is the central azimuth of the spatial region corresponding to the l-th beam. From (θ_s, φ_l), the time delay of each microphone in the microphone array relative to the reference microphone is calculated. The steering vector of the l-th beam may then be

a_l(f) = [e^(−j2πf·τ_1), e^(−j2πf·τ_2), …, e^(−j2πf·τ_M)]^T

where τ_i denotes the delay of the i-th microphone relative to the reference microphone; it is uniquely determined by the sound source direction and the array geometry. The reference microphone is one of the M microphones, for example the microphone that first receives the voice signal.
The weight w_l(f) can be calculated, for example, by

w_l(f) = Ω^(−1)(f)·a_l(f) / (a_l^H(f)·Ω^(−1)(f)·a_l(f))

Applying w_l(f) to the input multichannel audio data (i.e. the vector of voice signals received by the M microphones) achieves target speech enhancement and environmental noise suppression.
Assume that at a certain time-frequency point (t, f) the received voice signal vector of the microphone array is x(t, f); the output signals of the N beams are then y_l(t, f) = w_l^H(f)·x(t, f), l = 1, 2, …, N. The voice output signal of the target sound source is obtained from the output signals of the N beams: for example, the target beam among the N beams is determined from the DOA of the target sound source, and the output signal of that target beam is taken. Further, the output signal of the target beam may be enhanced, for example multiplied by a gain, which may be a fixed preset value or calculated by other means.
Here, x(t, f) may be a signal vector obtained by framing the signals and transforming them to the frequency domain, for example by a short-time Fourier transform.
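The pieces above can be tied together at a single time-frequency point as in the sketch below: steering vectors from per-microphone delays, the weight formula, and the beam outputs y_l(t, f) = w_l^H(f)·x(t, f). The delay values and the identity covariance are illustrative placeholders, not values from the patent.

```python
import numpy as np

def steering_vector(tau, f):
    """a_l(f) = [exp(-j*2*pi*f*tau_i)] from per-microphone delays tau."""
    return np.exp(-2j * np.pi * f * np.asarray(tau))

def beam_weights(omega, a):
    """w_l(f) = Omega^-1(f) a_l(f) / (a_l^H(f) Omega^-1(f) a_l(f))."""
    sol = np.linalg.solve(omega, a)
    return sol / (a.conj() @ sol)

def beam_outputs(weights, x):
    """y_l(t, f) = w_l(f)^H x(t, f); weights: (N, M), x: (M,)."""
    return weights.conj() @ x

f = 1000.0                                     # one frequency bin (Hz)
omega = np.eye(4)                              # placeholder for Omega(f)
taus = [[0.0, 1e-4, 2e-4, 3e-4],               # illustrative per-beam delays
        [0.0, -1e-4, -2e-4, -3e-4]]
W = np.stack([beam_weights(omega, steering_vector(t, f)) for t in taus])
x = steering_vector(taus[0], f)                # a source aligned with beam 0
print(np.abs(beam_outputs(W, x)))              # the beam-0 output dominates
```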
In the method of this embodiment, owing to the accuracy of the DOA estimation, the weights corresponding to the N beams are determined according to the DOA; the output signals of the N beams are determined according to these weights and the voice signals received by the microphone array; and the voice output signal of the target sound source is obtained from the output signals of the N beams, thereby improving the quality of the extracted voice signal.
The above scheme has low complexity and is easy to implement, but it places high demands on DOA estimation accuracy. In other embodiments, to improve the stability of the algorithm, the weights may be determined as follows:
determining the weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix is the covariance matrix of the microphone array for diffuse noise at frequency f.
The weights can be determined, for example, by

w_l(f) = Ω_ε^(−1)(f)·a_l(f) / (a_l^H(f)·Ω_ε^(−1)(f)·a_l(f)), with Ω_ε(f) = Ω(f) + ε·I

where the diagonal loading coefficient ε controls the white-noise gain and the beam width of the beamformer. Considering factors such as DOA errors and microphone mismatch, ε must be chosen so that the beam has a good white-noise gain and an appropriate beam width; ε may be set according to the actual requirements.
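Diagonal loading is a one-line change to the weight computation of the previous sketch; the value of ε below is an illustrative assumption.

```python
import numpy as np

def beam_weights_loaded(omega, a, eps=1e-2):
    """w_l(f) = Omega_eps^-1 a_l / (a_l^H Omega_eps^-1 a_l),
    with Omega_eps(f) = Omega(f) + eps*I (eps = 1e-2 is illustrative)."""
    omega_eps = omega + eps * np.eye(omega.shape[0])
    sol = np.linalg.solve(omega_eps, a)
    return sol / (a.conj() @ sol)
```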
By azimuth, the space can be divided into N_b regions with central azimuths φ_i, i = 1, 2, …, N_b, corresponding respectively to N_b beamformers w_i(f). Assuming that the received signal vector of the microphone array at a time-frequency point (t, f) is x(t, f), the outputs of the N_b beams can be expressed as y_i(t, f) = w_i^H(f)·x(t, f), i = 1, 2, …, N_b.
In an embodiment, in order to improve the quality of the extracted target speech, the following method may be adopted:
determining the probability of the target wave beam having the voice of the target sound source according to the output signal of the target wave beam corresponding to the target sound source and the output signals of the N wave beams; the target beam is one of the N beams;
determining a third post-processing gain according to the first post-processing gain of the target wave beam with voice, the second post-processing gain of the target wave beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
Specifically, assume that the second post-processing gain, for the case that no speech of the target sound source is present in the target beam, is a preset fixed value G_min, and that the first post-processing gain for the case that speech is present is G_s, where G_s can be obtained with a classical noise-reduction algorithm. The overall third post-processing gain can then be calculated as

G = p·G_s + (1 − p)·G_min

where p denotes the probability that the voice of the target sound source is present in the target beam.
Generally, if the energy of a certain beam is large, the voice of the target sound source is likely to lie in that beam; that is, the speech presence probability p(t, f) is strongly correlated with the energies of the beams. Assume the DOA of the target sound source is (θ_s, φ_s) and that the corresponding target beam has central azimuth φ_s, with output y_s(t, f). Then, at time-frequency point (t, f), the probability that speech is present in the target beam can be calculated as

p(t, f) = |y_s(t, f)|² / Σ_{i=1..N_b} |y_i(t, f)|²

The final speech output signal may be y_o(t, f) = G·y_s(t, f).
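A sketch of this post-processing step, assuming the energy-ratio form of p(t, f) reconstructed above; the gain values are illustrative.

```python
import numpy as np

def postfilter_gain(y_beams, target_idx, g_s, g_min=0.1):
    """G = p*G_s + (1 - p)*G_min with p = |y_s|^2 / sum_i |y_i|^2.

    y_beams: complex beam outputs y_i(t, f) at one time-frequency point.
    g_s: speech-presence gain from a noise-reduction algorithm;
    g_min: preset speech-absence floor (0.1 is an illustrative value).
    """
    energies = np.abs(np.asarray(y_beams)) ** 2
    p = energies[target_idx] / energies.sum()
    return p * g_s + (1.0 - p) * g_min

# The target beam (index 0) carries most of the energy, so G is close to G_s.
y = [1.0 + 0.0j, 0.2 + 0.0j, 0.1 + 0.0j]
G = postfilter_gain(y, target_idx=0, g_s=0.9)
print(G, G * y[0])  # gain G and the final output y_o(t, f) = G * y_s(t, f)
```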
According to the method, the probability that the voice of the target sound source is present in the target beam is determined from the output signal of the target beam corresponding to the target sound source and the output signals of the N beams, the target beam being one of the N beams; the third post-processing gain is determined from the first post-processing gain for speech presence in the target beam, the second post-processing gain for speech absence, and this probability; and the voice output signal of the target sound source is determined according to the third post-processing gain, which can further improve the quality of the extracted voice signal.
Fig. 6 is a structural diagram of an embodiment of a speech extraction device provided in the present invention, and as shown in fig. 6, the speech extraction device of the embodiment includes:
an obtaining module 601, configured to obtain an image at a target sound source;
a determining module 602, configured to determine, according to a pixel position of the target sound source in the image, a direction of arrival DOA of the target sound source;
a processing module 603, configured to extract a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
In a possible implementation manner, the determining module 602 is specifically configured to:
determining the pixel distance between adjacent human faces in the image;
and if the pixel distance is smaller than a preset threshold, performing the operation of determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
In a possible implementation manner, the determining module 602 is specifically configured to:
determining the pixel position of the central point of the adjacent face in the image;
and determining the pixel distance between the adjacent human faces according to the pixel positions of the central points of the adjacent human faces in the image.
In a possible implementation manner, the determining module 602 is further configured to:
determining the pixel position of the target sound source in the image according to the pixel position of the central point of each face in the adjacent faces of the image; and the distance between the adjacent faces is smaller than a preset threshold value.
In a possible implementation manner, the determining module 602 is specifically configured to:
and determining the midpoint of the two center points of the adjacent faces of the image as the pixel position of the target sound source in the image.
In a possible implementation manner, the determining module 602 is specifically configured to:
and determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens and an image sensor in an image acquisition assembly, the pixel position of the central point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
In a possible implementation manner, the processing module 603 is specifically configured to:
determining the weights corresponding to the N beams according to the DOA;
determining the output signals of the N beams according to the weights corresponding to the N beams and the voice signals received by the microphone array;
and acquiring the voice output signal of the target sound source according to the output signals of the N beams.
In a possible implementation manner, the processing module 603 is specifically configured to:
determining the probability of the target beam having the voice of the target sound source according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams; the target beam is one of the N beams;
determining a third post-processing gain according to the first post-processing gain of the target beam with voice, the second post-processing gain of the target beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
In a possible implementation manner, the processing module 603 is specifically configured to:
determining the steering vectors corresponding to the N beams according to the pitch angle included in the DOA and the central azimuths of the spatial regions corresponding to the N beams;
determining the weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix is the covariance matrix of the microphone array for diffuse noise at frequency f.
In a possible implementation manner, the processing module 603 is configured to:
determining a target beam corresponding to the target sound source according to the azimuth included in the DOA and the central azimuths of the spatial regions corresponding to the N beams;
and determining an output signal of a target beam corresponding to the target sound source according to the weight corresponding to the target beam corresponding to the target sound source and the voice signal received by the microphone array.
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a structural diagram of an embodiment of an electronic device provided in the present invention, and as shown in fig. 7, the electronic device includes:
a processor 701, a microphone array 702, and an image acquisition component 703, wherein optionally, a memory storing executable instructions of the processor 701 may be further included.
The image acquisition component 703 is used to acquire images. Microphone array 702 is used to collect speech signals.
The above components may communicate over one or more buses.
The processor 701 is configured to execute the corresponding method in the foregoing method embodiment by executing the executable instruction, and the specific implementation process of the method may refer to the foregoing method embodiment, which is not described herein again.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method in the foregoing method embodiment is implemented.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method of speech extraction, comprising:
acquiring an image at a target sound source;
determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
2. The method according to claim 1, wherein before determining the DOA of the target sound source according to the pixel position of the target sound source in the image, further comprising:
determining the pixel distance between adjacent human faces in the image;
and if the pixel distance is smaller than a preset threshold, performing the operation of determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
3. The method of claim 2, wherein determining the pixel distance between adjacent faces in the image comprises:
determining the pixel position of the central point of the adjacent face in the image;
and determining the pixel distance between the adjacent human faces according to the pixel positions of the central points of the adjacent human faces in the image.
4. The method according to any one of claims 1-3, wherein before determining the DOA of the target sound source according to the pixel position of the target sound source in the image, further comprising:
determining the pixel position of the target sound source in the image according to the pixel position of the central point of each face in the adjacent faces of the image; and the distance between the adjacent faces is smaller than a preset threshold value.
5. The method according to claim 4, wherein the determining the pixel position of the target sound source in the image according to the pixel position of the central point of each face in the adjacent faces of the image comprises:
determining the midpoint of the two center points of the adjacent faces of the image as the pixel position of the target sound source in the image.
6. The method according to any one of claims 1-3, wherein said determining the DOA of the target sound source based on the pixel position of the target sound source in the image comprises:
and determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens and an image sensor in an image acquisition assembly, the pixel position of the central point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
7. A method according to any of claims 1-3, wherein said extracting a speech output signal of a target sound source from said DOA and preset output signals of N beams comprises:
determining weights corresponding to the N beams according to the DOA;
determining output signals of the N beams according to the weights corresponding to the N beams and voice signals received by the microphone array;
and acquiring a voice output signal of the target sound source according to the output signals of the N beams.
8. The method according to claim 7, wherein said obtaining the voice output signal of the target sound source according to the output signals of the N beams comprises:
determining the probability of the target beam having the voice of the target sound source according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams; the target beam is one of the N beams;
determining a third post-processing gain according to the first post-processing gain of the target beam with voice, the second post-processing gain of the target beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
9. The method of claim 7 wherein the DOA comprises a pitch angle and an azimuth angle of the target sound source, and wherein determining the weights for the N beams based on the DOA comprises:
determining steering vectors corresponding to the N beams according to the pitch angle included in the DOA and the central azimuth angles of the spatial regions corresponding to the N beams respectively;
determining weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix represents the covariance matrix of the microphone array for diffuse noise at frequency f.
10. The method according to claim 8, wherein determining the probability that the voice of the target sound source is in the target beam according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams further comprises:
determining a target beam corresponding to the target sound source according to the azimuth included in the DOA and the central azimuth of the spatial region corresponding to each of the N beams;
and determining an output signal of a target beam corresponding to the target sound source according to the weight corresponding to the target beam corresponding to the target sound source and the voice signal received by the microphone array.
11. A speech extraction device, comprising:
the acquisition module is used for acquiring an image at a target sound source;
the determining module is used for determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
the processing module is used for extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
13. An electronic device, comprising:
the system comprises a processor, a microphone array and an image acquisition assembly;
the image acquisition assembly is used for acquiring an image;
the microphone array is used for receiving a voice signal;
the processor is configured to perform the method of any one of claims 1-10.
CN202010158648.7A 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium Active CN113450769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010158648.7A CN113450769B (en) 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010158648.7A CN113450769B (en) 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113450769A true CN113450769A (en) 2021-09-28
CN113450769B CN113450769B (en) 2024-06-25

Family

ID=77806277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010158648.7A Active CN113450769B (en) 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113450769B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080240463A1 (en) * 2007-03-29 2008-10-02 Microsoft Corporation Enhanced Beamforming for Arrays of Directional Microphones
CN107534725A (en) * 2015-05-19 2018-01-02 华为技术有限公司 A kind of audio signal processing method and device
CN105847584A (en) * 2016-05-12 2016-08-10 歌尔声学股份有限公司 Method for intelligent device to identify private conversations
US20170345437A1 (en) * 2016-05-27 2017-11-30 Fu Tai Hua Industry (Shenzhen) Co., Ltd. Voice receiving method and device
CN110248197A (en) * 2018-03-07 2019-09-17 杭州海康威视数字技术股份有限公司 Sound enhancement method and device
CN108957392A (en) * 2018-04-16 2018-12-07 深圳市沃特沃德股份有限公司 Sounnd source direction estimation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Yiyuan et al., "Speech enhancement and interference suppression algorithm based on a microphone array" (基于麦克风阵列的语音增强与干扰抑制算法), Audio Engineering (电声技术), no. 02, 5 February 2018 (2018-02-05), pages 4-8 *

Also Published As

Publication number Publication date
CN113450769B (en) 2024-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant