CN113450769A - Voice extraction method, device, equipment and storage medium - Google Patents

Voice extraction method, device, equipment and storage medium

Info

Publication number
CN113450769A
CN113450769A (application CN202010158648.7A; granted as CN113450769B)
Authority
CN
China
Prior art keywords
sound source
target sound
image
determining
doa
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010158648.7A
Other languages
Chinese (zh)
Other versions
CN113450769B (en)
Inventor
童仁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202010158648.7A priority Critical patent/CN113450769B/en
Publication of CN113450769A publication Critical patent/CN113450769A/en
Application granted granted Critical
Publication of CN113450769B publication Critical patent/CN113450769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/24 - Speech recognition using non-acoustical features
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 - Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention provides a voice extraction method, apparatus, device and storage medium. The method comprises the following steps: acquiring an image at a target sound source; determining the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are preset beams pointing in different directions with respect to the microphone array and N ≥ 2. According to the embodiments of the invention, when the signal-to-noise ratio of the voice signal is low, and especially in long-distance whispered-conversation scenarios, determining the DOA of the target sound source from the information of the image at the target sound source improves the accuracy of DOA estimation and thus the quality of the extracted voice signal.

Description

Voice extraction method, device, equipment and storage medium
Technical Field
The present invention relates to the field of audio signal processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting speech.
Background
At present, long-distance sound pickup is in wide demand. For example, covert pickup is required in some surveillance scenarios. However, current long-distance pickup technology cannot achieve the effect of short-distance pickup.
In the related art, microphone array technology is used to design fixed beams pointing at multiple azimuth angles and to track the energy minimum within each beam. The minimum-tracking results of all beams are combined to detect the target beam in which the sound source lies, and a beamforming algorithm is then used to suppress environmental noise and extract the voice output signal. However, in long-distance, low signal-to-noise-ratio scenarios, estimating the target beam from the beam energy minima alone is error-prone, and the quality of the extracted voice output signal is poor.
Disclosure of Invention
The invention provides a voice extraction method, apparatus, device and storage medium for improving the quality of voice extraction.
In a first aspect, the present invention provides a speech extraction method, including:
acquiring an image at a target sound source;
determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
In a second aspect, the present invention provides a speech extraction apparatus, comprising:
the acquisition module is used for acquiring an image at a target sound source;
the determining module is used for determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
the processing module is used for extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
In a third aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method of any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of the first aspects via execution of the executable instructions.
The voice extraction method, apparatus, device and storage medium provided by the embodiments of the invention acquire an image at a target sound source; determine the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and extract a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams are preset beams pointing in different directions with respect to the microphone array and N ≥ 2. When the signal-to-noise ratio of the voice signal is low, and especially in long-distance whispered-conversation scenarios, determining the DOA of the target sound source from the image information improves the accuracy of DOA estimation, and extracting the voice output signal of the target sound source according to this DOA improves the quality of the extracted voice signal.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of a principle implementation provided by an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech extraction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of beam forming according to an embodiment of the method provided by the present invention;
FIG. 4 is a schematic diagram of the imaging principle of an embodiment of the method provided by the present invention;
FIG. 5 is a schematic flow chart of another embodiment of the method provided by the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of a speech extraction apparatus provided in the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms "comprising" and "having," and any variations thereof, in the description and claims of this invention and the drawings described herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
First, the terms and application scenarios involved in the present invention are introduced:
microphone array: a plurality of microphones arranged in a geometric shape. Each microphone is generally non-directional and there is good uniformity in frequency response between microphones.
Direction Of Arrival (DOA): the direction of the plane wave reaching the microphone array. The position of the radiation source is estimated by measuring the direction of arrival of the radiation signal.
Beam forming: weighted summation of the audio signals output by a plurality of microphones to obtain an enhanced voice signal.
Scattering (diffuse) noise: a noise field whose power is equal in every direction.
Voice Activity Detection (VAD) algorithm: detects whether a given piece of audio contains human voice activity.
The method provided by the embodiments of the invention is applied to an intelligent monitoring system, for example for audio surveillance, and improves the quality of voice extraction especially in long-distance whispered-conversation scenarios. The monitoring system may comprise an image acquisition component, a sound acquisition component and a processor chip, which may be integrated in one device or distributed across several devices.
The image acquisition component includes, for example, a lens and an image sensor; the sound acquisition component may be a microphone array including at least two microphones. The arrangement of the microphone array may be set as required, e.g. circular, polygonal or spiral.
As shown in fig. 1, the image acquisition component is, for example, a camera 1; the microphone array includes four microphones 2 arranged in a circular array; and the camera and the microphone array are held together by a fixing component 3.
In the method, the current scene mode is determined from the captured image data. If the scene is in the whisper mode, the DOA is estimated from the image data with high accuracy, and the voice output signal of the target sound source is then extracted from the output signals of beams pointing in different directions according to the estimated DOA.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a flowchart illustrating a speech extraction method according to an embodiment of the present invention. As shown in fig. 2, the method provided by this embodiment includes:
Step 101, acquiring an image at a target sound source.
Whether someone is whispering is judged from the image information captured by the image acquisition component. Whispering is generally accompanied by distinctive body-language cues, such as two heads leaning close together. For example, if two adjacent faces are detected very close to each other in the image, it is highly likely that the two people are whispering.
The image acquisition component can capture multiple images at different positions and angles in the current scene, and the target sound source is localized according to the captured images.
Suppose that the face detection algorithm detects N faces in total, with the pixel positions of the center points of the face regions denoted (x_i, y_i), i = 1, 2, …, N. From the pixel distance between adjacent faces in one image, it can be judged whether two people are whispering to each other.
In one implementation, a pixel distance between adjacent faces in the image may be determined;
if the pixel distance is smaller than the preset threshold, the operation of step 102 is executed.
The pixel distance between adjacent faces can be determined as follows:
determining the pixel position of the central point of the adjacent face in the image;
and determining the pixel distance between the adjacent human faces according to the pixel positions of the central points of the adjacent human faces in the image.
Specifically, the pixel distance between the adjacent i-th and j-th faces may be expressed as

d_ij = sqrt((x_i − x_j)² + (y_i − y_j)²)
In other embodiments, the pixel distance may be calculated by other methods, which is not limited in this application.
When the pixel distance between the center points of the regions of two adjacent faces is below a preset threshold ε, it can be judged that the two heads are very close and the two people are likely having a whispered conversation. In this case voice extraction is performed in the whisper mode. The decision process can be expressed as:
I(i, j) = 1 if d_ij < ε, and I(i, j) = 0 otherwise

An indicator value of 1 triggers the whisper mode; a value of 0 triggers the normal mode.
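To make the decision concrete, here is a minimal Python sketch that computes the pairwise pixel distances of detected face centers and flags whisper pairs. The function name, the array layout and the threshold value are illustrative assumptions, not part of the patent.

```python
import numpy as np

def detect_whisper_pairs(face_centers, eps=40.0):
    """Return index pairs (i, j) of faces whose center distance d_ij < eps.

    face_centers: sequence of (x_i, y_i) pixel coordinates of face centers.
    eps: preset pixel-distance threshold epsilon (40.0 is illustrative).
    """
    centers = np.asarray(face_centers, dtype=float)
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d_ij = np.hypot(*(centers[i] - centers[j]))  # Euclidean distance
            if d_ij < eps:  # indicator I(i, j) = 1 -> whisper mode
                pairs.append((i, j))
    return pairs

# Faces 0 and 1 are close enough to trigger the whisper mode; face 2 is not.
print(detect_whisper_pairs([(100, 120), (130, 125), (400, 118)]))  # [(0, 1)]
```

For each detected pair, the midpoint of the two face centers then serves as the pixel position of the target sound source, as described in step 102 below.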
Step 102, determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
Specifically, if it is detected that the ith and jth adjacent faces are close to each other and the whisper mode is triggered, the pixel position of the target sound source in the image may be determined as follows:
determining the pixel position of a target sound source in an image according to the pixel position of the central point of each face in adjacent faces of the image; the distance between the adjacent faces is smaller than a preset threshold value.
In one implementation, the midpoint of the two center points of the adjacent faces of the image may be determined as the pixel position of the target sound source in the image.
Alternatively, the pixel position of the center point of either of the adjacent faces may be used as the pixel position of the target sound source in the image, or some other pixel position between the two center points may be used; the present application does not limit this.
The pixel position (x_s, y_s) of the target sound source in the image can be calculated, for example, by the following formula:

(x_s, y_s) = ((x_i + x_j)/2, (y_i + y_j)/2)
The DOA of the target sound source is then determined according to its pixel position in the image, for example using the imaging principle, from the geometric relationship between the pixel position of the target sound source in the image and its spatial position.
In short, in the whisper mode the speech signal-to-noise ratio is low, and a DOA estimated with a conventional sound source localization algorithm would be inaccurate.
Step 103, extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
Specifically, after the DOA is determined, noise suppression may be achieved through beamforming based on the DOA result, and a speech output signal may be extracted.
As shown in FIG. 3, the space may be divided into N regions by azimuth, the central azimuth of the l-th region being φ_l, l = 1, …, N, with N being, for example, 6. The microphone array in FIG. 3 includes four microphones in a circular array. The target beam corresponding to the target sound source is determined by computing the weights corresponding to the N beams, and the voice output signal is extracted.
The method of this embodiment acquires an image at a target sound source; determines the direction of arrival (DOA) of the target sound source according to the pixel position of the target sound source in the image; and extracts a voice output signal of the target sound source according to the DOA and the output signals of N preset beams, where the N beams point in different directions with respect to the microphone array and N ≥ 2. When the signal-to-noise ratio of the voice signal is low, and especially in long-distance whispered-conversation scenarios, determining the DOA of the target sound source from the image information improves the accuracy of DOA estimation, and extracting the voice output signal according to this DOA improves the quality of the extracted voice signal.
On the basis of the above embodiment, in another embodiment, the step 102 of determining the DOA of the target sound source may be implemented by:
and determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens in the image acquisition assembly and the image sensor, the pixel position of the central point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
As shown in fig. 4, let f1 denote the distance between the lens and the image sensor, (x_0, y_0) the pixel position corresponding to the center point of the lens, and Δd the distance between adjacent photosensitive elements. The pitch angle of the target sound source can then be calculated by

θ_s = arctan((y_s − y_0)·Δd / f1)

In other embodiments, variants of this formula can be used. The azimuth angle φ_s of the target sound source is obtained by a similar method, for example

φ_s = arctan((x_s − x_0)·Δd / f1)

thus giving the DOA (θ_s, φ_s).
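Under the pinhole model just described, the pixel-to-DOA mapping can be sketched as follows; the function name and the numeric sensor parameters are illustrative assumptions.

```python
import numpy as np

def pixel_to_doa(xs, ys, x0, y0, f1, delta_d):
    """DOA (pitch, azimuth) in radians from a target pixel position.

    (xs, ys): target pixel, (x0, y0): pixel position of the lens center,
    f1: lens-to-sensor distance, delta_d: photosensitive-element spacing.
    """
    pitch = np.arctan((ys - y0) * delta_d / f1)
    azimuth = np.arctan((xs - x0) * delta_d / f1)
    return pitch, azimuth

# Illustrative parameters: f1 = 4 mm, 3 um pixel pitch, 1920x1080 sensor.
theta, phi = pixel_to_doa(xs=980, ys=620, x0=960, y0=540, f1=4e-3, delta_d=3e-6)
print(np.degrees(theta), np.degrees(phi))
```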
As shown in fig. 5, audio-based DOA estimation is susceptible to the signal-to-noise ratio and other factors, whereas the video signal is unaffected by the signal-to-noise ratio of the speech. Therefore, if the normal mode is triggered (the pixel distance between adjacent faces in the image is greater than the preset threshold), indicating a high speech signal-to-noise ratio, the DOA of the target sound source can be obtained with a conventional sound source localization algorithm (e.g. SRP or MUSIC). If the whisper mode is triggered (the pixel distance between adjacent faces in the image is smaller than the preset threshold), indicating a low signal-to-noise ratio, the DOA of the sound source can be estimated by the method of the above embodiment, i.e. derived from the pixel information of the image.
In the normal mode, the speech signal-to-noise ratio is high, and VAD is used to judge whether voice activity is present. If there is no voice, the raw waveform captured by the microphone array is output; otherwise the DOA is estimated with a sound source localization algorithm and a beamformer is constructed, suppressing environmental noise and extracting the voice output signal.
Further, the weights of the N beams are obtained according to the noise distribution and the DOA information, and the voice output signal of the target sound source is extracted.
In one embodiment, step 103 may be implemented as follows:
determining weights corresponding to the N wave beams according to the DOA;
determining output signals of the N wave beams according to weights corresponding to the N wave beams and voice signals received by the microphone array;
and acquiring a voice output signal of the target sound source according to the output signals of the N wave beams.
Specifically, the weights of the N beams are calculated from the DOA obtained in the foregoing embodiment, and the target voice is extracted. A possible implementation of extracting the speech signal is described below, taking diffuse noise as an example.
The specific process is detailed as follows:
the scattering noise is uniformly distributed in space, which means that: and noise power in all directions is equal by taking the microphone array as a reference point. Assuming that the number of microphones is M, for a diffuse noise field, the correlation coefficient for channel i and channel j at frequency f can be calculated as:
Figure BDA0002404965860000071
lijdenotes the linear distance of the channels i and j, c denotes the speed of sound, Ωij(f) Represents the corresponding elements of the covariance matrix omega (f) in the ith row and jth column. Wherein, the channel i is a channel corresponding to the ith microphone; the channel j is a channel corresponding to the jth microphone.
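A short sketch of building Ω(f) for a diffuse field follows; the circular four-microphone geometry mirrors FIG. 1, but the radius is an illustrative assumption. Note that numpy's sinc is the normalized sinc, sin(πx)/(πx), so the argument omits the factor π.

```python
import numpy as np

def diffuse_noise_covariance(mic_positions, f, c=343.0):
    """Omega(f) with Omega_ij = sin(2*pi*f*l_ij/c) / (2*pi*f*l_ij/c)."""
    pos = np.asarray(mic_positions, dtype=float)
    l = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)  # l_ij
    return np.sinc(2.0 * f * l / c)  # np.sinc(x) = sin(pi*x)/(pi*x)

# Four microphones on a 3 cm-radius circle, evaluated at f = 1 kHz.
r, ang = 0.03, np.deg2rad([0.0, 90.0, 180.0, 270.0])
mics = np.stack([r * np.cos(ang), r * np.sin(ang), np.zeros(4)], axis=1)
print(diffuse_noise_covariance(mics, f=1000.0).round(3))
```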
In one embodiment, weights corresponding to the N beams are determined according to the covariance matrix and steering vectors corresponding to the N beams.
The steering vectors corresponding to the N beams may be determined according to a pitch angle included in the DOA and a central azimuth angle of a spatial region corresponding to each of the N beams.
Suppose a certain beam corresponds to a DOA (θ_s, φ_l), where θ_s is the pitch angle included in the DOA and φ_l is the central azimuth of the spatial region corresponding to the l-th beam. From (θ_s, φ_l), the time delay of each microphone in the microphone array relative to the reference microphone is calculated. The steering vector of the l-th beam may then be

a_l(f) = [e^(−j2πf·τ_1), e^(−j2πf·τ_2), …, e^(−j2πf·τ_M)]^T

where τ_i denotes the delay of the i-th microphone relative to the reference microphone; it is uniquely determined by the sound source direction and the array geometry. The reference microphone is one of the M microphones, for example the microphone that first receives the voice signal.
The weight w_l(f) can be calculated, for example, by

w_l(f) = Ω^(−1)(f)·a_l(f) / (a_l^H(f)·Ω^(−1)(f)·a_l(f))

Applying w_l(f) to the input multichannel audio data (i.e. the vector of voice signals received by the M microphones) achieves target speech enhancement and environmental noise suppression.
Assume that at a certain time-frequency point (t, f) the received voice signal vector of the microphone array is x(t, f); the output signals of the N beams are then y_l(t, f) = w_l^H(f)·x(t, f), l = 1, 2, …, N. The voice output signal of the target sound source is obtained from the output signals of the N beams: for example, the target beam among the N beams is determined from the DOA of the target sound source, and the output signal of that target beam is taken. Further, the output signal of the target beam may be enhanced, for example multiplied by a gain, which may be a fixed preset value or calculated by other means.
Here, x(t, f) may be a signal vector obtained by framing the signals and transforming them to the frequency domain, for example by a short-time Fourier transform.
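The pieces above can be tied together at a single time-frequency point as in the sketch below: steering vectors from per-microphone delays, the weight formula, and the beam outputs y_l(t, f) = w_l^H(f)·x(t, f). The delay values and the identity covariance are illustrative placeholders, not values from the patent.

```python
import numpy as np

def steering_vector(tau, f):
    """a_l(f) = [exp(-j*2*pi*f*tau_i)] from per-microphone delays tau."""
    return np.exp(-2j * np.pi * f * np.asarray(tau))

def beam_weights(omega, a):
    """w_l(f) = Omega^-1(f) a_l(f) / (a_l^H(f) Omega^-1(f) a_l(f))."""
    sol = np.linalg.solve(omega, a)
    return sol / (a.conj() @ sol)

def beam_outputs(weights, x):
    """y_l(t, f) = w_l(f)^H x(t, f); weights: (N, M), x: (M,)."""
    return weights.conj() @ x

f = 1000.0                                     # one frequency bin (Hz)
omega = np.eye(4)                              # placeholder for Omega(f)
taus = [[0.0, 1e-4, 2e-4, 3e-4],               # illustrative per-beam delays
        [0.0, -1e-4, -2e-4, -3e-4]]
W = np.stack([beam_weights(omega, steering_vector(t, f)) for t in taus])
x = steering_vector(taus[0], f)                # a source aligned with beam 0
print(np.abs(beam_outputs(W, x)))              # the beam-0 output dominates
```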
In the method of this embodiment, owing to the accuracy of the DOA estimation, the weights corresponding to the N beams are determined according to the DOA; the output signals of the N beams are determined according to these weights and the voice signals received by the microphone array; and the voice output signal of the target sound source is obtained from the output signals of the N beams, thereby improving the quality of the extracted voice signal.
The above scheme has low complexity and is easy to implement, but it places high demands on DOA estimation accuracy. In other embodiments, to improve the stability of the algorithm, the weights may be determined as follows:
determining the weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix is the covariance matrix of the microphone array for diffuse noise at frequency f.
The weights can be determined, for example, by

w_l(f) = Ω_ε^(−1)(f)·a_l(f) / (a_l^H(f)·Ω_ε^(−1)(f)·a_l(f)), with Ω_ε(f) = Ω(f) + ε·I

where the diagonal loading coefficient ε controls the white-noise gain and the beam width of the beamformer. Considering factors such as DOA errors and microphone mismatch, ε must be chosen so that the beam has a good white-noise gain and an appropriate beam width; ε may be set according to the actual requirements.
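Diagonal loading is a one-line change to the weight computation of the previous sketch; the value of ε below is an illustrative assumption.

```python
import numpy as np

def beam_weights_loaded(omega, a, eps=1e-2):
    """w_l(f) = Omega_eps^-1 a_l / (a_l^H Omega_eps^-1 a_l),
    with Omega_eps(f) = Omega(f) + eps*I (eps = 1e-2 is illustrative)."""
    omega_eps = omega + eps * np.eye(omega.shape[0])
    sol = np.linalg.solve(omega_eps, a)
    return sol / (a.conj() @ sol)
```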
By azimuth, the space can be divided into N_b regions with central azimuths φ_i, i = 1, 2, …, N_b, corresponding respectively to N_b beamformers w_i(f). Assuming that the received signal vector of the microphone array at a time-frequency point (t, f) is x(t, f), the outputs of the N_b beams can be expressed as y_i(t, f) = w_i^H(f)·x(t, f), i = 1, 2, …, N_b.
In an embodiment, in order to improve the quality of the extracted target speech, the following method may be adopted:
determining the probability of the target wave beam having the voice of the target sound source according to the output signal of the target wave beam corresponding to the target sound source and the output signals of the N wave beams; the target beam is one of the N beams;
determining a third post-processing gain according to the first post-processing gain of the target wave beam with voice, the second post-processing gain of the target wave beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
Specifically, assume that the second post-processing gain, for the case that no speech of the target sound source is present in the target beam, is a preset fixed value G_min, and that the first post-processing gain for the case that speech is present is G_s, where G_s can be obtained with a classical noise-reduction algorithm. The overall third post-processing gain can then be calculated as

G = p·G_s + (1 − p)·G_min

where p denotes the probability that the voice of the target sound source is present in the target beam.
Generally, if the energy of a certain beam is large, the voice of the target sound source is likely to lie in that beam; that is, the speech presence probability p(t, f) is strongly correlated with the energies of the beams. Assume the DOA of the target sound source is (θ_s, φ_s) and that the corresponding target beam has central azimuth φ_s, with output y_s(t, f). Then, at time-frequency point (t, f), the probability that speech is present in the target beam can be calculated as

p(t, f) = |y_s(t, f)|² / Σ_{i=1..N_b} |y_i(t, f)|²

The final speech output signal may be y_o(t, f) = G·y_s(t, f).
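A sketch of this post-processing step, assuming the energy-ratio form of p(t, f) reconstructed above; the gain values are illustrative.

```python
import numpy as np

def postfilter_gain(y_beams, target_idx, g_s, g_min=0.1):
    """G = p*G_s + (1 - p)*G_min with p = |y_s|^2 / sum_i |y_i|^2.

    y_beams: complex beam outputs y_i(t, f) at one time-frequency point.
    g_s: speech-presence gain from a noise-reduction algorithm;
    g_min: preset speech-absence floor (0.1 is an illustrative value).
    """
    energies = np.abs(np.asarray(y_beams)) ** 2
    p = energies[target_idx] / energies.sum()
    return p * g_s + (1.0 - p) * g_min

# The target beam (index 0) carries most of the energy, so G is close to G_s.
y = [1.0 + 0.0j, 0.2 + 0.0j, 0.1 + 0.0j]
G = postfilter_gain(y, target_idx=0, g_s=0.9)
print(G, G * y[0])  # gain G and the final output y_o(t, f) = G * y_s(t, f)
```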
According to the method, the probability that the voice of the target sound source is present in the target beam is determined from the output signal of the target beam corresponding to the target sound source and the output signals of the N beams, the target beam being one of the N beams; the third post-processing gain is determined from the first post-processing gain for speech presence in the target beam, the second post-processing gain for speech absence, and this probability; and the voice output signal of the target sound source is determined according to the third post-processing gain, which can further improve the quality of the extracted voice signal.
Fig. 6 is a structural diagram of an embodiment of a speech extraction device provided in the present invention, and as shown in fig. 6, the speech extraction device of the embodiment includes:
an obtaining module 601, configured to obtain an image at a target sound source;
a determining module 602, configured to determine, according to a pixel position of the target sound source in the image, a direction of arrival DOA of the target sound source;
a processing module 603, configured to extract a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
In a possible implementation manner, the determining module 602 is specifically configured to:
determining the pixel distance between adjacent human faces in the image;
and if the pixel distance is smaller than a preset threshold, performing the operation of determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
In a possible implementation manner, the determining module 602 is specifically configured to:
determining the pixel position of the central point of the adjacent face in the image;
and determining the pixel distance between the adjacent human faces according to the pixel positions of the central points of the adjacent human faces in the image.
In a possible implementation manner, the determining module 602 is further configured to:
determining the pixel position of the target sound source in the image according to the pixel position of the central point of each face in the adjacent faces of the image; and the distance between the adjacent faces is smaller than a preset threshold value.
In a possible implementation manner, the determining module 602 is specifically configured to:
and determining the midpoint of the two center points of the adjacent faces of the image as the pixel position of the target sound source in the image.
In a possible implementation manner, the determining module 602 is specifically configured to:
and determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens and an image sensor in an image acquisition assembly, the pixel position of the central point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
In a possible implementation manner, the processing module 603 is specifically configured to:
determining the weights corresponding to the N beams according to the DOA;
determining the output signals of the N beams according to the weights corresponding to the N beams and the voice signals received by the microphone array;
and acquiring the voice output signal of the target sound source according to the output signals of the N beams.
In a possible implementation manner, the processing module 603 is specifically configured to:
determining the probability of the target beam having the voice of the target sound source according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams; the target beam is one of the N beams;
determining a third post-processing gain according to the first post-processing gain of the target beam with voice, the second post-processing gain of the target beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
In a possible implementation manner, the processing module 603 is specifically configured to:
determining the steering vectors corresponding to the N beams according to the pitch angle included in the DOA and the central azimuths of the spatial regions corresponding to the N beams;
determining the weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix is the covariance matrix of the microphone array for diffuse noise at frequency f.
In a possible implementation manner, the processing module 603 is configured to:
determining a target beam corresponding to the target sound source according to the azimuth included in the DOA and the central azimuths of the spatial regions corresponding to the N beams;
and determining an output signal of a target beam corresponding to the target sound source according to the weight corresponding to the target beam corresponding to the target sound source and the voice signal received by the microphone array.
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a structural diagram of an embodiment of an electronic device provided in the present invention, and as shown in fig. 7, the electronic device includes:
a processor 701, a microphone array 702, and an image acquisition component 703, wherein optionally, a memory storing executable instructions of the processor 701 may be further included.
The image acquisition component 703 is used to acquire images. Microphone array 702 is used to collect speech signals.
The above components may communicate over one or more buses.
The processor 701 is configured to execute the corresponding method in the foregoing method embodiment by executing the executable instruction, and the specific implementation process of the method may refer to the foregoing method embodiment, which is not described herein again.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method in the foregoing method embodiment is implemented.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A method of speech extraction, comprising:
acquiring an image at a target sound source;
determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
2. The method according to claim 1, wherein before determining the DOA of the target sound source according to the pixel position of the target sound source in the image, further comprising:
determining the pixel distance between adjacent human faces in the image;
and if the pixel distance is smaller than a preset threshold, performing the operation of determining the DOA of the target sound source according to the pixel position of the target sound source in the image.
3. The method of claim 2, wherein determining the pixel distance between adjacent faces in the image comprises:
determining the pixel position of the central point of the adjacent face in the image;
and determining the pixel distance between the adjacent human faces according to the pixel positions of the central points of the adjacent human faces in the image.
4. The method according to any one of claims 1-3, wherein before determining the DOA of the target sound source according to the pixel position of the target sound source in the image, further comprising:
determining the pixel position of the target sound source in the image according to the pixel position of the central point of each face in the adjacent faces of the image; and the distance between the adjacent faces is smaller than a preset threshold value.
5. The method according to claim 4, wherein the determining the pixel position of the target sound source in the image according to the pixel position of the central point of each face in the adjacent faces of the image comprises:
determining the midpoint of the two center points of the adjacent faces of the image as the pixel position of the target sound source in the image.
6. The method according to any one of claims 1-3, wherein said determining the DOA of the target sound source based on the pixel position of the target sound source in the image comprises:
and determining the DOA of the target sound source according to the pixel position of the target sound source in the image, the distance between a lens and an image sensor in an image acquisition assembly, the pixel position of the central point of the lens in the image and the distance between adjacent photosensitive elements in the image sensor.
7. A method according to any of claims 1-3, wherein said extracting a speech output signal of a target sound source from said DOA and preset output signals of N beams comprises:
determining weights corresponding to the N beams according to the DOA;
determining output signals of the N beams according to the weights corresponding to the N beams and voice signals received by the microphone array;
and acquiring a voice output signal of the target sound source according to the output signals of the N beams.
8. The method according to claim 7, wherein said obtaining the voice output signal of the target sound source according to the output signals of the N beams comprises:
determining the probability of the target beam having the voice of the target sound source according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams; the target beam is one of the N beams;
determining a third post-processing gain according to the first post-processing gain of the target beam with voice, the second post-processing gain of the target beam without voice and the probability;
and determining a voice output signal of the target sound source according to the third post-processing gain.
9. The method of claim 7 wherein the DOA comprises a pitch angle and an azimuth angle of the target sound source, and wherein determining the weights for the N beams based on the DOA comprises:
determining steering vectors corresponding to the N beams according to the pitch angle included in the DOA and the central azimuth angles of the spatial regions corresponding to the N beams respectively;
determining weights corresponding to the N beams according to the diagonally loaded covariance matrix and the steering vectors corresponding to the N beams; the covariance matrix represents the covariance matrix of the microphone array for diffuse noise at frequency f.
10. The method according to claim 8, wherein determining the probability that the voice of the target sound source is in the target beam according to the output signal of the target beam corresponding to the target sound source and the output signals of the N beams further comprises:
determining a target beam corresponding to the target sound source according to the azimuth included in the DOA and the central azimuth of the spatial region corresponding to each of the N beams;
and determining an output signal of a target beam corresponding to the target sound source according to the weight corresponding to the target beam corresponding to the target sound source and the voice signal received by the microphone array.
11. A speech extraction device, comprising:
the acquisition module is used for acquiring an image at a target sound source;
the determining module is used for determining the DOA (direction of arrival) of the target sound source according to the pixel position of the target sound source in the image;
the processing module is used for extracting a voice output signal of the target sound source according to the DOA and the output signals of N preset beams; the N beams are preset beams pointing in different directions with respect to the microphone array, and N ≥ 2.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
13. An electronic device, comprising:
the system comprises a processor, a microphone array and an image acquisition assembly;
the image acquisition assembly is used for acquiring an image;
the microphone array is used for receiving a voice signal;
the processor is configured to perform the method of any one of claims 1-10.
CN202010158648.7A 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium Active CN113450769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010158648.7A CN113450769B (en) 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010158648.7A CN113450769B (en) 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113450769A true CN113450769A (en) 2021-09-28
CN113450769B CN113450769B (en) 2024-06-25

Family

ID=77806277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010158648.7A Active CN113450769B (en) 2020-03-09 2020-03-09 Speech extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113450769B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080240463A1 (en) * 2007-03-29 2008-10-02 Microsoft Corporation Enhanced Beamforming for Arrays of Directional Microphones
CN107534725A (en) * 2015-05-19 2018-01-02 华为技术有限公司 A kind of audio signal processing method and device
CN105847584A (en) * 2016-05-12 2016-08-10 歌尔声学股份有限公司 Method for intelligent device to identify private conversations
US20170345437A1 (en) * 2016-05-27 2017-11-30 Fu Tai Hua Industry (Shenzhen) Co., Ltd. Voice receiving method and device
CN110248197A (en) * 2018-03-07 2019-09-17 杭州海康威视数字技术股份有限公司 Sound enhancement method and device
CN108957392A (en) * 2018-04-16 2018-12-07 深圳市沃特沃德股份有限公司 Sounnd source direction estimation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Yiyuan et al., "Speech enhancement and interference suppression algorithm based on a microphone array" (基于麦克风阵列的语音增强与干扰抑制算法), Audio Engineering (电声技术), no. 02, 5 February 2018 (2018-02-05), pages 4-8 *

Also Published As

Publication number Publication date
CN113450769B (en) 2024-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant