CN115331691A - Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium - Google Patents
Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium
- Publication number
- CN115331691A (application CN202211250290.6A)
- Authority
- CN
- China
- Prior art keywords
- layer
- sound signal
- sampling
- module
- sampling module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Landscapes
- Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The invention relates to a sound pickup method for an unmanned aerial vehicle (UAV), comprising the following steps: acquiring an original sound signal to be processed; performing preliminary noise reduction on the original sound signal to obtain an enhanced sound signal; and inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal. Compared with the prior art, the method fuses features from different layers of the encoder and decoder through the noise reduction neural network, making full use of the features of different scales extracted by different receptive fields. This multi-scale feature fusion improves the accuracy with which effective sound signals are extracted and, by targeting the high-decibel self-noise and wind noise of the UAV platform, achieves voice enhancement in the extremely low signal-to-noise-ratio environment of a UAV.
Description
Technical Field
The invention relates to the technical field of UAV sound pickup, and in particular to a sound pickup method and device for an unmanned aerial vehicle, an unmanned aerial vehicle, and a computer-readable storage medium.
Background
A drone produces significant self-noise in flight, including steady-state mechanical noise, blade noise generated by the rotating propellers, and wind noise generated by the airflow the propellers induce. This self-noise generally exceeds 90 dB, far louder than the effective sounds (such as human voices) to be captured; moreover, the effective sound must travel a long distance from the ground sound source to the drone's microphone, and it attenuates as it propagates through the air. In such a low-SNR environment, the effective sound signal received by a microphone mounted on the drone is drowned out by the drone's self-noise, making it difficult for the microphone to capture effective sound.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a sound pickup method for an unmanned aerial vehicle that attenuates the drone's self-noise and improves the signal-to-noise ratio of the sound signal, so that the drone's microphone can effectively capture effective sound signals.
The invention is realized by the following technical scheme: an unmanned aerial vehicle pickup method comprises the following steps:
acquiring an original sound signal to be processed;
carrying out preliminary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of sequentially connected down-sampling modules and a first convolution module, each down-sampling module comprising a one-dimensional convolution layer and a down-sampling layer; the one-dimensional convolution layer is used for performing a convolution operation on the enhanced sound signal or on the sound signal output by the previous down-sampling module; the down-sampling layer is used for performing a down-sampling operation on the features output by the one-dimensional convolution layer at the same level; and the first convolution module is used for performing a one-dimensional convolution operation on the sound signal output by the last down-sampling module;
the decoder comprises a plurality of up-sampling modules and a second convolution module which are sequentially connected, the down-sampling modules correspond to the up-sampling modules layer by layer, each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer, and the up-sampling layer of the up-sampling module of the first layer is used for performing up-sampling operation on the sound signal output by the first convolution module; the up-sampling layers from the second layer to the last layer of the up-sampling module are used for performing up-sampling operation on the sound signal output by the up-sampling module in the last layer; the splicing layer of the first layer of the up-sampling module is used for splicing the sound signal output by the up-sampling layer on the same layer with the feature extracted by the one-dimensional convolution layer of the up-sampling module on the same layer, and performing linear interpolation operation; the splicing layers from the second layer of up-sampling module to the last layer of up-sampling module are used for splicing the sound signal output by the up-sampling layer at the same layer with the extracted feature of the one-dimensional convolution layer of the down-sampling module at the same layer and the extracted feature of the one-dimensional convolution layer of the up-sampling module at the same layer; the one-dimensional deconvolution layer is used for performing deconvolution operation on the sound signals output by the splicing layer; and the second convolution module is used for performing one-dimensional convolution operation on the sound signal output by the last layer of the up-sampling module.
Compared with the prior art, the drone sound pickup method provided by the invention fuses features from different layers of the encoder and decoder through the noise reduction neural network, making full use of the features of different scales extracted by different receptive fields. This multi-scale feature fusion improves the accuracy of effective sound signal extraction and, by targeting the high-decibel self-noise and wind noise of the drone platform, achieves voice enhancement in the extremely low signal-to-noise-ratio environment of a drone.
Further, the excitation function of the one-dimensional convolution layer is a leaky rectified linear unit (Leaky ReLU); the excitation function of the one-dimensional deconvolution layer is a rectified linear unit (ReLU) in the first through second-to-last up-sampling modules, and a Sigmoid function in the last up-sampling module.
Further, the original sound signal is collected by a microphone linear array;
the step of carrying out preliminary noise reduction processing on the original sound signal to obtain an enhanced sound signal comprises:
performing framing and windowing processing on the original sound signal;
within a preset angle range, calculating a P value of each frame of the original sound signal for each candidate angle, and determining the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the expression of the P value is:

P_l(θ) = Σ_f | Σ_{n=1}^{m} X_n(l, f) · e^{-jk(n-1)d·cosθ} |²

wherein m is the number of microphones in the microphone linear array; k = w/c, w = 2πf, f is the frequency of the Fourier-transformed original sound signal, and c is the speed of sound propagation in air; X_n(l, f) is the short-time Fourier transform of the l-th frame of the n-th channel of the original sound signal; e^{-jk(n-1)d·cosθ} is the delay phase of the l-th frame of the n-th channel, d is the microphone pitch of the linear array, and θ is the candidate angle;

for each frame of the original sound signal, obtaining an enhanced sound signal X according to the sound source direction, the expression of the enhanced sound signal X being:

X(l, f) = Σ_{n=1}^{m} X_n(l, f) · e^{-jk(n-1)d·cosθ_l}

wherein θ_l is the sound source direction determined for frame l.
Further, before inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal, the method further comprises the following steps:
after the enhanced sound signal is filtered by a band-pass filter, detecting effective sound in the enhanced sound signal through a VAD algorithm, and proceeding to the subsequent step when continuous effective sound is detected.
Based on the same inventive concept, the application also provides an unmanned aerial vehicle pickup device, which comprises:
the signal acquisition module is used for acquiring an original sound signal to be processed;
the signal enhancement module is used for carrying out preliminary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
the noise reduction processing module is used for inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of down-sampling modules and a first convolution module which are sequentially connected, each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, and the one-dimensional convolution layer is used for performing convolution operation on the enhanced sound signal or the sound signal output by the last sampling module; the down-sampling layer is used for performing down-sampling operation on the characteristics output by the one-dimensional convolution layer;
the decoder comprises a plurality of sequentially connected up-sampling modules and a second convolution module, the down-sampling modules corresponding to the up-sampling modules level by level, each up-sampling module comprising an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer; the up-sampling layer is used for performing an up-sampling operation on the sound signal output by the first convolution module or by the previous up-sampling module; the splicing layer of the first up-sampling module is used for splicing the sound signal output by the up-sampling layer with the feature extracted by the one-dimensional convolution layer of the down-sampling module at the same level, and performing a linear interpolation operation; the splicing layers of the second through last up-sampling modules are used for splicing the sound signal output by the up-sampling layer with the feature extracted by the one-dimensional convolution layer of the down-sampling module at the same level and with the output of the down-sampling layer of the down-sampling module at the same level; and the one-dimensional deconvolution layer is used for performing a deconvolution operation on the sound signal output by the splicing layer.
Further, the excitation function of the one-dimensional convolution layer is a leaky rectified linear unit (Leaky ReLU); the excitation function of the one-dimensional deconvolution layer is a rectified linear unit (ReLU) in the first through second-to-last up-sampling modules, and a Sigmoid function in the last up-sampling module.
Further, the original sound signal is collected by a microphone linear array;
the signal enhancement module includes:
the framing windowing submodule is used for performing framing windowing processing on the original sound signal;
a sound source direction sub-module, configured to calculate, for each candidate angle within a preset angle range, a P value of each frame of the original sound signal, and determine the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the expression of the P value is:

P_l(θ) = Σ_f | Σ_{n=1}^{m} X_n(l, f) · e^{-jk(n-1)d·cosθ} |²

wherein m is the number of microphones in the microphone linear array; k = w/c, w = 2πf, f is the frequency of the Fourier-transformed original sound signal, and c is the speed of sound propagation in air; X_n(l, f) is the short-time Fourier transform of the l-th frame of the n-th channel of the original sound signal; e^{-jk(n-1)d·cosθ} is the delay phase of the l-th frame of the n-th channel, d is the microphone pitch of the linear array, and θ is the candidate angle;

a signal accumulation sub-module, configured to obtain, for each frame of the original sound signal, an enhanced sound signal X according to the sound source direction, the expression of the enhanced sound signal X being:

X(l, f) = Σ_{n=1}^{m} X_n(l, f) · e^{-jk(n-1)d·cosθ_l}

wherein θ_l is the sound source direction determined for frame l.
Further, still include:
and a continuous effective sound detection module, used for inputting the enhanced sound signal into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and passing the signal to the noise reduction processing module when continuous effective sound is detected.
Based on the same inventive concept, the present application further provides an unmanned aerial vehicle, comprising a fuselage and:
a microphone array, arranged on the fuselage, for collecting original sound signals and transmitting them to the controller;
a controller, comprising:
a processor;
a memory for storing a computer program for execution by the processor;
wherein the processor implements the steps of the above method when executing the computer program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium on which a computer program is stored, which when executed performs the steps of the above-described method.
For a better understanding and practice, the invention is described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of an exemplary application environment of a pickup method of an unmanned aerial vehicle according to an embodiment;
fig. 2 is a schematic flow chart of a pickup method of an unmanned aerial vehicle according to an embodiment;
FIG. 3 is a flow diagram of a spatial filtering process according to one embodiment;
FIG. 4 is a schematic diagram of a noise reduction neural network in one embodiment;
FIG. 5 is a time domain diagram of an original sound signal collected from a sound source;
FIG. 6 is a time domain diagram of a valid sound signal;
fig. 7 is a schematic structural diagram of a pickup device of an unmanned aerial vehicle in one embodiment;
fig. 8 is a schematic structural diagram of the drone in one embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, nor should they be construed to indicate or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects, meaning that three relationships are possible; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The invention denoises the sound signal collected by the drone through a noise reduction neural network improved from a U-Net++ and LSTM base network framework, and is particularly suitable for low signal-to-noise-ratio pickup environments. The following embodiments describe the details.
Please refer to fig. 1, a schematic diagram of an exemplary application environment of the drone sound pickup method of an embodiment, comprising a drone microphone 11 and a remote controller 12. The drone microphone 11 is a sound receiving device mounted on the drone, and may be a microphone array or the like; the remote controller 12 comprises a memory storing a computer program and a processor that can execute it. After the drone microphone 11 collects the sound signal, it is transmitted remotely to the remote controller 12, for example via a Bluetooth or wireless Wi-Fi module; the remote controller 12 processes the received sound signal with the drone sound pickup method of this embodiment to obtain a clear, effective sound signal.
Please refer to fig. 2, which is a flowchart illustrating an exemplary method for picking up sound by an unmanned aerial vehicle. The method comprises the following steps:
S1: acquiring an original sound signal to be processed;
S2: carrying out preliminary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
S3: inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal.
In step S1, the original sound signal is a sound signal directly collected by a microphone mounted on the unmanned aerial vehicle, and the original sound signal can be obtained through wired or wireless transmission with the microphone.
In step S2, a preliminary noise reduction process is performed on the original sound signal to enhance the effective sound in the original sound signal, where the preliminary noise reduction process is related to the structure of the microphone that collects the original sound signal. Please refer to fig. 3, which is a flowchart illustrating a spatial filtering process according to an embodiment, including the following steps:
S21: performing framing and windowing processing on the original sound signal;
the original sound signal is subjected to frame windowing processing, so that short-time analysis is performed on the original sound signal, and processing of non-stationary signals is facilitated.
S22: calculating the P value of each frame of original sound signals aiming at each angle within a preset angle range, and determining the angle corresponding to the maximum P value as the sound source direction of the frame;
The preset angle range can be set according to the relative position of the microphone and the drone. For example, when the microphone is located at the front of the drone, the effective sound most likely arrives from the drone's front side while the propeller noise lies directly behind the microphone, so the preset angle range can be restricted to a sector in front of the drone to reduce the amount of calculation.
The P value of the original sound signal is a spatial filter function whose expression is:

P_l(θ) = Σ_f | Σ_{n=1}^{m} X_n(l, f) · e^{-jk(n-1)d·cosθ} |²

wherein m is the number of microphones in the microphone linear array; n indexes the channel of the n-th microphone; k = w/c, w = 2πf, f is the frequency obtained by Fourier-transforming the time-domain original sound signal, and c is the speed of sound propagation in air; l indexes the frame; X_n(l, f) is the short-time Fourier transform of the l-th frame of the n-th channel of the original sound signal; e^{-jk(n-1)d·cosθ} is the delay phase of the l-th frame of the n-th channel, d is the microphone pitch of the linear array, and θ is the candidate angle.
S23: for each channel of the original sound signal in the same frame, apply the delay phase corresponding to the frame's sound source direction, and accumulate the signals of all channels to obtain the enhanced sound signal.
The expression of the enhanced sound signal X is:

X(l, f) = Σ_{n=1}^{m} X_n(l, f) · e^{-jk(n-1)d·cosθ_l}

wherein θ_l is the sound source direction determined for frame l in step S22.
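A minimal NumPy sketch of the spatial filtering in steps S22-S23 — a steered-response power search over candidate angles followed by phase-aligned summation — is given below. The array geometry (4 microphones at 5 cm pitch), the 1 kHz test tone, and the 5° angle grid are illustrative assumptions, and the sign convention of the delay phase is one of two equivalent choices.

```python
import numpy as np

def estimate_direction(frame, d, c, fs, angles):
    """Steered-response power (S22): return the candidate angle maximising
    P(theta) for one multi-channel frame, plus the aligned sum (S23)."""
    m, n = frame.shape
    X = np.fft.rfft(frame, axis=1)                        # X_n(l, f) per channel
    k = 2.0 * np.pi * np.fft.rfftfreq(n, 1.0 / fs) / c    # wavenumber k = w/c
    mics = np.arange(m)[:, None]                          # channel index n-1
    powers = [np.sum(np.abs(np.sum(X * np.exp(-1j * k * mics * d * np.cos(th)),
                                   axis=0)) ** 2) for th in angles]
    best = angles[int(np.argmax(powers))]
    enhanced = np.sum(X * np.exp(-1j * k * mics * d * np.cos(best)), axis=0)
    return best, enhanced                                 # enhanced spectrum

# Simulate a 1 kHz tone arriving from 60 degrees at a 4-mic, 5 cm-pitch array.
fs, f0, d, c, m = 16000, 1000.0, 0.05, 343.0, 4
t = np.arange(1024) / fs
theta0 = np.deg2rad(60.0)
taus = np.arange(m) * d * np.cos(theta0) / c              # per-channel delays
frame = np.stack([np.sin(2 * np.pi * f0 * (t + tau)) for tau in taus])
angles = np.deg2rad(np.arange(0, 181, 5))
best, _ = estimate_direction(frame, d, c, fs, angles)
print(f"{float(np.rad2deg(best)):.0f} degrees")           # 60 degrees
```

Accumulating the phase-aligned channels boosts the coherent source by a factor of m in amplitude while incoherent noise grows only as sqrt(m), which is the preliminary SNR gain of this step.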
In step S3, the noise reduction neural network performs further human voice enhancement and noise reduction processing on the enhanced sound signal. Please refer to fig. 4, which is a schematic structural diagram of a noise reduction neural network in an embodiment, the noise reduction neural network includes an encoder and a decoder. The encoder is used for down-sampling and feature extraction of an input enhanced sound signal, and the decoder is used for up-sampling features output by the encoder and then outputting an effective sound signal.
Specifically, the encoder comprises a plurality of down-sampling modules (Downsampling blocks) and a first convolution module (1D Convolution). Each down-sampling module comprises a one-dimensional convolution layer (1D Convolution) and a down-sampling layer (Downsampling). The one-dimensional convolution layer performs a convolution operation on the enhanced sound signal or on the sound signal output by the previous down-sampling module to extract features; in a specific implementation, the stride of the one-dimensional convolution layer is set to 2, the convolution kernel size to 15, and the excitation function is a leaky rectified linear unit (Leaky ReLU). The down-sampling layer performs a down-sampling operation on the features output by the one-dimensional convolution layer; the signal it outputs is the sound signal output by that down-sampling module. The first convolution module performs a one-dimensional convolution operation on the sound signal output by the last down-sampling module to extract features; in one embodiment, its convolution kernel size is set to 15.
The decoder comprises a plurality of sequentially connected up-sampling modules (Upsampling blocks) and a second convolution module (1D Convolution). The down-sampling modules correspond to the up-sampling modules level by level: the first down-sampling module corresponds to the last up-sampling module, the second down-sampling module to the second-to-last up-sampling module, and so on. Each up-sampling module comprises an up-sampling layer (Upsampling), a splicing layer and a one-dimensional deconvolution layer (1D Deconvolution). The up-sampling layer performs an up-sampling operation on the sound signal output by the first convolution module or by the previous up-sampling module. The splicing layer of the first up-sampling module splices the output of its up-sampling layer with the feature extracted by the one-dimensional convolution layer of the down-sampling module at the same level, i.e., a feature skip connection, and performs a linear interpolation operation. The splicing layers of the second through last up-sampling modules perform a feature skip connection between the output of their up-sampling layer and the feature extracted by the one-dimensional convolution layer of the down-sampling module at the same level, and additionally splice in the output of the down-sampling layer of the down-sampling module at the same level, i.e., a sampling skip connection. The one-dimensional deconvolution layer performs a deconvolution operation on the sound signal output by the splicing layer; in a specific embodiment, the stride of the one-dimensional deconvolution layer in the decoder is set to 2 and the convolution kernel size to 15. The excitation function of the one-dimensional deconvolution layer is a rectified linear unit (ReLU) in the first through second-to-last up-sampling modules, and a Sigmoid function in the last up-sampling module. The second convolution module performs a one-dimensional convolution operation on the sound signal output by the last up-sampling module and outputs the effective sound signal; preferably, its convolution kernel size is 1, so that the original sound signal can be fully utilized without changing the length of the sound data, suppressing the noise in the original signal and restoring clean effective sound.
The effective sound signal is the sound signal that actually needs to be collected, and can be preset as, for example, a human voice signal.
In an alternative embodiment, the encoder comprises 12 downsampling modules and the decoder comprises 12 upsampling modules.
When the noise reduction neural network is trained, the loss function adopts a mean square error (MSE) loss function. In one specific implementation, training is performed on a Quadro P1000 GPU with audio at a 16 kHz sampling rate, the batch size is 30, and an Adam optimizer is used with the following parameters: an initial learning rate of 0.001, a first-order moment estimation exponential decay rate of 0.9, and a second-order moment estimation exponential decay rate of 0.99. The homemade data set includes pure human voice samples based on the ST-CMDS-20170001_1-OS data set and noise samples that mix wind and propeller noise at different signal-to-noise ratios (5, 0, -5, -10, -15 dB). The data set comprises 400,000 samples in total, 200,000 pure human voice samples and 200,000 mixed noise samples; of the noise samples, 150,000 are used as the training set, 30,000 as the validation set, and 20,000 as the test set.
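The training configuration above (MSE loss; Adam with learning rate 0.001 and moment decay rates 0.9 and 0.99) can be illustrated with a minimal NumPy sketch of the update rule for a single scalar parameter. The class and function names are illustrative assumptions, not part of the patent.

```python
import numpy as np

class Adam:
    """Adam update rule with the hyperparameters reported in the embodiment."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0   # first-moment estimate
        self.v = 0.0   # second-moment estimate
        self.t = 0     # step counter

    def step(self, w, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def mse_loss(pred, target):
    """Mean square error loss used for training the denoising network."""
    return np.mean((pred - target) ** 2)
```

On the first step the bias-corrected update equals the learning rate times the gradient sign, which is why 0.001 bounds the initial per-step change.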
In a preferred embodiment, before the enhanced sound signal is input into the noise reduction neural network for processing to obtain the effective sound signal, the method further comprises the following step: after the enhanced sound signal is input into a band-pass filter for filtering, effective sound in the enhanced sound signal is detected through a VAD algorithm, detection is performed continuously through a sliding window, and the subsequent step is entered when continuous effective sound is detected. The pass band of the band-pass filter is set to the frequency range of the effective sound; if the effective sound is human voice, the pass band can be set to 300-3500 Hz. The VAD (Voice Activity Detection) algorithm can detect the start point and the end point of valid sound against a noise background. Because the noise reduction processing of the noise reduction neural network involves a large amount of computation, it places a high demand on the chip's computing power; inputting the enhanced sound signal into the noise reduction neural network only after continuous effective sound has been detected reduces the computational load on the chip, lowers heat generation, and prolongs the chip's service life.
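The gating step above can be sketched as follows. The energy threshold is a crude stand-in for a real VAD algorithm, and all names, thresholds, and window parameters are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def frame_energy_vad(frames, energy_thresh):
    """Flag each frame as active when its energy exceeds a threshold —
    a simple stand-in for a full VAD algorithm."""
    return np.array([np.sum(f ** 2) > energy_thresh for f in frames])

def has_continuous_speech(vad_flags, window=10, min_active=8):
    """Slide a window over the per-frame VAD flags and report continuous
    effective sound only when enough frames in some window are active,
    gating the expensive neural-network denoising stage."""
    for i in range(len(vad_flags) - window + 1):
        if np.sum(vad_flags[i:i + window]) >= min_active:
            return True
    return False
```

Only when `has_continuous_speech` returns true would the enhanced signal be forwarded to the noise reduction neural network.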
Compared with the prior art, the noise reduction neural network disclosed by the invention fuses features from different layers of the encoder and decoder, making full use of features of different scales extracted from different receptive fields. This multi-scale feature fusion improves the extraction accuracy of the effective sound signal and, by targeting the high-decibel self-noise and wind noise of the unmanned aerial vehicle platform, achieves human voice enhancement in the extremely low signal-to-noise-ratio environment of the unmanned aerial vehicle.
In addition, the invention collects sound through a microphone linear array and provides a spatial filtering algorithm tailored to that array, so that effective sound can be enhanced directionally, achieving further noise reduction.
Please refer to fig. 5, which is a time domain diagram of an original sound signal collected from a sound source; please refer to fig. 6, which is a time domain diagram of an effective sound signal obtained after the original sound signal collected from the sound source is processed by the above-mentioned pickup method by the drone. The comparison shows that after the unmanned aerial vehicle pickup method is used for processing, noise in original sound signals is suppressed, and human voice is reserved.
Based on the same inventive concept, the invention also provides a pickup apparatus for an unmanned aerial vehicle. Please refer to fig. 7, which is a schematic structural diagram of an unmanned aerial vehicle sound pickup apparatus in an embodiment. The apparatus includes a signal obtaining module 21, a signal enhancing module 22, and a noise reduction processing module 23, wherein the signal obtaining module 21 is configured to obtain an original sound signal to be processed; the signal enhancing module 22 is configured to perform spatial filtering processing on the original sound signal to obtain an enhanced sound signal; and the noise reduction processing module 23 is configured to input the enhanced sound signal into a noise reduction neural network for processing, so as to obtain an effective sound signal.
In an optional embodiment, the signal enhancement module 22 includes a framing windowing sub-module 221, a sound source direction sub-module 222, and a signal accumulation sub-module 223, where the framing windowing sub-module 221 is configured to perform framing windowing on the original sound signal; the sound source direction sub-module 222 is configured to calculate, for each angle within a preset angle range, a P value of each frame of original sound signals, and determine an angle corresponding to a maximum P value as a sound source direction of the frame; the signal accumulation sub-module 223 is configured to obtain a delay phase in the sound source direction of the frame for each path of original sound signals of the same frame, and accumulate the original sound signals of all the paths to obtain an enhanced sound signal.
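The three sub-modules above map onto standard delay-and-sum beamforming: scan candidate angles for the maximum steered response power (the P value), then accumulate the phase-aligned channels. The sketch below assumes a uniform linear array with spacing `d`; the function names and the exact steering expression are illustrative assumptions, not the patent's formula.

```python
import numpy as np

def srp_steer(frames_fft, freqs, d, c=343.0, angles=np.linspace(0, np.pi, 181)):
    """Scan candidate angles; P is the energy of the phase-aligned sum
    across the m channels (standard delay-and-sum steered response power)."""
    m = frames_fft.shape[0]                 # channels x frequency bins
    best_p, best_theta = -np.inf, None
    for theta in angles:
        k = 2 * np.pi * freqs / c           # wavenumber per frequency bin
        # delay phase for channel n: e^{j k n d cos(theta)}
        phases = np.exp(1j * np.outer(np.arange(m) * d * np.cos(theta), k))
        p = np.sum(np.abs(np.sum(frames_fft * phases, axis=0)) ** 2)
        if p > best_p:
            best_p, best_theta = p, theta
    return best_theta

def enhance(frames_fft, freqs, theta, d, c=343.0):
    """Accumulate all channels after applying the source-direction delay phase."""
    m = frames_fft.shape[0]
    k = 2 * np.pi * freqs / c
    phases = np.exp(1j * np.outer(np.arange(m) * d * np.cos(theta), k))
    return np.sum(frames_fft * phases, axis=0)   # enhanced spectrum X
```

For two identical channels (zero inter-channel delay) the power peaks at broadside, i.e. 90 degrees, as expected for a linear array.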
In a preferred embodiment, the pickup apparatus for unmanned aerial vehicle further includes a continuous effective sound detection module 24, where the continuous effective sound detection module 24 is configured to detect an effective sound in the enhanced sound signal through a VAD algorithm after the enhanced sound signal is input into the band-pass filter for filtering, and perform continuous detection through a sliding window, and when a continuous effective sound is detected, enter a subsequent step.
For the device embodiments, reference is made to the description of the method embodiments for relevant details, since they correspond essentially to the method embodiments.
Based on the above pickup method, the present application further provides an unmanned aerial vehicle. Please refer to fig. 8, which is a schematic structural diagram of an embodiment of a drone; the drone includes a body 31, a microphone array 32, a support rod 33, a drone controller (not shown), and a remote controller (not shown). The body 31 is the flying carrier. The microphone array 32 is mounted on the body 31 through the support rod 33; it can be arranged directly in front of the body 31 or in the 45-degree direction above the front, and can be a linear array of 2-4 microphones. For the case where the microphone array 32 is arranged directly in front of the body 31, cardioid directional microphones may be selected; for the case where it is arranged in the 45-degree direction above the front of the body 31, figure-8 (bidirectional) microphones may be used; the directivity of sound collection is thereby improved. The support rod 33 may be an elongated lightweight carbon tube. The drone controller comprises a pickup module, a data transmission module, and a hailing module, wherein the pickup module is used for receiving the original sound signals collected by the microphone array 32; the data transmission module is used for remotely transmitting the original sound signal from the pickup module to the remote controller and for receiving a hailing voice signal from the remote controller; and the hailing module is used for receiving and playing the hailing voice signal from the data transmission module. The remote controller comprises one or more processors and a memory, wherein the processor is configured to execute a computer program implementing the unmanned aerial vehicle pickup method of the method embodiments, and the memory is used for storing the computer program executable by the processor.
Based on the same inventive concept, the present invention further provides a computer-readable storage medium, corresponding to the foregoing embodiments of the sound pickup method for the unmanned aerial vehicle, wherein the computer-readable storage medium stores thereon a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the sound pickup method for the unmanned aerial vehicle, which are described in any one of the foregoing embodiments.
This application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, to those skilled in the art, changes and modifications may be made without departing from the spirit of the present invention, and it is intended that the present invention encompass such changes and modifications.
Claims (10)
1. A pickup method for an unmanned aerial vehicle, characterized by comprising the following steps:
acquiring an original sound signal to be processed;
carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of down-sampling modules and a first convolution module which are sequentially connected, each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, and the one-dimensional convolution layer is used for performing a convolution operation on the enhanced sound signal or on the sound signal output by the previous down-sampling module; the down-sampling layer is used for performing a down-sampling operation on the features output by the one-dimensional convolution layer of the same layer; the first convolution module is used for performing a one-dimensional convolution operation on the sound signal output by the last-layer down-sampling module;
the decoder comprises a plurality of up-sampling modules and a second convolution module which are sequentially connected, the down-sampling modules correspond to the up-sampling modules layer by layer, and each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer; the up-sampling layer of the first-layer up-sampling module is used for performing an up-sampling operation on the sound signal output by the first convolution module; the up-sampling layers from the second-layer to the last-layer up-sampling modules are used for performing an up-sampling operation on the sound signal output by the previous-layer up-sampling module; the splicing layer of the first-layer up-sampling module is used for splicing the sound signal output by the up-sampling layer of the same layer with the features extracted by the one-dimensional convolution layer of the down-sampling module at the corresponding layer and performing a linear interpolation operation; the splicing layers from the second-layer up-sampling module to the last-layer up-sampling module are used for splicing the sound signal output by the up-sampling layer of the same layer with the features extracted by the one-dimensional convolution layer of the down-sampling module at the corresponding layer and the features extracted by the one-dimensional convolution layer of the down-sampling module at the previous layer; the one-dimensional deconvolution layer is used for performing a deconvolution operation on the sound signal output by the splicing layer; and the second convolution module is used for performing a one-dimensional convolution operation on the sound signal output by the last-layer up-sampling module.
2. The method of claim 1, wherein: the excitation function of the one-dimensional convolution layer is a leaky linear rectification function (Leaky ReLU); the excitation function of the one-dimensional deconvolution layer from the first-layer up-sampling module to the second-to-last-layer up-sampling module is a linear rectification function (ReLU), and the excitation function of the one-dimensional deconvolution layer of the last-layer up-sampling module is a Sigmoid function.
3. The method of claim 1, wherein: the original sound signal is collected through a microphone linear array;
carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal, and comprising the following steps of:
performing frame windowing processing on the original sound signal;
in a preset angle range, calculating a P value of each frame of the original sound signal for each angle, and determining the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the P value is expressed as:

P(θ) = Σ_w | Σ_{n=1}^{m} X_n(l, w) · e^{jτ_n(θ, w)} |², with τ_n(θ, w) = k·(n−1)·d·cos θ

wherein m is the number of microphones in the microphone linear array; k = w/c, w = 2πf, f is the frequency of the Fourier-transformed original sound signal, and c is the speed of sound propagation in air; X_n(l, w) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel of the original sound signal; τ_n(θ, w) is the delay phase of the l-th frame sound signal of the n-th channel, where for the effective sound f takes the frequency of the effective sound; d is the microphone spacing of the microphone linear array; and θ is the angle being evaluated;

for each frame of the original sound signal, obtaining the enhanced sound signal X according to the sound source direction θ₀, wherein the expression of the enhanced sound signal X is:

X(l, w) = Σ_{n=1}^{m} X_n(l, w) · e^{jτ_n(θ₀, w)}
4. The method of claim 1, wherein: before inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal, the method further comprises the following steps:
and after the enhanced sound signal is input into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and entering the subsequent step when continuous effective sound is detected.
5. A pickup apparatus for an unmanned aerial vehicle, characterized by comprising:
the signal acquisition module is used for acquiring an original sound signal to be processed;
the signal enhancement module is used for carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
the noise reduction processing module is used for inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of down-sampling modules and a first convolution module which are sequentially connected, each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, and the one-dimensional convolution layer is used for performing a convolution operation on the enhanced sound signal or on the sound signal output by the previous down-sampling module; the down-sampling layer is used for performing a down-sampling operation on the features output by the one-dimensional convolution layer;
the decoder comprises a plurality of up-sampling modules and a second convolution module which are sequentially connected, the down-sampling modules correspond to the up-sampling modules layer by layer, and each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer; the up-sampling layer is used for performing an up-sampling operation on the sound signal output by the first convolution module or on the sound signal output by the previous-layer up-sampling module; the splicing layer of the first-layer up-sampling module is used for splicing the sound signal output by the up-sampling layer with the features extracted by the one-dimensional convolution layer of the down-sampling module at the corresponding layer and performing a linear interpolation operation; the splicing layers from the second-layer up-sampling module to the last-layer up-sampling module are used for splicing the sound signal output by the up-sampling layer with the features extracted by the one-dimensional convolution layer of the down-sampling module at the corresponding layer and the features extracted by the one-dimensional convolution layer of the down-sampling module at the previous layer; and the one-dimensional deconvolution layer is used for performing a deconvolution operation on the sound signal output by the splicing layer.
6. The apparatus of claim 5, wherein: the excitation function of the one-dimensional convolution layer is a leaky linear rectification function (Leaky ReLU); the excitation function of the one-dimensional deconvolution layer from the first-layer up-sampling module to the second-to-last-layer up-sampling module is a linear rectification function (ReLU), and the excitation function of the one-dimensional deconvolution layer of the last-layer up-sampling module is a Sigmoid function.
7. The apparatus of claim 5, wherein: the original sound signal is collected through a microphone linear array;
the signal enhancement module includes:
the framing windowing submodule is used for performing framing windowing processing on the original sound signal;
a sound source direction submodule, configured to calculate, for each angle within a preset angle range, a P value of each frame of the original sound signal, and determine the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the P value is expressed as:

P(θ) = Σ_w | Σ_{n=1}^{m} X_n(l, w) · e^{jτ_n(θ, w)} |², with τ_n(θ, w) = k·(n−1)·d·cos θ

wherein m is the number of microphones in the microphone linear array; k = w/c, w = 2πf, f is the frequency of the Fourier-transformed original sound signal, and c is the speed of sound propagation in air; X_n(l, w) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel of the original sound signal; τ_n(θ, w) is the delay phase of the l-th frame sound signal of the n-th channel, where for the effective sound f takes the frequency of the effective sound; d is the microphone spacing of the microphone linear array; and θ is the angle being evaluated;

a signal accumulation submodule, configured to obtain, for each frame of the original sound signal, the enhanced sound signal X according to the sound source direction θ₀, wherein the expression of the enhanced sound signal X is:

X(l, w) = Σ_{n=1}^{m} X_n(l, w) · e^{jτ_n(θ₀, w)}
8. The apparatus of claim 5, further comprising:
and the continuous effective sound detection module is used for inputting the enhanced sound signal into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and entering the noise reduction processing module when the continuous effective sound is detected.
9. An unmanned aerial vehicle, comprising a body, characterized by further comprising:
the microphone array is arranged on the machine body and used for collecting original sound signals and transmitting the original sound signals to the controller;
a controller, comprising:
a processor;
a memory for storing a computer program for execution by the processor;
wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed, carries out the steps of the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211250290.6A CN115331691A (en) | 2022-10-13 | 2022-10-13 | Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115331691A true CN115331691A (en) | 2022-11-11 |
Family
ID=83913561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211250290.6A Pending CN115331691A (en) | 2022-10-13 | 2022-10-13 | Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115331691A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109831731A (en) * | 2019-02-15 | 2019-05-31 | 杭州嘉楠耘智信息科技有限公司 | Sound source orientation method and device and computer readable storage medium |
CN109949821A (en) * | 2019-03-15 | 2019-06-28 | 慧言科技(天津)有限公司 | A method of far field speech dereverbcration is carried out using the U-NET structure of CNN |
CN112904279A (en) * | 2021-01-18 | 2021-06-04 | 南京工程学院 | Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum |
CN114333796A (en) * | 2021-12-27 | 2022-04-12 | 深圳Tcl数字技术有限公司 | Audio and video voice enhancement method, device, equipment, medium and smart television |
Non-Patent Citations (3)
Title |
---|
CRAIG MACARTNEY ET AL: "Improved Speech Enhancement with the Wave-U-Net", 《ARXIV》 *
DANIEL STOLLER ET AL: "WAVE-U-NET: A MULTI-SCALE NEURAL NETWORK FOR END-TO-END AUDIO SOURCE SEPARATION", 《ARXIV》 * |
YUAN Anfu et al.: "An improved joint SRP-PHAT speech localization algorithm", 《Journal of Nanjing University of Information Science & Technology》 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11620983B2 (en) | Speech recognition method, device, and computer-readable storage medium | |
CN110600017B (en) | Training method of voice processing model, voice recognition method, system and device | |
CN106846803B (en) | Traffic event detection device and method based on audio frequency | |
CN111933188A (en) | Sound event detection method based on convolutional neural network | |
CN110245608A (en) | A kind of Underwater targets recognition based on semi-tensor product neural network | |
CN110556103A (en) | Audio signal processing method, apparatus, system, device and storage medium | |
Tang et al. | Improving reverberant speech training using diffuse acoustic simulation | |
CN105637331B (en) | Abnormal detector, method for detecting abnormality | |
WO2019239043A1 (en) | Location of sound sources in a given acoustic environment | |
CN111031463B (en) | Microphone array performance evaluation method, device, equipment and medium | |
KR20190108804A (en) | Method and apparatus of sound event detecting robust for frequency change | |
CN113205803B (en) | Voice recognition method and device with self-adaptive noise reduction capability | |
CN113436640B (en) | Audio noise reduction method, device and system and computer readable storage medium | |
CN112466290B (en) | Abnormal sound detection model training method and device and computer storage medium | |
EP4046390B1 (en) | Improved location of an acoustic source | |
CN112259116A (en) | Method and device for reducing noise of audio data, electronic equipment and storage medium | |
EP4248231A1 (en) | Improved location of an acoustic source | |
CN115565550A (en) | Baby crying emotion identification method based on characteristic diagram light convolution transformation | |
WO2018003158A1 (en) | Correlation function generation device, correlation function generation method, correlation function generation program, and wave source direction estimation device | |
CN115598594B (en) | Unmanned aerial vehicle sound source positioning method and device, unmanned aerial vehicle and readable storage medium | |
CN115331691A (en) | Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium | |
CN112735466A (en) | Audio detection method and device | |
CN103617798A (en) | Voice extraction method under high background noise | |
CN111916060A (en) | Deep learning voice endpoint detection method and system based on spectral subtraction | |
CN107919136B (en) | Digital voice sampling frequency estimation method based on Gaussian mixture model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20221111 |