CN116386589A - Deep learning voice reconstruction method based on smart phone acceleration sensor - Google Patents
- Publication number
- CN116386589A (application CN202310387588.XA)
- Authority
- CN
- China
- Prior art keywords
- loss
- mel
- voice
- signal
- frequency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a deep learning voice reconstruction method based on a smartphone acceleration sensor. Data are first collected: motherboard vibration signals induced by the smartphone loudspeaker are captured with several motion sensors under multiple sampling-frequency modes. The sensor signals are then processed through linear interpolation, noise removal and feature extraction. For voice reconstruction, a speech reconstruction algorithm based on a wavelet multi-scale time-frequency generative adversarial network is proposed to convert the preprocessed motion-sensor data into speech waveform data. Finally, the generated speech is evaluated with subjective and objective metrics. The invention improves the high-frequency performance and robustness of speech synthesis, so that the synthesized speech is closer to the original speech.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a deep learning voice reconstruction method.
Background
With the development of basic mobile-internet technologies over the past decade, e-commerce, social networks and new media industries have grown rapidly, and the consumer market for mobile intelligent terminals, represented by the smartphone, has expanded accordingly. According to the latest data from the statistics provider Statista, the number of smartphone users worldwide reached 6.64 billion by the third quarter of 2022, accounting for 83.32% of the world population. As the number of smartphone users has exploded, the coupling between human daily life and intelligent devices has become ever tighter. As an indispensable component in the design paradigm of modern mobile intelligent devices, the motion sensor bears the key responsibilities of sensing the external environment of the device, recognizing its motion state and reading the user's interactive input, and is widely deployed on all kinds of mobile terminals. Motion sensors, represented by the acceleration sensor, are usually mounted on the smartphone motherboard and are tightly coupled with core components including the processor, loudspeaker and microphone, jointly serving the operation of the core system.
Inside the handset, the loudspeaker and a number of sensors are integrated on the same circuit board, which acts as an efficient solid transmission medium. When the loudspeaker operates, sound vibrations propagate through the entire motherboard, allowing a motion sensor mounted on the same surface to capture the solid-borne vibrations caused by the loudspeaker. Moreover, because the motion sensor and the loudspeaker are in physical contact on the same circuit board and in close proximity to each other, the voice signal emitted by the loudspeaker always has a significant impact on motion sensors (such as the gyroscope and accelerometer), no matter how the smartphone is placed (on a desk or in the hand). These motion sensors are sensitive to vibration, so the signals they capture always contain acoustic vibrations conducted from the phone loudspeaker to the motherboard.
Previous studies generally formulated this as a classification problem and applied machine-learning solutions to construct mappings between features extracted from non-acoustic signals and words. Numerous studies have demonstrated the feasibility of identifying digits, words and even keywords from sensor vibration signals. For example, a research team at Zhejiang University found that the sampling frequency of built-in acceleration sensors in smartphones released after 2018 reaches 500 Hz, which covers almost the entire fundamental-frequency band of adult speech (85-255 Hz). They proposed a deep-learning speech recognition system in which a low-permission spy application collects the vibration signal induced by the loudspeaker, converts it into a spectrogram, and then uses DenseNet as the backbone network to classify the speech information (text) carried by the acceleration-signal spectrogram. Han et al. proposed a distributed side-channel attack using vibration signals captured by a sensor network (including detectors, accelerometers and gyroscopes); to overcome the low sampling frequency of individual sensors, they used a distributed form of TI-ADC (time-interleaved analog-to-digital converter) to approach a high overall sampling frequency while keeping each node's sampling rate low. However, existing work can only recognize a few words from a very limited vocabulary.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a deep learning voice reconstruction method based on a smartphone acceleration sensor. Data are first collected: motherboard vibration signals induced by the smartphone loudspeaker are captured with several motion sensors under multiple sampling-frequency modes. The sensor signals are then processed through linear interpolation, noise removal and feature extraction. For voice reconstruction, a speech reconstruction algorithm based on a wavelet multi-scale time-frequency generative adversarial network is proposed to convert the preprocessed motion-sensor data into speech waveform data. Finally, the generated speech is evaluated with subjective and objective metrics. The invention improves the high-frequency performance and robustness of speech synthesis, so that the synthesized speech is closer to the original speech.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
step 1: collecting data;
playing the audio file by using a mobile phone;
collecting signals of an acceleration sensor, and recording the signals and corresponding time stamps;
step 2: data processing; performing linear interpolation, noise processing and feature extraction on the sensor signals;
step 2-1: linear interpolation;
locating, via the timestamps, all time points at which acceleration data are missing, and filling in the missing data by linear interpolation; for acceleration readings that share the same timestamp, taking their mean value to represent the acceleration at that timestamp, so that each timestamp corresponds to exactly one acceleration value;
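A minimal sketch of step 2-1 follows, assuming the collector logs rows of (timestamp in milliseconds, three axis readings); the column names, the pandas-based implementation and the nominal 500 Hz grid are illustrative assumptions, not taken from the patent.

```python
import numpy as np
import pandas as pd

def regularize_accel(df: pd.DataFrame, fs: float = 500.0) -> np.ndarray:
    """Collapse duplicate timestamps by their mean, then linearly interpolate
    each axis onto a uniform time grid at the nominal sampling rate fs (Hz)."""
    # One reading per timestamp: duplicates are replaced by their mean value.
    df = df.groupby("timestamp_ms", as_index=False).mean()

    t = df["timestamp_ms"].to_numpy(dtype=float)
    grid = np.arange(t[0], t[-1], 1000.0 / fs)          # uniform grid in milliseconds

    # Missing time points are filled by linear interpolation on each axis.
    return np.stack(
        [np.interp(grid, t, df[axis].to_numpy()) for axis in ("ax", "ay", "az")],
        axis=1,
    )
```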
step 2-2: noise processing;
removing noise caused by gravity and hardware factors from the sensor signals using reference data recorded under silent conditions, and filtering with a high-pass filter whose cut-off frequency is a Hz;
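A sketch of the step 2-2 filtering, assuming a zero-phase Butterworth high-pass; the text only fixes the cut-off a Hz (a = 20 in the preferred embodiment), so the filter family and order are assumptions, and the silent-condition reference is reduced here to a simple mean-baseline removal.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise(x: np.ndarray, x_silent: np.ndarray, fs: float = 500.0,
            cutoff: float = 20.0, order: int = 4) -> np.ndarray:
    """Remove the silent-condition baseline, then high-pass above `cutoff` Hz."""
    x = x - x_silent.mean(axis=0)                  # gravity / hardware bias estimated in silence
    b, a = butter(order, cutoff, btype="highpass", fs=fs)
    return filtfilt(b, a, x, axis=0)               # zero-phase filtering
```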
step 2-3: feature extraction
dividing the sensor signal into a number of segments with fixed overlap, the segment length and overlap length being set to 256 and 64 samples respectively; windowing each segment with a Hamming window; computing the spectrum by the short-time Fourier transform (STFT) to obtain an STFT matrix that records the amplitude and phase at each time and frequency; and converting it into the corresponding spectrogram according to formula (1):
spectrogram{x(n)} = |STFT{x(n)}|^2    (1)
wherein x(n) denotes the acceleration sensor signal and STFT{x(n)} denotes the STFT matrix corresponding to the acceleration sensor signal;
converting the spectrogram into a mel spectrogram, formulas (2) and (3) realizing the interconversion between the frequency f and the mel scale f_mel on the spectrogram corresponding to the acceleration sensor signal:
finally, converting the spectrogram into a mel spectrogram using formula (2) to obtain the feature mel spectrogram;
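Formulas (2) and (3) are not reproduced in the text above; the sketch below assumes they are the standard Hz-to-mel conversion and its inverse, so the exact constants are an assumption rather than the patent's definition.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Standard mel scale: f_mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    # Inverse mapping back to Hz
    return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)
```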
step 3: reconstructing voice;
adopting a wavelet-based multi-scale time-frequency generative adversarial network speech reconstruction model, WMelGAN, to convert the sensor data processed in step 2 into a synthesized speech signal;
step 3-1: the wavelet-based multi-scale time-frequency generative adversarial network speech reconstruction model comprises three components: a generator, a multi-scale discriminator and a wavelet discriminator;
the generator converts the mel spectrogram into a synthesized speech signal; it is built from a series of up-sampling transposed convolution layers, each followed by a residual network with dilated convolution layers;
the sub-discriminators of the multi-scale discriminator share a convolutional network structure and operate on different data scales to judge the generator output at each scale; each sub-discriminator is a down-sampling network consisting, in sequence, of one 1-D convolution layer, four 1-D grouped convolution layers and a final 1-D convolution layer, and its input comprises the original speech signal and the synthesized speech signal produced by the generator;
the wavelet discriminator decomposes the input speech signal into four sub-band signals of different frequency bands through three levels of wavelet decomposition and evaluates the generation quality with stacked convolutional neural networks; in the WMelGAN model the generator and the discriminators are trained adversarially until the discriminators can no longer tell whether the audio produced by the generator is real or fake, and the trained generator is finally used to produce the synthesized speech signal;
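An illustrative PyTorch sketch of the wavelet discriminator of step 3-1: three levels of a 1-D wavelet decomposition split the waveform into four sub-bands, each scored by a small stack of 1-D convolutions. The Haar wavelet, channel widths and kernel sizes are assumptions; only the overall structure follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT(nn.Module):
    """One level of a 1-D Haar transform implemented as fixed, differentiable convolutions."""
    def __init__(self):
        super().__init__()
        s = 2.0 ** -0.5
        self.register_buffer("lo", torch.tensor([[[s,  s]]]))   # low-pass (approximation) filter
        self.register_buffer("hi", torch.tensor([[[s, -s]]]))   # high-pass (detail) filter

    def forward(self, x):                        # x: (batch, 1, samples)
        return F.conv1d(x, self.lo, stride=2), F.conv1d(x, self.hi, stride=2)

class WaveletDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.dwt = HaarDWT()

        def head():                              # per-band convolutional scorer
            return nn.Sequential(
                nn.Conv1d(1, 16, 15, stride=2, padding=7), nn.LeakyReLU(0.2),
                nn.Conv1d(16, 64, 15, stride=2, padding=7), nn.LeakyReLU(0.2),
                nn.Conv1d(64, 1, 3, padding=1),  # per-position real/fake score
            )
        self.heads = nn.ModuleList([head() for _ in range(4)])

    def forward(self, wav):                      # wav: (batch, 1, samples)
        a1, d1 = self.dwt(wav)                   # level 1
        a2, d2 = self.dwt(a1)                    # level 2
        a3, d3 = self.dwt(a2)                    # level 3
        bands = [a3, d3, d2, d1]                 # four sub-bands, coarse to fine
        return [h(b) for h, b in zip(self.heads, bands)]
```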
step 3-2: a loss function;
adversarial training between the generator and the discriminators is carried out with a set of loss functions whose objectives are given in formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD    (4)
loss_G = loss_gen_TD + loss_gen_WD + 45*loss_mel + 2*(loss_feature_TD + loss_feature_WD)    (5)
wherein loss_D denotes the overall loss function of the two discriminators and loss_G denotes the loss function of the generator; the generator loss consists of five parts: the multi-scale discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the feature-map loss of the multi-scale time-domain discriminator loss_feature_TD and the feature-map loss of the wavelet discriminator loss_feature_WD;
the overall loss of the two discriminators is divided into the multi-scale discriminator loss loss_disc_TD and the wavelet discriminator loss loss_disc_WD, defined as:
wherein x denotes the original speech waveform, s the mel spectrogram, z a Gaussian noise vector, TD and WD the multi-scale discriminator and the wavelet discriminator respectively, and the subscript k the different scales; G(s, z) denotes the generated speech signal and E[·] the expectation;
the multi-scale discriminator loss of the generator, loss_gen_TD, and the wavelet discriminator loss of the generator, loss_gen_WD, are defined as:
wherein TD_k (k = 1, 2, 3) denotes the three multi-scale sub-discriminators corresponding to the different scales;
the mel loss loss_mel quantifies the difference between the original speech waveform and the synthesized speech waveform using multi-scale mel spectrograms, and is defined as:
loss_mel = ||MEL(x) - MEL(G(s, z))||_F    (10)
wherein ||·||_F denotes the Frobenius norm and MEL(·) denotes the mel-spectrogram transform of a given speech signal;
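A minimal sketch of how the terms in formulas (4) and (5) are combined. The 45x mel weight and 2x feature-map weight come from the text above; the Frobenius-norm mel loss follows formula (10), while the adversarial and feature-matching terms are left as inputs because their exact definitions (formulas (6)-(9)) are not reproduced here.

```python
import torch

def mel_loss(mel_real: torch.Tensor, mel_fake: torch.Tensor) -> torch.Tensor:
    # loss_mel = || MEL(x) - MEL(G(s, z)) ||_F  (Frobenius norm over the mel spectrogram)
    return torch.linalg.norm(mel_real - mel_fake)

def discriminator_loss(loss_disc_td, loss_disc_wd):
    # Formula (4): total loss of the two discriminators.
    return loss_disc_td + loss_disc_wd

def generator_loss(loss_gen_td, loss_gen_wd, loss_mel_val, loss_feat_td, loss_feat_wd):
    # Formula (5): adversarial terms + 45 * mel loss + 2 * feature-map losses.
    return loss_gen_td + loss_gen_wd + 45.0 * loss_mel_val + 2.0 * (loss_feat_td + loss_feat_wd)
```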
step 4: generated speech evaluation; evaluating the intelligibility and naturalness of the reconstructed speech signal through a subjective evaluation system and an objective evaluation system;
step 4-1: the subjective evaluation system uses the mean opinion score (MOS) as the evaluation index;
Step 4-2: an objective evaluation system;
step 4-2-1: measuring the difference between the synthesized speech and the original speech signal with three objective metrics: peak signal-to-noise ratio (PSNR), mel-cepstral distortion (MCD) and the root-mean-square error of the fundamental frequency (F0 RMSE); measuring the intelligibility of the synthesized speech with the accuracy of a dictation test (ADT);
peak signal-to-noise ratio PSNR measures the ratio between the maximum possible power of a signal and the noise power affecting its quality:
wherein S_peak denotes the peak speech signal, S the speech signal and N the noise signal;
step 4-2-2: quantifying the difference between the mel-cepstral features (MFCCs) of the original speech signal and of the synthesized speech signal using the mel-cepstral distortion (MCD); specifically, the mel-cepstral distortion of the k-th frame is expressed as:
wherein s denotes the synthesized speech signal, r the original speech signal, M the number of mel filters, and MC_s(i, k) and MC_r(i, k) the MFCC coefficients of the synthesized speech signal s and the original speech signal r;
omitting the subscript of MC, expressed as:
wherein X_{k,n} denotes the logarithmic power output of the n-th triangular filter, namely:
wherein X(k, m) denotes the Fourier transform result of the k-th input speech frame at frequency index m, and w_n(m) denotes the n-th mel filter;
step 4-2-3: the difference between the fundamental frequency of the original speech and the fundamental frequency of the synthesized speech is compared using the root mean square error F0RMSE of the fundamental frequency, expressed as:
wherein f0 denotes the fundamental-frequency feature of the original speech signal and f0' denotes the fundamental-frequency feature of the synthesized speech signal.
Preferably, a = 20.
Preferably, the number of mel filters M = 10.
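A hedged sketch of the objective metrics of step 4-2. The exact PSNR and MCD formulas are not reproduced in the text above, so the peak-power form of the PSNR and the common (10/ln 10)·sqrt(2·Σ) MCD scaling below are assumptions; the F0 RMSE is the plain root-mean-square error between frame-level fundamental-frequency tracks.

```python
import numpy as np

def psnr(ref: np.ndarray, syn: np.ndarray) -> float:
    """Peak signal power of the reference over the power of the residual noise, in dB."""
    noise_power = np.mean((ref - syn) ** 2)
    return float(10.0 * np.log10(np.max(ref ** 2) / noise_power))

def mcd(mfcc_ref: np.ndarray, mfcc_syn: np.ndarray) -> float:
    """Mean per-frame mel-cepstral distortion; inputs are (frames, M) MFCC matrices."""
    diff2 = np.sum((mfcc_ref - mfcc_syn) ** 2, axis=1)
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * diff2)))

def f0_rmse(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    """Root-mean-square error between the two fundamental-frequency contours."""
    return float(np.sqrt(np.mean((f0_ref - f0_syn) ** 2)))
```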
The beneficial effects of the invention are as follows:
The data acquisition of the invention can obtain motion-sensor sampling signals for multiple audio files in batches under several sampling-frequency modes. The data processing implements a speaker-independent, general sensor speech synthesis framework that reduces reliance on any particular speaker's data set. The voice reconstruction improves the high-frequency performance and robustness of speech synthesis by optimizing the model architecture and introducing the wavelet discriminator, so that the synthesized speech is closer to the original speech. The evaluation of the generated speech demonstrates that the smartphone motion sensor has the sensing capability to recover the loudspeaker's speech.
Drawings
Fig. 1 is a schematic frame diagram of a deep learning voice reconstruction method based on a smart phone acceleration sensor.
Fig. 2 is a schematic diagram of the speech reconstruction algorithm of the invention, based on a wavelet multi-scale time-frequency generative adversarial network.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention aims to provide a voice reconstruction method based on the smartphone acceleration sensor, addressing the privacy and security risk carried by motion-sensor data in the prior art while improving the quality and efficiency of speech synthesis.
The invention performs voice reconstruction on the signal of the smartphone's built-in acceleration sensor, establishes the intrinsic association between the state of the built-in acceleration sensor and the phone loudspeaker, builds a propagation model of the vibration signal between the acceleration sensor and the loudspeaker, and thereby reconstructs the original speech signal from the accelerometer vibrations caused by the loudspeaker.
As shown in fig. 1, the present invention adopts the following technical scheme:
and (3) a step of: collecting data; and playing an audio file by utilizing a self-developed android APP, and selecting a plurality of motion sensors and a plurality of frequency sampling modes according to requirements to acquire mainboard vibration signals caused by a smart phone loudspeaker. The multi-audio file motion sensor sampling signal acquisition device has the beneficial effects that the multi-audio file motion sensor sampling signals under various frequency sampling modes can be obtained in batches.
And II: data processing; integrating a plurality of digital signal processing schemes, carrying out data preprocessing, noise analysis and elimination and data feature (Mel spectrogram) extraction on the motion sensor signals, carrying out noise modeling on the acceleration sensor signals, and separating out noise components in the perception data. The method has the advantages that a speaker-independent general sensor voice synthesis framework is realized, and the dependence on a specific speaker data set is reduced.
Thirdly,: reconstructing voice; as shown in fig. 2, a speech reconstruction algorithm for generating an countermeasure network based on a multi-scale time-frequency domain of wavelets is proposed. The sensor-voice mapping model based on the structure design of the generating countermeasure network converts the preprocessed motion sensor data into voice waveform data, a mapping function between denoising perception data and original voice data is established by using the countermeasure generating network, and a wavelet discriminator is introduced to improve the high-frequency effect of voice synthesis. The method has the advantages that the high-frequency performance and the robustness of the voice synthesis are improved through optimizing the model architecture and introducing the wavelet discriminator, so that the synthesized voice is closer to the original voice.
Fourth, the method comprises the following steps: and generating a voice evaluation system. And evaluating indexes such as the intelligibility, the naturalness and the like of the reconstructed voice signal through a subjective evaluation system and an objective evaluation system. The intelligent mobile phone motion sensor has the beneficial effects that the intelligent mobile phone motion sensor has the perception capability of restoring the voice of the loudspeaker.
Specific examples:
the specific steps of the invention are as follows:
step 1: and (5) data acquisition.
An audio file is played by a self-developed Android APP, several motion sensors and sampling-frequency modes are selected as required, and the motherboard vibration signal caused by the smartphone loudspeaker is collected and stored locally. Specifically, the application acquires all audio files stored locally on the phone through the interface provided by Android and displays them on the application home page. The user selects the audio for which sensor data are to be collected; after tapping to start collection, the application plays the audio and records the sensor signals. For each piece of audio, the application obtains the linear acceleration sensor, acceleration sensor and gyroscope sensor objects and registers them as listeners, then creates a media player object (MediaPlayer) and starts playing the audio; during playback the sensors record each vibration sample together with its timestamp; when playback finishes, the MediaPlayer returns an end signal, and sensor collection is stopped by unregistering the listeners of the motion-sensor objects; finally, the linear acceleration, acceleration and gyroscope signals are each written to a CSV file named after the audio. The user can select among several sampling-frequency modes to capture the output signal of the motion sensor.
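The APP writes one CSV file per sensor and audio clip; a minimal sketch of consuming such a log follows, assuming each row holds a timestamp in milliseconds and the three axis readings. The column layout and names are illustrative, not taken from the patent's implementation.

```python
import pandas as pd

def load_sensor_csv(path: str) -> pd.DataFrame:
    """Read one sensor log (timestamp_ms, ax, ay, az) and sort it by time."""
    df = pd.read_csv(path, header=None, names=["timestamp_ms", "ax", "ay", "az"])
    return df.sort_values("timestamp_ms").reset_index(drop=True)
```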
Step 2: and (5) data processing.
Several digital signal processing schemes are integrated to perform data preprocessing, noise analysis and elimination, and data feature (mel spectrogram) extraction on the motion-sensor signals. (1) All time points without acceleration data are located through the timestamps and the missing data are filled in by linear interpolation; for acceleration readings that share the same timestamp, their mean is taken to represent the acceleration at that timestamp, so that each timestamp corresponds to exactly one acceleration value. (2) Noise processing: noise caused by gravity and hardware factors is removed from the sensor signals using reference data recorded under silent conditions; high-pass filtering with a cut-off frequency of 50 Hz is applied to remove the influence of human activity. (3) The sensor signal is divided into short segments with fixed overlap, the segment length and overlap being set to 256 and 64 samples respectively; each segment is windowed with a Hamming window, and its spectrum is computed by the short-time Fourier transform (STFT) to obtain an STFT matrix. The matrix records the amplitude and phase at each time and frequency and, based on the following equation,
spectrogram{x(n)} = |STFT{x(n)}|^2    (1)
is converted into the corresponding spectrogram, where x(n) denotes the acceleration sensor signal and STFT{x(n)} the STFT matrix corresponding to it. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the value at (x, y) represents the magnitude of frequency y at time x. To make the frequency axis follow the logarithmic frequency perception of the human ear, the spectrogram is converted into a mel spectrogram. The interconversion between frequency (Hz) and the mel scale (mel) on the spectrogram corresponding to the acceleration sensor signal is achieved with the following two formulas:
finally, a mel filter bank is applied to obtain the feature mel spectrogram.
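An end-to-end sketch of this feature extraction: 256-sample Hamming-windowed segments with 64-sample overlap, the power spectrogram |STFT{x(n)}|^2, and a mel filterbank projection. librosa is used only for a standard mel filterbank; the 500 Hz sensor rate and 40 mel bands are assumptions.

```python
import numpy as np
import librosa
from scipy.signal import stft

def accel_to_mel(x: np.ndarray, fs: int = 500, n_mels: int = 40) -> np.ndarray:
    # Power spectrogram: |STFT{x(n)}|^2 with Hamming window, nperseg=256, noverlap=64.
    _, _, Z = stft(x, fs=fs, window="hamming", nperseg=256, noverlap=64)
    power = np.abs(Z) ** 2                                  # (freq_bins, frames)

    # Standard mel filterbank on the same FFT grid, followed by a log compression.
    mel_fb = librosa.filters.mel(sr=fs, n_fft=256, n_mels=n_mels)
    return np.log(mel_fb @ power + 1e-10)                   # (n_mels, frames)
```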
Step 3: and (5) reconstructing voice.
The preprocessed motion-sensor data are converted into speech waveform data by a sensor-to-speech mapping model designed on the generative adversarial network structure. As shown in fig. 2, the WMelGAN model of the invention, built on the idea of generative adversarial networks, comprises three core components: a generator, a multi-scale discriminator and a wavelet discriminator. The generator converts the mel spectrogram into an intelligible speech waveform: a series of up-sampling transposed convolution layers turn the mel-spectrogram sequence into a speech waveform of much higher data rate, and a residual network with dilated convolution layers is placed after each transposed convolution module to obtain a larger receptive field, since a wider receptive field helps capture long-range correlations in long vectors and learn acoustic structure. The sub-discriminators of the multi-scale discriminator share a similar convolutional network structure and operate on different data scales to judge the generator's output at each scale. Each time-domain sub-discriminator is a down-sampling network formed by one 1-D convolution layer and four 1-D grouped convolution layers, and its input consists of the original speech signal and the speech signal produced by the generator network. The wavelet discriminator decomposes the input speech signal into four sub-band signals of different frequency bands through three levels of wavelet decomposition and evaluates the generation quality with stacked convolutional neural networks. In the WMelGAN framework the generator and the discriminators are trained adversarially, until this set of discriminators can no longer tell whether the audio produced by the generator is real or fake; the trained generator is finally used to produce a highly intelligible, high-quality speech signal.
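A MelGAN-style PyTorch sketch of the generator described above: transposed-convolution up-sampling stages, each followed by a residual stack of dilated 1-D convolutions that widens the receptive field. The up-sampling factors, channel widths and dilation rates are assumptions; only the overall layout follows the text.

```python
import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, 3, dilation=d, padding=d),  # dilated conv widens the receptive field
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, 1),
            )
            for d in (1, 3, 9)
        ])

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)            # residual connection
        return x

class Generator(nn.Module):
    def __init__(self, n_mels: int = 40, base: int = 256, factors=(8, 8, 2, 2)):
        super().__init__()
        layers = [nn.Conv1d(n_mels, base, 7, padding=3)]
        ch = base
        for f in factors:               # each stage raises the time resolution by a factor `f`
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(ch, ch // 2, 2 * f, stride=f, padding=f // 2),
                ResidualStack(ch // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.2), nn.Conv1d(ch, 1, 7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):             # mel: (batch, n_mels, frames) -> (batch, 1, samples)
        return self.net(mel)
```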
The invention provides a wavelet-based multi-scale time-frequency generative adversarial network speech reconstruction algorithm, in which adversarial training between the generator and the discriminators is carried out with a set of loss functions whose objectives are given in formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD    (4)
loss_G = loss_gen_TD + loss_gen_WD + 45*loss_mel + 2*(loss_feature_TD + loss_feature_WD)    (5)
where loss_D denotes the loss function of the discrimination networks and loss_G the loss function of the generation network. The feature-map losses are computed from the features output by the multi-scale time-domain discrimination network and the wavelet discrimination network, and they minimize the L1 distance between the discriminator feature maps of the original speech signal and those of the generated speech signal. The generator loss consists of five parts: the multi-scale time-domain discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the feature-map loss of the multi-scale time-domain discriminator loss_feature_TD and the feature-map loss of the wavelet discriminator loss_feature_WD. The multi-scale time-domain discrimination loss loss_disc_TD and the wavelet discrimination loss loss_disc_WD of the discriminators are defined as:
where x denotes the original waveform, s the acoustic feature (e.g. the mel spectrogram), z a Gaussian noise vector, and TD and WD the multi-scale time-domain discrimination network and the wavelet discrimination network respectively. The multi-scale time-domain discrimination loss of the generator, loss_gen_TD, and the wavelet discrimination loss of the generator, loss_gen_WD, are defined as:
where TD_k (k = 1, 2, 3) denotes the three time-domain discriminators of different scales whose losses are summed. The mel loss loss_mel quantifies, with multi-scale mel spectrograms, the gap between the original speech waveform and the synthesized speech waveform, and is defined as:
loss_mel = ||MEL(x) - MEL(G(s, z))||_F    (10)
where ||·||_F denotes the Frobenius norm and MEL(·) denotes the mel-spectrogram transform of a given speech signal.
Step 4: a speech evaluation is generated.
And evaluating indexes such as the intelligibility, the naturalness and the like of the reconstructed voice signal through a subjective evaluation system and an objective evaluation system. Subjective evaluation system, refer to the field of speech synthesis. Since the final service object of speech synthesis is human, a general speech synthesis model uses a Mean Opinion Score (MOS) as an evaluation index. The research system refers to design of a subjective evaluation system in evaluation related to voice effect; objective evaluation system, refer to the human behavior recognition and signal processing field. Specifically, the invention adopts three objective indexes of peak signal-to-noise ratio (PSNR), mel cepstrum distortion (Mel-Cepstral Distortion, MCD) and root mean square error (F0 Root Mean Square Error, F0 RMSE) of fundamental frequency to measure the difference between the synthesized voice and the reference voice signals. In addition, the invention also adopts dictation test accuracy (Accuracy of Dictation Test, ADT) to measure the intelligibility of the synthesized voice. In order to quantify the quality of the synthesized speech signal reconstructed from the acceleration sensor signal, the peak signal-to-noise ratio (PSNR) may measure the ratio between the maximum possible power of the signal and the noise power affecting its quality:
in order to measure the distance between the original speech signal and the synthesized speech signal, the present invention proposes to quantify the difference between the mel-frequency cepstrum features (MFCCs) of these 2 signals using mel-frequency cepstrum distortion (MCD). Specifically, mel-cepstrum distortion of the kth frame can be expressed as:
where s denotes the synthesized speech signal, r the original speech signal, and M the number of mel filters (M = 10 in the invention); MC_s(i, k) and MC_r(i, k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r. For simplicity, after omitting the subscripts of MC, this can be expressed as:
where X_{k,n} denotes the logarithmic power output of the n-th triangular filter, namely:
where X(k, m) denotes the Fourier transform result of the k-th input speech frame at frequency index m, and w_n(m) denotes the n-th mel filter. Clearly, the lower the mel-cepstral distortion (MCD), the closer the synthesized audio reconstructed from the built-in acceleration sensor is to the original speech signal. The invention uses the root-mean-square error of the fundamental frequency (F0 RMSE) to compare the difference between the fundamental frequency of the original speech and that of the synthesized speech; the lower the value, the closer the fundamental-frequency contours of the two signals and the better the generation quality. Specifically, the root-mean-square error of the fundamental frequency is expressed as:
where f0 denotes the fundamental-frequency feature of the original speech signal and f0' denotes the fundamental-frequency feature of the synthesized speech signal.
The invention takes the dictation test accuracy and the mean opinion score as the main indexes, and the peak signal-to-noise ratio, the mel-cepstral distortion and the root-mean-square error of the fundamental frequency as objective indexes, to construct an evaluation model for the generated speech, thereby measuring the quality of the reconstructed speech.
Claims (3)
1. The deep learning voice reconstruction method based on the smart phone acceleration sensor is characterized by comprising the following steps of:
step 1: collecting data;
playing the audio file by using a mobile phone;
collecting signals of an acceleration sensor, and recording the signals and corresponding time stamps;
step 2: data processing; performing linear interpolation, noise processing and feature extraction on the sensor signals;
step 2-1: linear interpolation;
locating, via the timestamps, all time points at which acceleration data are missing, and filling in the missing data by linear interpolation; for acceleration readings that share the same timestamp, taking their mean value to represent the acceleration at that timestamp, so that each timestamp corresponds to exactly one acceleration value;
step 2-2: noise processing;
removing noise caused by gravity and hardware factors from the sensor signals using reference data recorded under silent conditions, and filtering with a high-pass filter whose cut-off frequency is a Hz;
step 2-3: feature extraction
dividing the sensor signal into a number of segments with fixed overlap, the segment length and overlap length being set to 256 and 64 samples respectively; windowing each segment with a Hamming window; computing the spectrum by the short-time Fourier transform (STFT) to obtain an STFT matrix that records the amplitude and phase at each time and frequency; and converting it into the corresponding spectrogram according to formula (1):
spectrogram{x(n)} = |STFT{x(n)}|^2    (1)
wherein x (n) represents an acceleration sensor signal, and STFT { x (n) } represents an STFT matrix corresponding to the acceleration sensor signal;
converting the spectrogram into a mel spectrogram, formulas (2) and (3) realizing the interconversion between the frequency f and the mel scale f_mel on the spectrogram corresponding to the acceleration sensor signal:
finally, converting the spectrogram into a mel spectrogram using formula (2) to obtain the feature mel spectrogram;
step 3: reconstructing voice;
adopting a wavelet-based multi-scale time-frequency generative adversarial network speech reconstruction model, WMelGAN, to convert the sensor data processed in step 2 into a synthesized speech signal;
step 3-1: the wavelet-based multi-scale time-frequency generative adversarial network speech reconstruction model comprises three components: a generator, a multi-scale discriminator and a wavelet discriminator;
the generator converts the mel spectrogram into a synthesized speech signal; it is built from a series of up-sampling transposed convolution layers, each followed by a residual network with dilated convolution layers;
the sub-discriminators of the multi-scale discriminator share a convolutional network structure and operate on different data scales to judge the generator output at each scale; each sub-discriminator is a down-sampling network consisting, in sequence, of one 1-D convolution layer, four 1-D grouped convolution layers and a final 1-D convolution layer, and its input comprises the original speech signal and the synthesized speech signal produced by the generator;
the wavelet discriminator decomposes the input speech signal into four sub-band signals of different frequency bands through three levels of wavelet decomposition and evaluates the generation quality with stacked convolutional neural networks; in the WMelGAN model the generator and the discriminators are trained adversarially until the discriminators can no longer tell whether the audio produced by the generator is real or fake, and the trained generator is finally used to produce the synthesized speech signal;
step 3-2: a loss function;
adversarial training between the generator and the discriminators is carried out with a set of loss functions whose objectives are given in formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD    (4)
loss_G = loss_gen_TD + loss_gen_WD + 45*loss_mel + 2*(loss_feature_TD + loss_feature_WD)    (5)
wherein loss_D denotes the overall loss function of the two discriminators and loss_G denotes the loss function of the generator; the generator loss consists of five parts: the multi-scale discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the feature-map loss of the multi-scale time-domain discriminator loss_feature_TD and the feature-map loss of the wavelet discriminator loss_feature_WD;
the overall loss of the two discriminators is divided into the multi-scale discriminator loss loss_disc_TD and the wavelet discriminator loss loss_disc_WD, defined as:
wherein x denotes the original speech waveform, s the mel spectrogram, z a Gaussian noise vector, TD and WD the multi-scale discriminator and the wavelet discriminator respectively, and the subscript k the different scales; G(s, z) denotes the generated speech signal and E[·] the expectation;
the multi-scale discriminator loss of the generator, loss_gen_TD, and the wavelet discriminator loss of the generator, loss_gen_WD, are defined as:
wherein TD_k (k = 1, 2, 3) denotes the three multi-scale sub-discriminators corresponding to the different scales;
the mel loss loss_mel quantifies the difference between the original speech waveform and the synthesized speech waveform using multi-scale mel spectrograms, and is defined as:
loss_mel = ||MEL(x) - MEL(G(s, z))||_F    (10)
wherein ||·||_F denotes the Frobenius norm and MEL(·) denotes the mel-spectrogram transform of a given speech signal;
step 4: generated speech evaluation; evaluating the intelligibility and naturalness of the reconstructed speech signal through a subjective evaluation system and an objective evaluation system;
step 4-1: the subjective evaluation system uses the mean opinion score (MOS) as the evaluation index;
Step 4-2: an objective evaluation system;
step 4-2-1: measuring the difference between the synthesized speech and the original speech signal with three objective metrics: peak signal-to-noise ratio (PSNR), mel-cepstral distortion (MCD) and the root-mean-square error of the fundamental frequency (F0 RMSE); measuring the intelligibility of the synthesized speech with the accuracy of a dictation test (ADT);
peak signal-to-noise ratio PSNR measures the ratio between the maximum possible power of a signal and the noise power affecting its quality:
wherein S_peak denotes the peak speech signal, S the speech signal and N the noise signal;
step 4-2-2: quantifying the difference between the mel-cepstral features (MFCCs) of the original speech signal and of the synthesized speech signal using the mel-cepstral distortion (MCD); specifically, the mel-cepstral distortion of the k-th frame is expressed as:
wherein s denotes the synthesized speech signal, r the original speech signal, M the number of mel filters, and MC_s(i, k) and MC_r(i, k) the MFCC coefficients of the synthesized speech signal s and the original speech signal r;
omitting the subscript of MC, expressed as:
wherein X_{k,n} denotes the logarithmic power output of the n-th triangular filter, namely:
wherein X(k, m) denotes the Fourier transform result of the k-th input speech frame at frequency index m, and w_n(m) denotes the n-th mel filter;
step 4-2-3: the difference between the fundamental frequency of the original speech and the fundamental frequency of the synthesized speech is compared using the root mean square error F0RMSE of the fundamental frequency, expressed as:
2. The deep learning voice reconstruction method based on the smart phone acceleration sensor according to claim 1, wherein a = 20.
3. The deep learning voice reconstruction method based on the smart phone acceleration sensor according to claim 1, wherein the number of mel filters M = 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310387588.XA CN116386589A (en) | 2023-04-12 | 2023-04-12 | Deep learning voice reconstruction method based on smart phone acceleration sensor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310387588.XA CN116386589A (en) | 2023-04-12 | 2023-04-12 | Deep learning voice reconstruction method based on smart phone acceleration sensor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116386589A true CN116386589A (en) | 2023-07-04 |
Family
ID=86978564
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310387588.XA Pending CN116386589A (en) | 2023-04-12 | 2023-04-12 | Deep learning voice reconstruction method based on smart phone acceleration sensor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116386589A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727329A (en) * | 2024-02-07 | 2024-03-19 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
CN117727329B (en) * | 2024-02-07 | 2024-04-26 | 深圳市科荣软件股份有限公司 | Multi-target monitoring method for intelligent supervision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |