CN116386589A - Deep learning voice reconstruction method based on smart phone acceleration sensor - Google Patents

Deep learning voice reconstruction method based on smart phone acceleration sensor

Info

Publication number
CN116386589A
CN116386589A (application CN202310387588.XA)
Authority
CN
China
Prior art keywords
loss
mel
voice
signal
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310387588.XA
Other languages
Chinese (zh)
Inventor
梁韵基
严笑凯
王梓哲
秦煜辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202310387588.XA
Publication of CN116386589A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a deep-learning voice reconstruction method based on a smartphone acceleration sensor. First, data are collected: motherboard vibration signals caused by the smartphone loudspeaker are captured by several motion sensors under multiple frequency-sampling modes. Next, the sensor signals are processed through linear interpolation, noise removal, and feature extraction. Then, voice reconstruction is performed with a proposed wavelet-based multi-scale time-frequency generative adversarial network (GAN) algorithm that converts the preprocessed motion-sensor data into speech waveform data. Finally, the generated speech is evaluated with subjective and objective metrics. The invention improves the high-frequency performance and robustness of speech synthesis, so that the synthesized speech is closer to the original speech.

Description

Deep learning voice reconstruction method based on smart phone acceleration sensor
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a deep learning voice reconstruction method.
Background
With the development of mobile-internet infrastructure over the past decade, e-commerce, social networks, and new-media industries have flourished, and the consumer market for mobile smart terminals, represented by smartphones, has expanded accordingly. According to recent data from the statistics provider Statista, the number of smartphone users worldwide reached 6.64 billion by the third quarter of 2022, about 83% of the world population. This explosive growth in smartphone users reflects an ever tighter coupling between human daily life and smart devices. The motion sensor is an indispensable component in the design paradigm of modern mobile smart devices: it senses the device's external environment, recognizes its motion state, and reads user interaction input, and it is widely deployed across mobile terminals. Motion sensors, represented by the acceleration sensor, are typically mounted on the smartphone motherboard, tightly coupled with core components including the processor, speaker, and microphone, jointly serving the operation of the core system.
Inside the handset, the speaker and a number of sensors are integrated on a circuit board, which acts as an efficient solid transmission medium. When the speaker operates, sound vibrations propagate through the entire motherboard, so a motion sensor mounted on the same board can capture the solid-borne vibrations caused by the speaker. Moreover, because the motion sensors and the speaker sit on the same circuit board in close physical contact, the voice signal emitted by the speaker always has a measurable effect on motion sensors such as the gyroscope and accelerometer, no matter how the smartphone is placed (on a desk or in the hand). These motion sensors are sensitive to vibration, so the signals they capture always contain acoustic vibrations transmitted from the phone speaker through the motherboard.
Previous studies generally framed this as a classification problem and applied machine-learning solutions to map features extracted from the non-acoustic signals to words. Numerous studies have demonstrated the feasibility of identifying digits, words, and even keywords from sensor vibration signals. For example, a team at Zhejiang University found that the sampling frequency of built-in acceleration sensors in smartphones released after 2018 reaches 500 Hz, which covers almost the entire fundamental-frequency range (85-255 Hz) of adult voices. They proposed a deep-learning speech recognition system that collects the loudspeaker's vibration signal through a low-permission spy application, converts it into a spectrogram, and uses DenseNet as the backbone network to classify and recognize the speech content (text) carried by the acceleration-signal spectrogram. Han et al. proposed a distributed side-channel attack using vibration signals captured by a sensor network (including detectors, accelerometers, and gyroscopes); to overcome the low sampling frequency of individual sensors, they used a distributed form of TI-ADC (time-interleaved analog-to-digital converter) to approximate a high overall sampling rate while keeping each node's sampling rate low. However, existing work can only recognize a few words from a very limited vocabulary.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a deep-learning voice reconstruction method based on a smartphone acceleration sensor. First, data are collected: motherboard vibration signals caused by the smartphone loudspeaker are captured by several motion sensors under multiple frequency-sampling modes. Next, the sensor signals are processed through linear interpolation, noise removal, and feature extraction. Then, voice reconstruction is performed with a proposed wavelet-based multi-scale time-frequency generative adversarial network algorithm that converts the preprocessed motion-sensor data into speech waveform data. Finally, the generated speech is evaluated with subjective and objective metrics. The invention improves the high-frequency performance and robustness of speech synthesis, so that the synthesized speech is closer to the original speech.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
step 1: collecting data;
playing the audio file by using a mobile phone;
collecting signals of an acceleration sensor, and recording the signals and corresponding time stamps;
step 2: data processing; performing linear interpolation, noise processing and feature extraction on the sensor signals;
step 2-1: linear interpolation;
Locate all time points lacking acceleration data via the timestamps and fill the missing data with linear interpolation. For acceleration samples that share the same timestamp, take their mean as the value for that timestamp, so that each timestamp carries exactly one acceleration sample;
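The deduplication and interpolation of step 2-1 can be sketched in NumPy as follows (the function name and the uniform grid step are illustrative assumptions, not part of the patent):

```python
import numpy as np

def regularize(timestamps, values, step):
    """Average samples that share a timestamp, then fill gaps by
    linear interpolation onto a uniform time grid of spacing `step`."""
    t = np.asarray(timestamps, dtype=float)
    v = np.asarray(values, dtype=float)
    # average duplicate timestamps so each stamp has exactly one value
    uniq, inv = np.unique(t, return_inverse=True)
    v_uniq = np.bincount(inv, weights=v) / np.bincount(inv)
    # uniform grid; np.interp fills the missing points linearly
    grid = np.arange(uniq[0], uniq[-1] + 1e-9, step)
    return grid, np.interp(grid, uniq, v_uniq)
```

For example, two samples at timestamp 0 are averaged, and a missing point between timestamps 10 and 30 is interpolated.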
step 2-2: noise processing;
Remove noise caused by gravity and hardware from the sensor signals using reference data recorded under silent conditions, then filter with a high-pass filter whose cut-off frequency is a Hz;
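An idealized FFT-based high-pass filter illustrates the cut-off of step 2-2 (the patent does not specify the filter design; this zero-out-the-low-bins implementation is an assumption for illustration only):

```python
import numpy as np

def highpass(signal, fs, cutoff_hz):
    """Idealized high-pass: zero all spectral components below cutoff_hz."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[freqs < cutoff_hz] = 0.0          # remove DC / low-frequency drift
    return np.fft.irfft(spec, n=len(signal))
```

A DC offset (e.g., gravity on one accelerometer axis) is removed while an in-band 100 Hz component passes through unchanged.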
step 2-3: feature extraction;
Divide the sensor signal into segments with fixed overlap (segment length 256, overlap length 64), window each segment with a Hamming window, and compute each segment's spectrum by short-time Fourier transform (STFT) to obtain the STFT matrix, which records the amplitude and phase at each time and frequency. Convert it into the corresponding spectrogram according to formula (1):
spectrogram{x(n)} = |STFT{x(n)}|²  (1)
where x(n) is the acceleration-sensor signal and STFT{x(n)} is its STFT matrix;
Convert the spectrogram into a mel spectrogram. The conversion between frequency f (Hz) and mel scale f_mel on the spectrogram of the acceleration-sensor signal is realized through formulas (2) and (3), the standard mel-scale mapping:
f_mel = 2595 · log10(1 + f / 700)  (2)
f = 700 · (10^(f_mel / 2595) − 1)  (3)
Finally, the spectrogram is converted into a mel spectrogram using formula (2), yielding the feature mel spectrogram;
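The feature extraction of step 2-3 can be sketched in NumPy (segment length 256, overlap 64, Hamming window, |STFT|² as in formula (1), and the standard mel mapping of formulas (2)-(3); function names are illustrative):

```python
import numpy as np

def spectrogram(x, seg=256, overlap=64):
    """Power spectrogram per formula (1): |STFT{x(n)}|^2."""
    hop = seg - overlap
    win = np.hamming(seg)
    frames = [x[i:i + seg] * win for i in range(0, len(x) - seg + 1, hop)]
    stft = np.fft.rfft(np.asarray(frames), axis=1)   # one STFT row per segment
    return np.abs(stft) ** 2

def hz_to_mel(f):
    """Formula (2)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Formula (3), the inverse of formula (2)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

A 1024-sample signal yields 5 frames of 129 one-sided frequency bins, and the two mel formulas invert each other.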
step 3: reconstructing voice;
Adopt WMelGAN, a voice reconstruction model based on a wavelet multi-scale time-frequency generative adversarial network, to convert the sensor data processed in step 2 into a synthesized speech signal;
step 3-1: the wavelet-based multi-scale time-frequency generative adversarial speech reconstruction model comprises three components: a generator, a multi-scale discriminator, and a wavelet discriminator;
The generator converts the mel spectrogram into a synthesized speech signal. It consists of a series of upsampling transposed convolution layers, each followed by a residual network with dilated convolution layers;
The sub-discriminators of the multi-scale discriminator share a convolution-network model structure but operate at different data scales, judging the generator's output at each scale. Each sub-discriminator is a downsampling network composed of one 1-D convolution layer followed by four 1-D grouped convolution layers; its input comprises the original speech signal and the synthesized speech signal produced by the generator network;
The wavelet discriminator decomposes the input speech signal into four sub-signals in different frequency bands via three levels of wavelet decomposition, and evaluates the generation quality with a stacked convolutional neural network. In the WMelGAN model, the generator and the discriminators are trained adversarially until the discriminators cannot judge the generated audio as true or false; finally, the trained generator produces the final synthesized speech signal;
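The three-level decomposition into four sub-band signals that feeds the wavelet discriminator can be illustrated with a Haar wavelet in NumPy (the patent does not name the wavelet family or the band ordering; Haar and the [a3, d3, d2, d1] ordering are assumptions for brevity):

```python
import numpy as np

def haar_step(x):
    """One Haar DWT level: (approximation, detail), each half the length."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def three_level_bands(x):
    """Three wavelet decompositions -> four sub-signals in different bands."""
    a1, d1 = haar_step(x)   # split off the top half of the spectrum
    a2, d2 = haar_step(a1)
    a3, d3 = haar_step(a2)
    return [a3, d3, d2, d1]
```

Because the Haar transform is orthonormal, the four sub-bands together preserve the signal's energy, so no information is lost before the stacked CNN evaluates each band.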
step 3-2: a loss function;
The adversarial training between the generator and the discriminators is driven by a set of loss functions, with objectives given by formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD  (4)
loss_G = loss_gen_TD + loss_gen_WD + 45·loss_mel + 2·(loss_feature_TD + loss_feature_WD)  (5)
where loss_D is the overall loss of the two discriminators and loss_G is the generator loss. The generator loss has five parts: the multi-scale discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the multi-scale time-domain discriminator feature-map loss loss_feature_TD, and the wavelet discriminator feature-map loss loss_feature_WD;
The overall discriminator loss splits into the multi-scale discriminator loss loss_disc_TD and the wavelet discriminator loss loss_disc_WD, defined as hinge losses:
loss_disc_TD = Σ_k { E_x[ max(0, 1 − TD_k(x)) ] + E_{s,z}[ max(0, 1 + TD_k(G(s,z))) ] }  (6)
loss_disc_WD = E_x[ max(0, 1 − WD(x)) ] + E_{s,z}[ max(0, 1 + WD(G(s,z))) ]  (7)
where x is the original speech waveform, s the mel spectrogram, z a Gaussian noise vector, TD and WD denote the multi-scale discriminator and the wavelet discriminator respectively, subscript k indexes the different scales, G(s,z) is the generated speech signal, and E[·] denotes expectation;
The generator's multi-scale discriminator loss loss_gen_TD and wavelet discriminator loss loss_gen_WD are defined as:
loss_gen_TD = Σ_{k=1}^{3} E_{s,z}[ −TD_k(G(s,z)) ]  (8)
loss_gen_WD = E_{s,z}[ −WD(G(s,z)) ]  (9)
where TD_1, TD_2, TD_3 are the three multi-scale sub-discriminators corresponding to the different scales;
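Assuming a MelGAN-style hinge objective (an assumption: the patent's equation images for (6)-(9) are not reproduced in the text), the adversarial terms for given discriminator scores could be computed as:

```python
import numpy as np

def disc_hinge_loss(d_real, d_fake):
    """Hinge loss for one sub-discriminator, per Eqs. (6)-(7):
    push real scores above +1 and fake scores below -1."""
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def gen_adv_loss(d_fake_per_scale):
    """Generator adversarial term summed over sub-discriminators, per
    Eqs. (8)-(9): maximize each discriminator's score on generated audio."""
    return sum(-np.mean(d) for d in d_fake_per_scale)
```

When the discriminator already separates real (score ≥ 1) from fake (score ≤ −1), its hinge loss is zero; the generator's loss is simply the negated mean fake score.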
The mel loss loss_mel quantifies the difference between the original and synthesized speech waveforms via the multi-scale mel spectrogram, defined as:
loss_mel = || MEL(x) − MEL(G(s,z)) ||_F  (10)
where ||·||_F is the Frobenius norm and MEL(·) is the mel-spectrogram transform of a given speech signal;
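Formula (10), together with the weighted combination of formula (5), can be illustrated as follows (helper names are illustrative):

```python
import numpy as np

def mel_loss(mel_real, mel_fake):
    """Eq. (10): Frobenius norm of the mel-spectrogram difference
    (np.linalg.norm defaults to the Frobenius norm for 2-D arrays)."""
    return np.linalg.norm(mel_real - mel_fake)

def total_gen_loss(adv_td, adv_wd, mel, fm_td, fm_wd):
    """Eq. (5): adversarial terms + 45x mel loss + 2x feature-map losses."""
    return adv_td + adv_wd + 45.0 * mel + 2.0 * (fm_td + fm_wd)
```

With unit inputs the weighting is visible directly: 1 + 1 + 45 + 2·(1 + 1) = 51.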
step 4: generated-voice evaluation system; evaluate the intelligibility and naturalness of the reconstructed speech signal through subjective and objective evaluation systems;
step 4-1: the subjective evaluation system uses the mean opinion score (MOS) as its metric;
Step 4-2: an objective evaluation system;
step 4-2-1: measure the difference between the synthesized and original speech signals with three objective metrics: peak signal-to-noise ratio (PSNR), mel-cepstral distortion (MCD), and root-mean-square error of the fundamental frequency (F0 RMSE); measure the intelligibility of the synthesized speech with the audio dictation test accuracy (ADT);
PSNR measures the ratio between the maximum possible power of a signal and the power of the noise affecting its quality:
PSNR = 10 · log10( S_peak² / mean(N²) )  (11)
where S_peak is the peak value of the speech signal S and N is the noise signal;
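A PSNR computation consistent with the variables above (the exact form of the patent's formula (11) image is unavailable, so this standard peak-power-over-noise-power formulation is an assumption):

```python
import numpy as np

def psnr(clean, noisy):
    """PSNR in dB: peak signal power over mean noise power."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.max(clean) ** 2 / np.mean(noise ** 2))
```

For a unit-peak signal corrupted by a constant error of 0.1, the noise power is 0.01 and the PSNR is 20 dB.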
step 4-2-2: quantify the difference between the mel-cepstral features (MFCCs) of the original and synthesized speech signals using the mel-cepstral distortion (MCD); specifically, the mel-cepstral distortion of the k-th frame is:
MCD(k) = (10 / ln 10) · sqrt( 2 · Σ_{i=1}^{M} ( MC_s(i,k) − MC_r(i,k) )² )  (12)
where r denotes the original speech signal, M is the number of mel filters, and MC_s(i,k) and MC_r(i,k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r;
Omitting the subscript of MC, the coefficients are computed as:
MC(i,k) = Σ_{n=1}^{M} X_{k,n} · cos( i · (n − 0.5) · π / M )  (13)
where X_{k,n} is the logarithmic power output of the n-th triangular filter, namely:
X_{k,n} = ln( Σ_m |X(k,m)|² · w_n(m) )  (14)
where X(k,m) is the Fourier transform of the k-th input speech frame at frequency index m and w_n(m) is the n-th mel filter;
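The per-frame MCD can be sketched as follows (the 10/ln 10 scaling is the commonly used convention, assumed here because the patent's equation image is unavailable):

```python
import numpy as np

def mcd_frame(mc_s, mc_r):
    """Per-frame mel-cepstral distortion per Eq. (12):
    (10 / ln 10) * sqrt(2 * sum of squared MFCC differences)."""
    diff = np.asarray(mc_s, dtype=float) - np.asarray(mc_r, dtype=float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

Identical MFCC vectors give an MCD of zero; a single unit difference gives (10/ln 10)·√2 ≈ 6.14 dB.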
step 4-2-3: compare the fundamental frequency of the original speech with that of the synthesized speech using the root-mean-square error of the fundamental frequency (F0 RMSE), expressed as:
F0RMSE = sqrt( (1/K) · Σ_{k=1}^{K} ( f0(k) − f̂0(k) )² )  (15)
where f0 is the fundamental-frequency contour of the original speech signal, f̂0 is the fundamental-frequency contour of the synthesized speech signal, and K is the number of frames.
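Formula (15) reduces to a standard RMSE over the two F0 tracks:

```python
import numpy as np

def f0_rmse(f0_ref, f0_syn):
    """Eq. (15): root-mean-square error between the reference and
    synthesized fundamental-frequency contours."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    return np.sqrt(np.mean((f0_ref - f0_syn) ** 2))
```

Two tracks that each deviate by 10 Hz per frame yield an F0 RMSE of exactly 10 Hz.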
Preferably, a = 20.
Preferably, the number of mel filters M = 10.
The beneficial effects of the invention are as follows:
The data acquisition can obtain motion-sensor sampling signals for multiple audio files, in batch, under multiple frequency-sampling modes. The data processing realizes a speaker-independent, general-purpose sensor speech-synthesis framework that reduces reliance on any particular speaker's data set. The voice reconstruction improves the high-frequency performance and robustness of speech synthesis by optimizing the model architecture and introducing the wavelet discriminator, so that the synthesized speech is closer to the original speech. The generated-voice evaluation demonstrates that the smartphone motion sensor is capable of recovering the loudspeaker's speech.
Drawings
Fig. 1 is a schematic frame diagram of a deep learning voice reconstruction method based on a smart phone acceleration sensor.
Fig. 2 is a schematic diagram of the wavelet-based multi-scale time-frequency generative adversarial network speech reconstruction algorithm of the invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The invention aims to provide a voice reconstruction method based on the smartphone acceleration sensor, in order to address the privacy and security problem contained in motion-sensor data in the prior art and to improve the quality and efficiency of speech synthesis.
The invention performs voice reconstruction on the signal of the smartphone's built-in acceleration sensor, establishes the intrinsic association between the built-in acceleration sensor and the phone speaker, builds a propagation model of the vibration signal between the acceleration sensor and the phone speaker, and thereby reconstructs the original speech signal from the speaker-induced accelerometer vibrations.
As shown in fig. 1, the present invention adopts the following technical scheme:
I: data collection. A self-developed Android app plays an audio file, and motion sensors and frequency-sampling modes are selected as required to acquire the motherboard vibration signals caused by the smartphone loudspeaker. The benefit is that motion-sensor sampling signals for multiple audio files under multiple frequency-sampling modes can be obtained in batch.
II: data processing. Multiple digital-signal-processing schemes are integrated to perform data preprocessing, noise analysis and elimination, and data-feature (mel spectrogram) extraction on the motion-sensor signals; the acceleration-sensor signal is noise-modeled and the noise components are separated from the sensed data. The benefit is a speaker-independent, general-purpose sensor speech-synthesis framework that reduces dependence on a specific speaker's data set.
III: voice reconstruction. As shown in fig. 2, a speech reconstruction algorithm based on a wavelet multi-scale time-frequency generative adversarial network is proposed. A sensor-to-speech mapping model designed on the generative-adversarial-network structure converts the preprocessed motion-sensor data into speech waveform data; a mapping function between the denoised sensed data and the original speech data is established with the adversarial network, and a wavelet discriminator is introduced to improve the high-frequency quality of speech synthesis. The benefit is improved high-frequency performance and robustness, making the synthesized speech closer to the original speech.
IV: generated-voice evaluation system. The intelligibility, naturalness, and other metrics of the reconstructed speech signal are evaluated through subjective and objective evaluation systems. The benefit is demonstrating that the smartphone motion sensor has the sensing capability to recover the loudspeaker's speech.
Specific examples:
the specific steps of the invention are as follows:
step 1: and (5) data acquisition.
A self-developed Android app plays an audio file; motion sensors and frequency-sampling modes are selected as required, and the motherboard vibration signals caused by the smartphone loudspeaker are collected and stored locally. Specifically, the application obtains all audio files stored on the phone through the interface provided by Android and displays them on the application home page. The user selects on the home page the audio for which sensor data should be collected; after tapping to start collection, the application plays the audio and records the sensor signals. For each piece of audio, the application obtains the linear-acceleration, acceleration, and gyroscope sensor objects and registers them as listeners, then creates a media-player object (MediaPlayer) and starts playback. During playback, each sensor records every vibration sample with its corresponding timestamp. When playback finishes, the MediaPlayer returns an end signal, and sensor acquisition is stopped by unregistering the motion-sensor listeners. Finally, the linear-acceleration, acceleration, and gyroscope signals are each written to a CSV file named after the audio. The user may select among several frequency-sampling modes to capture the motion-sensor output.
Step 2: and (5) data processing.
Multiple digital-signal-processing schemes are integrated to perform data preprocessing, noise analysis and elimination, and data-feature (mel spectrogram) extraction on the motion-sensor signals. (1) Preprocessing: all time points lacking acceleration data are located via the timestamps, and the missing data are filled by linear interpolation; acceleration samples sharing a timestamp are averaged, so each timestamp carries exactly one acceleration sample. (2) Noise processing: noise caused by gravity and hardware is removed from the sensor signals using reference data recorded under silent conditions; high-pass filtering with a 50 Hz cut-off removes the influence of human activity. (3) Feature extraction: the sensor signal is divided into short segments with fixed overlap (segment length 256, overlap 64); each segment is windowed with a Hamming window and its spectrum computed by short-time Fourier transform (STFT), yielding the STFT matrix. The matrix records the amplitude and phase at each time and frequency and is converted into the corresponding spectrogram by
spectrogram{x(n)} = |STFT{x(n)}|²  (1)
where x(n) is the acceleration-sensor signal and STFT{x(n)} is its STFT matrix. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the value at (x, y) is the magnitude of frequency y at time x. To make the frequency axis follow the logarithmic frequency perception of the human ear, the spectrogram is converted into a mel spectrogram. The conversion between frequency (Hz) and mel scale (mel) on the spectrogram of the acceleration-sensor signal uses the two formulas:
f_mel = 2595 · log10(1 + f / 700)
f = 700 · (10^(f_mel / 2595) − 1)
Finally, a mel filter bank is applied to obtain the feature mel spectrogram.
Step 3: and (5) reconstructing voice.
The preprocessed motion sensor data is converted into speech waveform data based on generating a sensor-to-speech mapping model against the network structure design. As shown in fig. 2, the WMelGAN model constructed based on the idea of generating an countermeasure network of the present invention includes three major core components: a generator, a multi-scale discriminant, and a wavelet discriminant. The generator aims at converting the mel-language spectrogram into an understandable voice audio waveform signal, converting the mel-language spectrogram sequence into a voice waveform signal with higher data quantity through a series of up-sampling transposed convolution networks, and arranging a residual network with a cavity convolution layer behind each transposed convolution module to obtain a larger receptive field, wherein the wider receptive field is helpful for capturing cross-space correlation between long vectors and learning acoustic structural characteristics. Different sub-discriminants of the multi-scale discriminant have similar model structures based on a convolutional network, and work on different data scales to judge the fitting effect of the generators of different scales. The network structure of each time domain discriminator is a downsampling structure formed by a layer 1-dimensional convolution and a layer 4-dimensional packet convolution, and the input of the network structure is formed by an original voice signal and a generated voice signal generated by a generating network. The wavelet discriminator decomposes the input voice signal into four sub-signals of different frequency bands through three times of wavelet decomposition, and evaluates and judges the generation effect by using the stacked convolutional neural network. 
Within the WMelGAN framework, the generator and the set of discriminators are trained adversarially until the discriminators can no longer judge whether the audio produced by the generator is real or fake; the trained generator is then used to produce high-quality, highly intelligible speech signals.
The invention provides a wavelet-based multi-scale time-frequency-domain generative adversarial network speech reconstruction algorithm, in which adversarial training between the generator and the discriminators is driven by a set of loss functions; the objectives are given by formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD   (4)

loss_G = loss_gen_TD + loss_gen_WD + 45 · loss_mel + 2 · (loss_feature_TD + loss_feature_WD)   (5)
where loss_D denotes the loss function of the discrimination network and loss_G the loss function of the generation network. The feature-map losses are computed from the features output by the multi-scale time-domain discrimination network and the wavelet discrimination network; they minimize the L1 distance between the discriminator feature maps of the original speech signal and those of the generated speech signal. The generator loss consists of five parts: the multi-scale time-domain discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the Mel loss loss_mel, the feature-map loss of the multi-scale time-domain discriminator loss_feature_TD, and the feature-map loss of the wavelet discriminator loss_feature_WD. The multi-scale time-domain part loss_disc_TD and the wavelet part loss_disc_WD of the discriminator loss are defined as:
loss_disc_TD = Σ_{k=1}^{3} ( E_x[(TD_k(x) − 1)^2] + E_{s,z}[TD_k(G(s,z))^2] )   (6)

loss_disc_WD = E_x[(WD(x) − 1)^2] + E_{s,z}[WD(G(s,z))^2]   (7)
where x denotes the original waveform, s the acoustic features (e.g., the Mel spectrogram), z a Gaussian noise vector, and TD and WD the multi-scale time-domain discrimination network and the wavelet discrimination network, respectively. The multi-scale time-domain loss loss_gen_TD and the wavelet loss loss_gen_WD of the generator are defined as:
loss_gen_TD = Σ_{k=1}^{3} E_{s,z}[(TD_k(G(s,z)) − 1)^2]   (8)

loss_gen_WD = E_{s,z}[(WD(G(s,z)) − 1)^2]   (9)
where TD_k (k = 1, 2, 3) denotes the three time-domain discriminators of different scales, so formula (8) sums the losses they each produce. The Mel loss loss_mel quantifies the gap between the original speech waveform and the synthesized speech waveform using the multi-scale Mel spectrogram; it is defined as:
loss_mel = || MEL(x) − MEL(G(s,z)) ||_F   (10)

where ||·||_F denotes the Frobenius norm and MEL(·) denotes the Mel-spectrogram transform of a given speech signal.
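The loss bookkeeping of formulas (4), (5), and (10) can be sketched as plain functions. This is illustrative only: the individual loss components would come from the discriminator and generator networks, and only the fixed weights 45 and 2 and the Frobenius-norm Mel loss are taken from the text.

```python
import numpy as np

def discriminator_loss(disc_td, disc_wd):
    # loss_D = loss_disc_TD + loss_disc_WD                       -- formula (4)
    return disc_td + disc_wd

def generator_loss(gen_td, gen_wd, mel, feat_td, feat_wd,
                   mel_weight=45.0, feat_weight=2.0):
    # loss_G = loss_gen_TD + loss_gen_WD + 45 * loss_mel
    #          + 2 * (loss_feature_TD + loss_feature_WD)         -- formula (5)
    return gen_td + gen_wd + mel_weight * mel + feat_weight * (feat_td + feat_wd)

def mel_loss(mel_real, mel_fake):
    # loss_mel = || MEL(x) - MEL(G(s, z)) ||_F                   -- formula (10)
    return np.linalg.norm(np.asarray(mel_real) - np.asarray(mel_fake), ord='fro')
```

The heavy weighting of the Mel loss (45 vs. 1 and 2) pushes the generator toward spectral fidelity, with the adversarial and feature-map terms refining waveform realism.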
Step 4: generated-speech evaluation.
The intelligibility, naturalness, and related indexes of the reconstructed speech signal are evaluated through a subjective evaluation system and an objective evaluation system. The subjective system follows practice in the speech synthesis field: since the final consumer of synthesized speech is human, speech synthesis models commonly use the Mean Opinion Score (MOS) as the evaluation index, and this system follows the design of subjective evaluations in studies of speech quality. The objective system follows practice in human behavior recognition and signal processing. Specifically, the invention adopts three objective indexes, the peak signal-to-noise ratio (PSNR), the Mel-cepstral distortion (MCD), and the root mean square error of the fundamental frequency (F0 RMSE), to measure the difference between the synthesized speech and the reference speech signal. In addition, the invention adopts the accuracy of a dictation test (Accuracy of Dictation Test, ADT) to measure the intelligibility of the synthesized speech. To quantify the quality of the speech signal reconstructed from the acceleration sensor signal, the peak signal-to-noise ratio (PSNR) measures the ratio between the maximum possible power of the signal and the power of the noise affecting its quality:
PSNR = 10 · log10( S_peak^2 / ( (1/L) Σ_{n=1}^{L} N(n)^2 ) )   (11)

where S_peak denotes the peak value of the speech signal, N(n) the noise signal, and L its length.
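A NumPy sketch of this ratio follows. Treating the noise as the difference between the reference and synthesized waveforms is an assumption here; it is one common way to instantiate the noise term when only the two waveforms are available.

```python
import numpy as np

def psnr(reference, synthesized):
    # ratio of the peak signal power to the power of the error ("noise") signal,
    # expressed in decibels
    reference = np.asarray(reference, dtype=float)
    noise = reference - np.asarray(synthesized, dtype=float)
    peak = np.max(np.abs(reference))
    return 10.0 * np.log10(peak ** 2 / np.mean(noise ** 2))
```

Higher PSNR means the reconstruction error is small relative to the peak of the reference speech.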
To measure the distance between the original speech signal and the synthesized speech signal, the invention quantifies the difference between the Mel-frequency cepstral coefficients (MFCCs) of the two signals using the Mel-cepstral distortion (MCD). Specifically, the Mel-cepstral distortion of the k-th frame can be expressed as:
MCD(k) = (10 / ln 10) · sqrt( 2 · Σ_{i=1}^{M} ( MC_s(i,k) − MC_r(i,k) )^2 )   (12)
where s denotes the synthesized speech signal, r the original speech signal, and M the number of Mel filters (M = 10 in the present invention); MC_s(i,k) and MC_r(i,k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r. For simplicity, omitting the subscript of MC, the coefficients can be expressed as:
MC(i,k) = Σ_{n=1}^{M} X_{k,n} · cos( i · (n − 0.5) · π / M )   (13)
where X_{k,n} denotes the logarithmic power output of the n-th triangular filter, namely:
X_{k,n} = ln( Σ_m |X(k,m)|^2 · w_n(m) )   (14)
where X(k,m) denotes the Fourier-transform result of the k-th frame of the input speech at frequency index m, and w_n(m) denotes the n-th Mel filter. Clearly, the lower the Mel-cepstral distortion (MCD), the closer the synthesized audio reconstructed from the built-in acceleration sensor is to the original speech signal. The invention uses the root mean square error of the fundamental frequency (F0 RMSE) to compare the fundamental frequencies of the original and synthesized speech; the lower its value, the closer the fundamental-frequency contours of the two signals and the better the generation effect. Specifically, the root mean square error of the fundamental frequency is expressed as:
F0RMSE = sqrt( (1/T) · Σ_{t=1}^{T} ( f0(t) − f0'(t) )^2 )   (15)

where f0 denotes the fundamental-frequency contour of the original speech signal and f0' denotes that of the synthesized speech signal, over T frames.
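Both per-frame objective metrics can be sketched in NumPy. The 10/ln(10) · sqrt(2 · Σ) scaling in `mcd_frame` is the conventional MCD constant, an assumption where the patent leaves the scaling implicit; `f0_rmse` is a direct root-mean-square error over the two fundamental-frequency contours.

```python
import numpy as np

def mcd_frame(mc_s, mc_r):
    # Mel-cepstral distortion of one frame between synthesized (mc_s) and
    # reference (mc_r) MFCC vectors; the constant is the customary MCD scaling
    diff = np.asarray(mc_s, dtype=float) - np.asarray(mc_r, dtype=float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))

def f0_rmse(f0_ref, f0_syn):
    # root-mean-square error between the two fundamental-frequency contours
    d = np.asarray(f0_ref, dtype=float) - np.asarray(f0_syn, dtype=float)
    return np.sqrt(np.mean(d ** 2))
```

Both metrics are zero for a perfect reconstruction and grow with the spectral or pitch mismatch, matching the "lower is better" reading in the text.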
The invention builds a generated-speech evaluation model with the dictation test accuracy and the Mean Opinion Score as the main indexes and the peak signal-to-noise ratio, the Mel-cepstral distortion, and the fundamental-frequency root mean square error as objective indexes, thereby measuring the quality of the reconstructed speech.

Claims (3)

1. The deep learning voice reconstruction method based on the smart phone acceleration sensor is characterized by comprising the following steps of:
step 1: collecting data;
playing the audio file by using a mobile phone;
collecting signals of an acceleration sensor, and recording the signals and corresponding time stamps;
step 2: data processing; performing linear interpolation, noise processing and feature extraction on the sensor signals;
step 2-1: linear interpolation;
locating, through the time stamps, all time points lacking acceleration data and filling the missing data by linear interpolation; for acceleration data sharing the same time stamp, taking their mean value to represent the acceleration data of that time stamp, so that each time stamp has exactly one acceleration datum;
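Step 2-1 can be sketched as a small NumPy helper (hypothetical, for illustration): duplicate time stamps are averaged, and a uniform grid is then filled by linear interpolation.

```python
import numpy as np

def regularize(timestamps, values, grid):
    # 1) average samples that share a time stamp
    # 2) linearly interpolate the uniform grid points that have no sample
    ts = np.asarray(timestamps, dtype=float)
    vals = np.asarray(values, dtype=float)
    uniq = np.unique(ts)
    means = np.array([vals[ts == t].mean() for t in uniq])
    return np.interp(grid, uniq, means)
```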
step 2-2: noise processing;
removing, from the sensor signals, noise caused by gravity and by hardware using reference data recorded under silent conditions; filtering with a high-pass filter having a cut-off frequency of a Hz;
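A minimal stand-in for the a Hz high-pass step follows. The claim fixes only the cut-off frequency, not the filter design, so this brick-wall FFT filter is purely illustrative; a practical implementation would likely use an IIR or FIR design instead.

```python
import numpy as np

def highpass(x, fs, cutoff_hz=20.0):
    # zero out FFT bins below the cutoff: a simple "brick-wall" high-pass
    # (cutoff 20 Hz matches the value a = 20 given in claim 2)
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(X, n=len(x))
```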
step 2-3: feature extraction;
dividing the sensor signal into a plurality of segments with fixed overlap, the segment length and the overlap length being set to 256 and 64 respectively, windowing each segment with a Hamming window, computing the frequency spectrum by the short-time Fourier transform (STFT) to obtain an STFT matrix recording the amplitude and phase at each time and frequency, and converting into the corresponding spectrogram according to formula (1):
spectrogram{x(n)} = |STFT{x(n)}|^2   (1)
wherein x (n) represents an acceleration sensor signal, and STFT { x (n) } represents an STFT matrix corresponding to the acceleration sensor signal;
converting the spectrogram into a Mel spectrogram, the interconversion between the frequency f and the mel scale f_mel on the spectrogram corresponding to the acceleration sensor signal being realized through formulas (2) and (3):

f_mel = 2595 · log10( 1 + f / 700 )   (2)

f = 700 · ( 10^( f_mel / 2595 ) − 1 )   (3)
finally, applying the Mel filter bank to the spectrogram to obtain the feature, namely the Mel spectrogram;
step 3: reconstructing voice;
adopting WMelGAN, a wavelet-based multi-scale time-frequency-domain generative adversarial network speech reconstruction model, to convert the sensor data processed in step 2 into a synthesized speech signal;
step 3-1: the wavelet-based multi-scale time-frequency-domain generative adversarial network speech reconstruction model includes three components: a generator, a multi-scale discriminator, and a wavelet discriminator;
the generator converts the Mel spectrogram into a synthesized speech signal; it is formed by a series of upsampling transposed-convolution layers, a residual network with dilated convolution layers being arranged after each transposed-convolution layer;
the sub-discriminators of the multi-scale discriminator share a convolution-network model structure and work on different data scales to judge the generator output at each scale; the network structure of each sub-discriminator is a downsampling structure consisting, in sequence, of one 1-D convolution layer, four 1-D grouped-convolution layers, and one final 1-D convolution layer, and its input comprises the original speech signal and the synthesized speech signal produced by the generator;
the wavelet discriminator decomposes the input speech signal into four sub-band signals of different frequency bands through three levels of wavelet decomposition and evaluates the generation effect with a stacked convolutional neural network; in the WMelGAN model, the generator and the discriminators are trained adversarially until the discriminators cannot judge whether the generated audio is real or fake, and the generator is finally used to produce the final synthesized speech signal;
step 3-2: a loss function;
the adversarial training between the generator and the discriminators is performed by setting a series of loss functions, the objectives being shown in formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD   (4)

loss_G = loss_gen_TD + loss_gen_WD + 45 · loss_mel + 2 · (loss_feature_TD + loss_feature_WD)   (5)
wherein loss_D represents the overall loss function of the two discriminators and loss_G represents the loss function of the generator; the generator loss consists of five parts: the multi-scale discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the Mel loss loss_mel, the feature-map loss of the multi-scale time-domain discriminator loss_feature_TD, and the feature-map loss of the wavelet discriminator loss_feature_WD;
the overall loss of the two discriminators is divided into the multi-scale discriminator part loss_disc_TD and the wavelet discriminator part loss_disc_WD, defined as:
loss_disc_TD = Σ_{k=1}^{3} ( E_x[(TD_k(x) − 1)^2] + E_{s,z}[TD_k(G(s,z))^2] )   (6)

loss_disc_WD = E_x[(WD(x) − 1)^2] + E_{s,z}[WD(G(s,z))^2]   (7)

wherein x represents the original speech signal waveform, s represents the Mel spectrogram, z represents a Gaussian noise vector, TD and WD respectively represent the multi-scale discriminator and the wavelet discriminator, the subscript k represents the different scales, G(s,z) represents the generated speech signal, and E[·] represents the expectation;
the multi-scale discriminator loss loss_gen_TD and the wavelet discriminator loss loss_gen_WD of the generator are defined as:

loss_gen_TD = Σ_{k=1}^{3} E_{s,z}[(TD_k(G(s,z)) − 1)^2]   (8)

loss_gen_WD = E_{s,z}[(WD(G(s,z)) − 1)^2]   (9)

wherein TD_k (k = 1, 2, 3) denotes the three sub-discriminators of the multi-scale discriminator corresponding to the different scales;
the Mel loss loss_mel quantifies the difference between the original speech waveform and the synthesized speech waveform using the multi-scale Mel spectrogram; it is defined as:

loss_mel = || MEL(x) − MEL(G(s,z)) ||_F   (10)

wherein ||·||_F represents the Frobenius norm, and MEL(·) represents the Mel-spectrogram transform of a given speech signal;
step 4: generated-speech evaluation; evaluating the intelligibility and naturalness indexes of the reconstructed speech signal through a subjective evaluation system and an objective evaluation system;
step 4-1: the subjective evaluation system uses the Mean Opinion Score MOS as the evaluation index;
Step 4-2: an objective evaluation system;
step 4-2-1: measuring the difference between the synthesized speech and the original speech signal with three objective indexes: the peak signal-to-noise ratio PSNR, the Mel-cepstral distortion MCD, and the root mean square error of the fundamental frequency F0 RMSE; measuring the intelligibility of the synthesized speech with the dictation test accuracy ADT;
the peak signal-to-noise ratio PSNR measures the ratio between the maximum possible power of a signal and the noise power affecting its quality:

PSNR = 10 · log10( S_peak^2 / ( (1/L) Σ_{n=1}^{L} N(n)^2 ) )   (11)

wherein S_peak represents the peak value of the speech signal S, N represents the noise signal, and L represents the signal length;
step 4-2-2: quantifying the difference between the Mel-frequency cepstral coefficients MFCCs of the original speech signal and the synthesized speech signal using the Mel-cepstral distortion MCD; specifically, the Mel-cepstral distortion of the k-th frame is expressed as:

MCD(k) = (10 / ln 10) · sqrt( 2 · Σ_{i=1}^{M} ( MC_s(i,k) − MC_r(i,k) )^2 )   (12)

wherein s represents the synthesized speech signal, r represents the original speech signal, M is the number of Mel filters, and MC_s(i,k) and MC_r(i,k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r;
omitting the subscript of MC, the coefficients are expressed as:

MC(i,k) = Σ_{n=1}^{M} X_{k,n} · cos( i · (n − 0.5) · π / M )   (13)
wherein X_{k,n} represents the logarithmic power output of the n-th triangular filter, namely:

X_{k,n} = ln( Σ_m |X(k,m)|^2 · w_n(m) )   (14)

wherein X(k,m) represents the Fourier-transform result of the k-th frame of the input speech at frequency index m, and w_n(m) represents the n-th Mel filter;
step 4-2-3: comparing the difference between the fundamental frequency of the original speech and that of the synthesized speech using the root mean square error of the fundamental frequency F0 RMSE, expressed as:

F0RMSE = sqrt( (1/T) · Σ_{t=1}^{T} ( f0(t) − f0'(t) )^2 )   (15)

wherein f0 represents the fundamental-frequency contour of the original speech signal, f0' represents that of the synthesized speech signal, and T represents the number of frames.
2. The smart phone acceleration sensor-based deep learning speech reconstruction method according to claim 1, wherein a=20.
3. The deep learning voice reconstruction method based on the smart phone acceleration sensor according to claim 1, wherein the number of Mel filters M = 10.
CN202310387588.XA 2023-04-12 2023-04-12 Deep learning voice reconstruction method based on smart phone acceleration sensor Pending CN116386589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310387588.XA CN116386589A (en) 2023-04-12 2023-04-12 Deep learning voice reconstruction method based on smart phone acceleration sensor


Publications (1)

Publication Number Publication Date
CN116386589A true CN116386589A (en) 2023-07-04

Family

ID=86978564


Country Status (1)

Country Link
CN (1) CN116386589A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727329A (en) * 2024-02-07 2024-03-19 深圳市科荣软件股份有限公司 Multi-target monitoring method for intelligent supervision
CN117727329B (en) * 2024-02-07 2024-04-26 深圳市科荣软件股份有限公司 Multi-target monitoring method for intelligent supervision


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination