CN116386589A - A Deep Learning Speech Reconstruction Method Based on Smartphone Acceleration Sensor

Info

Publication number: CN116386589A (application CN202310387588.XA)
Authority: CN (China)
Prior art keywords: loss, mel, signal, voice, speech
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 梁韵基, 严笑凯, 王梓哲, 秦煜辰
Current and original assignee: Northwestern Polytechnical University (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Application filed by Northwestern Polytechnical University

Classifications

    • G10L13/02 Methods for producing synthetic speech; speech synthesisers
    • G06F18/10 Pattern recognition: pre-processing; data cleansing
    • G10L25/24 Speech or voice analysis characterised by the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L25/60 Speech or voice analysis specially adapted for measuring the quality of voice signals
    • Y02D30/70 Reducing energy consumption in wireless communication networks


Abstract

The invention discloses a deep learning speech reconstruction method based on a smartphone acceleration sensor. First, data are collected: motherboard vibration signals caused by the smartphone loudspeaker are captured with several motion sensors under multiple sampling-frequency modes. The sensor signals are then processed through linear interpolation, noise removal, and feature extraction. Next, speech is reconstructed: a speech reconstruction algorithm based on a wavelet multi-scale time-frequency-domain generative adversarial network converts the preprocessed motion-sensor data into speech waveform data. Finally, the generated speech is evaluated with subjective and objective metrics. The invention improves the high-frequency performance and robustness of speech synthesis, so that the synthesized speech is closer to the original speech.

Description

A Deep Learning Speech Reconstruction Method Based on Smartphone Acceleration Sensor

Technical Field

The invention belongs to the technical field of artificial intelligence, and in particular relates to a deep learning speech reconstruction method.

Background

With the advance of basic mobile-Internet technologies, the e-commerce, social-networking, and new-media industries have flourished over the past decade, and the consumer market for mobile intelligent terminals, represented by smartphones, has enjoyed nearly ten years of prosperity. According to the latest data from the statistics agency Statista, by the third quarter of 2022 the number of smartphone users worldwide had reached 6.64 billion, about 83.32% of the world's population. Growing in step with the number of smartphone users is the increasingly tight coupling between daily human life and smart devices. As an indispensable part of the modern mobile smart-device design paradigm, motion sensors carry the key responsibilities of sensing the device's external environment, identifying its motion state, and reading user interaction input, and they are widely deployed on many kinds of mobile terminals. Motion sensors such as acceleration sensors are generally mounted on the smartphone motherboard, tightly coupled with core components including the processor, loudspeaker, and microphone, and jointly serve the operation of the core system.

Inside the phone, the loudspeaker and numerous sensors are integrated on one circuit board, and this motherboard can be regarded as an efficient solid transmission medium. When the loudspeaker works, the acoustic vibration propagates through the entire motherboard, so a motion sensor placed on the same surface can capture the solid-borne vibration caused by the loudspeaker. Because the motion sensors and the loudspeaker are integrated on the same circuit board, in physical contact and in close proximity to each other, the speech signal emitted by the loudspeaker always has a significant effect on the motion sensors (such as the gyroscope and accelerometer), no matter how the smartphone is placed (on a table or in the hand). These motion sensors are sensitive to vibration, so the signals they capture always contain the acoustic vibration that the loudspeaker imparts to the motherboard.

In general, previous studies formulated the task as a classification problem and applied machine-learning solutions to build a mapping between features extracted from the non-acoustic signals and words. A large body of research has demonstrated the feasibility of recognizing digits, words, and even key phrases from sensor vibration signals. For example, a team at Zhejiang University found that the built-in acceleration sensors of smartphones released after 2018 sample at up to 500 Hz, which almost covers the entire fundamental-frequency band of adult speech (85-255 Hz). They proposed a deep-learning speech recognition system that collects the speech signal emitted by the loudspeaker through a low-privilege spy application, converts the signal into spectrograms, and uses DenseNet as the backbone network to classify the speech information (text) carried by the acceleration spectrograms. Han et al. proposed a distributed side-channel attack using vibration signals captured by a sensor network (including geophones, accelerometers, and gyroscopes); to overcome the low sampling frequency of individual sensors, they adopted a TI-ADC (time-interleaved analog-to-digital converter) arrangement that approximates a high overall sampling frequency while keeping the per-node sampling rate low. However, existing work can only recognize a few words from a very limited vocabulary.

Summary of the Invention

To overcome the deficiencies of the prior art, the present invention provides a deep learning speech reconstruction method based on the smartphone acceleration sensor. First, data are collected: several motion sensors, under multiple sampling-frequency modes, capture the motherboard vibration signals caused by the smartphone loudspeaker. The sensor signals are then processed through linear interpolation, noise removal, and feature extraction. Next, speech is reconstructed: a speech reconstruction algorithm based on a wavelet multi-scale time-frequency-domain generative adversarial network converts the preprocessed motion-sensor data into speech waveform data. Finally, the generated speech is evaluated with subjective and objective metrics. The invention improves the high-frequency performance and robustness of speech synthesis, making the synthesized speech closer to the original speech.

The technical solution adopted by the present invention to solve its technical problem comprises the following steps:

Step 1: data collection;

Play the audio file with the mobile phone;

Collect the signal of the acceleration sensor, and record the signal together with the corresponding timestamp;

Step 2: data processing; apply linear interpolation, noise processing, and feature extraction to the sensor signal;

Step 2-1: linear interpolation;

Locate, by timestamp, all time points that have no acceleration data and fill the missing samples by linear interpolation; for acceleration samples that share the same timestamp, take their mean and let the mean represent the acceleration at that timestamp, so that every timestamp carries exactly one acceleration value;
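
A minimal sketch of this preprocessing step is given below, assuming the collected CSV provides `timestamp` and `acc` columns and a nominal 500 Hz grid; the column names and the target sampling rate are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np
import pandas as pd

def regularize_acceleration(csv_path: str, fs: int = 500) -> pd.DataFrame:
    """Average duplicate timestamps, then linearly interpolate onto a uniform time grid."""
    df = pd.read_csv(csv_path)                              # columns: timestamp (s), acc
    # One sample per timestamp: average samples that share the same timestamp.
    df = df.groupby("timestamp", as_index=False)["acc"].mean()
    # Uniform grid at the nominal sampling rate; gaps are filled by linear interpolation.
    t_uniform = np.arange(df["timestamp"].iloc[0], df["timestamp"].iloc[-1], 1.0 / fs)
    acc_uniform = np.interp(t_uniform, df["timestamp"].to_numpy(), df["acc"].to_numpy())
    return pd.DataFrame({"timestamp": t_uniform, "acc": acc_uniform})
```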

Step 2-2: noise processing;

Use baseline data recorded under silent conditions to remove the noise caused by gravity and hardware factors from the sensor signal; then filter with a high-pass filter whose cut-off frequency is a Hz;
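
One way to realize this step is sketched below with a zero-phase Butterworth high-pass from SciPy; the filter order and the simple baseline-mean subtraction are assumptions made for illustration (the patent only fixes the cut-off frequency a, preferably 20 Hz).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def denoise(acc: np.ndarray, baseline: np.ndarray, fs: int = 500, a: float = 20.0) -> np.ndarray:
    """Remove the static bias estimated from a silent recording, then high-pass filter."""
    # Gravity and hardware bias estimated from the silent baseline recording.
    acc = acc - baseline.mean()
    # Zero-phase Butterworth high-pass with cut-off a Hz (order 4 is an assumption).
    b, a_coef = butter(N=4, Wn=a, btype="highpass", fs=fs)
    return filtfilt(b, a_coef, acc)
```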

Step 2-3: feature extraction;

Divide the sensor signal into multiple segments with a fixed overlap, the segment length and overlap length being set to 256 and 64 respectively; window each segment with a Hamming window; compute the spectrum with the short-time Fourier transform (STFT) to obtain the STFT matrix, which records the magnitude and phase at every time and frequency; and convert it into the corresponding spectrogram according to formula (1):

spectrogram{x(n)} = |STFT{x(n)}|^2    (1)

where x(n) denotes the acceleration-sensor signal and STFT{x(n)} denotes the STFT matrix corresponding to the acceleration-sensor signal;

Convert the spectrogram into a mel spectrogram; the conversion between the frequency f on the spectrogram of the acceleration-sensor signal and the mel scale f_mel is realized through formulas (2) and (3):

f_mel = 2595 * log10(1 + f/700)    (2)

f = 700 * (10^(f_mel/2595) - 1)    (3)

Finally, formula (2) is used to convert the spectrogram into a mel spectrogram, yielding the feature: the mel spectrogram;
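
The feature extraction can be sketched as follows with librosa, using the segment length 256, overlap 64, and Hamming window stated above; the number of mel bands (32 here) and the log-mel output are assumptions not fixed by the patent.

```python
import numpy as np
import librosa

def mel_spectrogram(acc: np.ndarray, fs: int = 500, n_mels: int = 32) -> np.ndarray:
    """STFT power spectrogram as in formula (1), followed by a mel filter bank."""
    n_fft, overlap = 256, 64
    hop = n_fft - overlap                           # consecutive frames overlap by 64 samples
    stft = librosa.stft(acc.astype(np.float32), n_fft=n_fft, hop_length=hop,
                        win_length=n_fft, window="hamming")
    power = np.abs(stft) ** 2                       # spectrogram{x(n)} = |STFT{x(n)}|^2
    mel_fb = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels)
    return np.log(np.maximum(mel_fb @ power, 1e-10))  # log-mel spectrogram
```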

Step 3: speech reconstruction;

The speech reconstruction model WMelGAN, a wavelet-based multi-scale time-frequency-domain generative adversarial network, is used to convert the sensor data processed in Step 2 into a synthetic speech signal;

Step 3-1: the wavelet-based multi-scale time-frequency-domain generative adversarial speech reconstruction model comprises three components: a generator, a multi-scale discriminator, and a wavelet discriminator;

The generator converts the mel spectrogram into a synthetic speech signal; it is composed of a series of upsampling transposed convolutional layers, and each transposed convolutional layer is followed by a residual network containing dilated convolutional layers;
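
A minimal PyTorch sketch of one generator stage in the spirit of this description (transposed-convolution upsampling followed by a dilated-convolution residual block) is shown below; all channel counts, kernel sizes, strides, and dilation rates are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block of dilated 1-D convolutions; dilation enlarges the receptive field."""
    def __init__(self, channels: int, dilations=(1, 3, 9)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(nn.LeakyReLU(0.2),
                          nn.Conv1d(channels, channels, kernel_size=3,
                                    dilation=d, padding=d))
            for d in dilations])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(x)                 # residual connection
        return x

class UpsampleStage(nn.Module):
    """One generator stage: transposed-convolution upsampling + dilated residual block."""
    def __init__(self, in_ch: int, out_ch: int, stride: int):
        super().__init__()
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2 * stride,
                                     stride=stride, padding=stride // 2)
        self.res = ResBlock(out_ch)

    def forward(self, x):                   # x: (batch, channels, frames)
        return self.res(self.up(x))
```

A full generator would stack several such stages so that the total upsampling factor matches the STFT hop length.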

The sub-discriminators of the multi-scale discriminator share a convolution-based model structure and work on different data scales, judging the generator output at each scale; the network of each sub-discriminator is a downsampling structure consisting in sequence of one 1-D convolutional layer, four grouped convolutional layers, and one further 1-D convolutional layer, and its input comprises the original speech signal and the synthetic speech signal produced by the generator;

The wavelet discriminator applies three levels of wavelet decomposition to split the input speech signal into four sub-signals in different frequency bands and uses stacked convolutional neural networks to evaluate the generated output; in the WMelGAN model, the generator and the several discriminators are trained adversarially until the discriminators can no longer tell the generated audio from real audio, and the generator is finally used to produce the synthetic speech signal;
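
The front end of such a wavelet discriminator could be realized as in the sketch below with PyWavelets: a three-level discrete wavelet decomposition yields exactly four sub-band signals. The choice of the 'db4' mother wavelet is an assumption; the patent does not name a specific wavelet.

```python
import numpy as np
import pywt

def wavelet_subbands(wave: np.ndarray, wavelet: str = "db4"):
    """Three-level discrete wavelet decomposition of a waveform into four sub-band signals.

    Returns [cA3, cD3, cD2, cD1]: the level-3 approximation plus the detail
    coefficients of levels 3, 2, and 1, covering progressively higher frequency bands.
    """
    return pywt.wavedec(wave, wavelet=wavelet, level=3)
```

Each of the four sub-band signals would then be fed to its own stack of convolutional layers inside the discriminator.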

Step 3-2: loss functions;

Adversarial training between the generator and the discriminators is carried out by setting a series of loss functions, whose objectives are shown in formulas (4) and (5):

loss_D = loss_disc_TD + loss_disc_WD    (4)

loss_G = loss_gen_TD + loss_gen_WD + loss_mel*45 + (loss_feature_TD + loss_feature_WD)*2    (5)

where loss_D denotes the overall loss function of the two discriminators and loss_G denotes the loss function of the generator; the generator loss is composed of five parts: the multi-scale discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the multi-scale time-domain discriminator feature-map loss loss_feature_TD, and the wavelet discriminator feature-map loss loss_feature_WD;

The overall loss of the two discriminators is split into the multi-scale discriminator loss loss_disc_TD and the wavelet discriminator loss loss_disc_WD, defined as:

[Formula (6), image not reproduced: the adversarial loss loss_disc_TD of the multi-scale sub-discriminators TD_k, evaluated on the real waveform x and the generated waveform G(s, z)]

[Formula (7), image not reproduced: the adversarial loss loss_disc_WD of the wavelet discriminator WD, evaluated on x and G(s, z)]

where x denotes the original speech waveform, s denotes the mel spectrogram, z denotes the Gaussian noise vector, TD and WD denote the multi-scale discriminator and the wavelet discriminator respectively, the subscript k indicates the scale, G(s, z) denotes the generated speech signal, and E[·] denotes the expectation;

The generator's multi-scale discriminator loss loss_gen_TD and the generator's wavelet discriminator loss loss_gen_WD are defined as:

[Formula (8), image not reproduced: the generator's adversarial loss loss_gen_TD, computed from the multi-scale sub-discriminators' scores of the generated waveform G(s, z)]

[Formula (9), image not reproduced: the generator's adversarial loss loss_gen_WD, computed from the wavelet discriminator's score of G(s, z)]

where TD_1, TD_2, and TD_3 denote the three multi-scale sub-discriminators operating at different scales;

The mel loss loss_mel uses multi-scale mel spectrograms to quantify the gap between the original speech waveform and the synthesized speech waveform; it is defined as:

loss_mel = ||MEL(x) - MEL(G(s, z))||_F    (10)

where ||·||_F denotes the Frobenius norm and MEL(·) denotes the mel-spectrogram transform of a given speech signal;
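
The weighting of formula (5) can be assembled as in the hedged sketch below; the adversarial and feature-matching terms are passed in as already-computed tensors, since their exact functional forms are given only as formula images in the original document.

```python
import torch

def generator_total_loss(loss_gen_TD: torch.Tensor,
                         loss_gen_WD: torch.Tensor,
                         loss_feature_TD: torch.Tensor,
                         loss_feature_WD: torch.Tensor,
                         mel_real: torch.Tensor,
                         mel_fake: torch.Tensor) -> torch.Tensor:
    """Combine the five generator loss terms with the weights of formula (5)."""
    loss_mel = torch.norm(mel_real - mel_fake, p="fro")     # formula (10), Frobenius norm
    return (loss_gen_TD + loss_gen_WD
            + 45.0 * loss_mel
            + 2.0 * (loss_feature_TD + loss_feature_WD))
```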

Step 4: generated-speech evaluation system; the intelligibility and naturalness of the reconstructed speech signal are assessed through a subjective evaluation system and an objective evaluation system;

Step 4-1: the subjective evaluation system uses the mean opinion score (MOS) as the evaluation index;

Step 4-2: objective evaluation system;

Step 4-2-1: three objective metrics, namely the peak signal-to-noise ratio (PSNR), the mel-cepstral distortion (MCD), and the root-mean-square error of the fundamental frequency (F0 RMSE), are used to measure the difference between the synthesized speech and the original speech signal; the accuracy of a dictation test (ADT) is used to measure the intelligibility of the synthesized speech;

The peak signal-to-noise ratio (PSNR) measures the ratio between the maximum possible power of a signal and the power of the noise that affects its quality:

[Formula (11), image not reproduced: the PSNR, expressed in terms of the peak speech signal S_peak, the speech signal S, and the noise signal N]

where S_peak denotes the peak speech signal, S denotes the speech signal, and N denotes the noise signal;
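
Because the formula image is not reproduced above, the sketch below uses one common reading of this definition (peak power over the mean power of the residual noise, in decibels); it should be treated as an assumption rather than the patent's exact expression.

```python
import numpy as np

def psnr(reference: np.ndarray, synthesized: np.ndarray) -> float:
    """PSNR in dB: peak power of the reference over the mean power of the residual."""
    noise = reference - synthesized                 # residual treated as the noise signal N
    s_peak = np.max(np.abs(reference))              # peak speech amplitude S_peak
    return 10.0 * np.log10(s_peak ** 2 / (np.mean(noise ** 2) + 1e-12))
```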

Step 4-2-2: the mel-cepstral distortion (MCD) is used to quantify the gap between the mel-frequency cepstral coefficients (MFCCs) of the original speech signal and of the synthesized speech signal; specifically, the mel-cepstral distortion of the k-th frame is expressed as:

MCD(k) = (10 / ln10) * sqrt( 2 * sum_{i=1..M} ( MC_s(i, k) - MC_r(i, k) )^2 )    (12)

where r denotes the original speech signal, M is the number of mel filters, and MC_s(i, k) and MC_r(i, k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r;

Omitting the subscript of MC, it is expressed as:

MC(i, k) = sum_{n=1..M} X_{k,n} * cos( i * (n - 0.5) * π / M )    (13)

where X_{k,n} denotes the logarithmic power output of the n-th triangular filter, namely:

X_{k,n} = ln( sum_m |X(k, m)|^2 * w_n(m) )    (14)

where X(k, m) denotes the Fourier transform of the k-th frame of the input speech at frequency index m, and w_n(m) denotes the n-th mel filter;
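
A per-frame MCD computation under the convention reconstructed in formula (12) might look as follows; how the MFCC matrices are extracted and time-aligned is left as an implementation assumption.

```python
import numpy as np

def mcd_per_frame(mfcc_ref: np.ndarray, mfcc_syn: np.ndarray) -> np.ndarray:
    """Mel-cepstral distortion per frame.

    mfcc_ref, mfcc_syn: arrays of shape (M, num_frames) holding MC_r(i, k) and MC_s(i, k).
    """
    diff_sq = np.sum((mfcc_syn - mfcc_ref) ** 2, axis=0)        # sum over i = 1..M
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)       # formula (12)
```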

Step 4-2-3: the root-mean-square error of the fundamental frequency (F0 RMSE) is used to compare the fundamental frequency of the original speech with that of the synthesized speech, expressed as:

F0 RMSE = sqrt( (1/T) * sum_{t=1..T} ( f0(t) - f̂0(t) )^2 )    (15)

where f0 denotes the fundamental-frequency contour of the original speech signal, f̂0 denotes the fundamental-frequency contour of the synthesized speech signal, and T is the number of frames over which the contours are compared.
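
A sketch of this metric is given below; the use of librosa's pyin pitch tracker, the search range, and the restriction to voiced frames are implementation assumptions, not requirements of the patent.

```python
import numpy as np
import librosa

def f0_rmse(ref_wave: np.ndarray, syn_wave: np.ndarray, sr: int = 16000) -> float:
    """Root-mean-square error between the two fundamental-frequency contours."""
    f0_ref, _, _ = librosa.pyin(ref_wave, fmin=65.0, fmax=400.0, sr=sr)
    f0_syn, _, _ = librosa.pyin(syn_wave, fmin=65.0, fmax=400.0, sr=sr)
    n = min(len(f0_ref), len(f0_syn))
    f0_ref, f0_syn = f0_ref[:n], f0_syn[:n]
    voiced = ~np.isnan(f0_ref) & ~np.isnan(f0_syn)              # compare voiced frames only
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2)))
```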

Preferably, a = 20.

Preferably, the number of mel filters M = 10.

The beneficial effects of the present invention are as follows:

The data collection of the present invention can acquire motion-sensor recordings for many audio files, in batches, under multiple sampling-frequency modes. The data processing realizes a speaker-independent, general sensor-based speech synthesis framework, reducing the dependence on speaker-specific datasets. The speech reconstruction improves the high-frequency performance and robustness of speech synthesis by optimizing the model architecture and introducing a wavelet discriminator, making the synthesized speech closer to the original speech. The generated-speech evaluation demonstrates that smartphone motion sensors are capable of perceiving and recovering the speech played by the loudspeaker.

Brief Description of the Drawings

Fig. 1 is a schematic framework diagram of the deep learning speech reconstruction method based on the smartphone acceleration sensor of the present invention.

Fig. 2 is a schematic diagram of the speech reconstruction algorithm based on the wavelet multi-scale time-frequency-domain generative adversarial network of the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and embodiments.

The purpose of the present invention is to provide a speech reconstruction method based on the smartphone acceleration sensor, aiming to address the privacy and security risks hidden in motion-sensor data in the prior art and to improve the quality and efficiency of speech synthesis.

The present invention performs speech reconstruction from the signal of the smartphone's built-in acceleration sensor, establishes the intrinsic correlation between the state of the built-in acceleration sensor and the phone loudspeaker, builds a propagation model of the vibration signal between the acceleration sensor and the loudspeaker, and thereby reconstructs the original speech signal from the loudspeaker-induced accelerometer vibration.

As shown in Fig. 1, the present invention adopts the following technical scheme:

One: data collection. A self-developed Android APP plays audio files and, as required, several motion sensors and several sampling-frequency modes are selected to capture the motherboard vibration signals caused by the smartphone loudspeaker. The benefit is that motion-sensor recordings for many audio files under multiple sampling-frequency modes can be acquired in batches.

Two: data processing. Multiple digital-signal-processing schemes are integrated to preprocess the motion-sensor signals, analyze and remove noise, and extract the data feature (the mel spectrogram); the noise in the acceleration-sensor signal is modeled and the noise component is separated from the sensed data. The benefit is a speaker-independent, general sensor-based speech synthesis framework that reduces the dependence on speaker-specific datasets.

Three: speech reconstruction. As shown in Fig. 2, a speech reconstruction algorithm based on a wavelet multi-scale time-frequency-domain generative adversarial network is proposed. This sensor-to-speech mapping model, designed on a generative adversarial network structure, converts the preprocessed motion-sensor data into speech waveform data, uses the adversarial network to learn the mapping function between the denoised sensor data and the original speech data, and introduces a wavelet discriminator to improve the high-frequency quality of the synthesized speech. The benefit is that, by optimizing the model architecture and introducing the wavelet discriminator, the high-frequency performance and robustness of speech synthesis are improved, making the synthesized speech closer to the original speech.

Four: generated-speech evaluation system. Indicators such as the intelligibility and naturalness of the reconstructed speech signal are assessed through a subjective evaluation system and an objective evaluation system. The benefit is a demonstration that smartphone motion sensors are capable of perceiving and recovering the speech played by the loudspeaker.

Specific embodiment:

The specific steps of the present invention are as follows:

Step 1: data collection.

The self-developed Android APP SensorListener plays audio files; as required, several motion sensors and several sampling-frequency modes are selected to capture the motherboard vibration signals caused by the smartphone loudspeaker, and the recordings are saved locally. Specifically, the application obtains all audio files stored locally on the phone through the interface provided by Android and lists them on the application home page. The user selects on the main page the audio for which sensor data should be collected; after the user taps start, the application plays the audio and records the sensor signals. For each piece of audio, the application obtains the linear-acceleration sensor, acceleration sensor, and gyroscope sensor objects and registers them as listeners (Listener), then creates a media player object (MediaPlayer) and starts playing the audio. During playback, the sensors record every vibration sample together with the corresponding timestamp. When the audio finishes, MediaPlayer returns an end signal, the listener registrations of the motion-sensor objects are cancelled, and the acquisition of sensor signals ends. Finally, the linear-acceleration, acceleration, and gyroscope signals are written to CSV files named after the audio file. The user can select from several sampling-frequency modes to capture the output signals of the motion sensors.

Step 2: data processing.

Multiple digital-signal-processing schemes are integrated to preprocess the motion-sensor signals, analyze and remove noise, and extract the data feature (the mel spectrogram). (1) Linear interpolation: the present invention locates, by timestamp, all time points without acceleration data and fills the missing samples by linear interpolation; acceleration samples sharing the same timestamp are averaged, and the mean represents the acceleration at that timestamp, so that every timestamp carries exactly one acceleration value. (2) Noise processing: baseline data recorded under silent conditions are used to remove the noise caused by gravity and hardware factors from the sensor signal, and a high-pass filter with a 50 Hz cut-off removes the influence of human activity. (3) Feature extraction: the present invention divides the sensor signal into multiple short segments with a fixed overlap, the segment length and overlap being set to 256 and 64 respectively; each segment is windowed with a Hamming window, and its spectrum is computed with the short-time Fourier transform (STFT) to obtain the STFT matrix. The matrix records the magnitude and phase at every time and frequency, and according to the following formula:

spectrogram{x(n)} = |STFT{x(n)}|^2    (1)

it is converted into the corresponding spectrogram, where x(n) denotes the acceleration-sensor signal and STFT{x(n)} denotes the STFT matrix corresponding to the acceleration-sensor signal. The horizontal axis x of the spectrogram is time, the vertical axis y is frequency, and the value at (x, y) is the magnitude of frequency y at time x. To make the frequency axis of the spectrogram match the roughly logarithmic frequency perception of the human ear, the present invention converts the spectrogram into a mel spectrogram. The conversion between the frequency (Hz) on the spectrogram of the acceleration-sensor signal and the mel scale (mel) is realized with the following two formulas:

f_mel = 2595 * log10(1 + f/700)    (2)

f = 700 * (10^(f_mel/2595) - 1)    (3)

Finally, the mel filter bank is applied to obtain the feature: the mel spectrogram.

Step 3: speech reconstruction.

The sensor-to-speech mapping model, designed on a generative adversarial network structure, converts the preprocessed motion-sensor data into speech waveform data. As shown in Fig. 2, the WMelGAN model built by the present invention on the idea of generative adversarial networks contains three core components: a generator, a multi-scale discriminator, and a wavelet discriminator. The generator converts the mel spectrogram into an intelligible speech audio waveform: a series of upsampling transposed convolutional networks converts the mel-spectrogram sequence into a speech waveform containing far more samples, and each transposed convolutional module is followed by a residual network with dilated convolutional layers to obtain a larger receptive field; the wider receptive field helps capture cross-position correlations within long vectors and helps learn acoustic structural features. The sub-discriminators of the multi-scale discriminator have similar convolution-based model structures and work on different data scales to judge how well the generator fits at each scale. Each time-domain sub-discriminator is a downsampling structure with one 1-D convolutional layer at the front, four grouped convolutional layers, and one 1-D convolutional layer at the end; its input consists of the original speech signal and the speech signal produced by the generator network. The wavelet discriminator applies three levels of wavelet decomposition to split the input speech signal into four sub-signals in different frequency bands and uses stacked convolutional neural networks to evaluate the generated output. In the WMelGAN framework, the generator and the several discriminators are trained adversarially until the discriminator group can no longer tell the generated audio from real audio, and the generator is finally used to produce high-quality, highly intelligible speech signals.

The speech reconstruction algorithm proposed by the present invention, based on the wavelet multi-scale time-frequency-domain generative adversarial network, carries out adversarial training between the generator and the discriminators by setting a series of loss functions, whose objectives are shown in formulas (4) and (5):

loss_D = loss_disc_TD + loss_disc_WD    (4)

loss_G = loss_gen_TD + loss_gen_WD + loss_mel*45 + (loss_feature_TD + loss_feature_WD)*2    (5)

where loss_D denotes the loss function of the discriminator networks and loss_G denotes the loss function of the generator network. The feature-map losses are computed from the features output by the multi-scale time-domain discriminator network and the wavelet discriminator network; they minimize the L1 distance between the discriminator feature maps of the original speech signal and of the generated speech signal. The generator loss consists of five parts: the multi-scale time-domain discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the multi-scale time-domain discriminator feature-map loss loss_feature_TD, and the wavelet discriminator feature-map loss loss_feature_WD. The multi-scale time-domain discriminator loss loss_disc_TD and the wavelet discriminator loss loss_disc_WD are defined as:

[Formula (6), image not reproduced: the adversarial loss loss_disc_TD of the multi-scale time-domain sub-discriminators TD_k, evaluated on the real waveform x and the generated waveform G(s, z)]

[Formula (7), image not reproduced: the adversarial loss loss_disc_WD of the wavelet discriminator WD, evaluated on x and G(s, z)]

where x denotes the original waveform, s denotes the acoustic feature (e.g., the mel spectrogram), z denotes the Gaussian noise vector, and TD and WD denote the multi-scale time-domain discriminator network and the wavelet discriminator network respectively. The generator's multi-scale time-domain discriminator loss loss_gen_TD and the generator's wavelet discriminator loss loss_gen_WD are defined as:

[Formula (8), image not reproduced: the generator's adversarial loss loss_gen_TD, computed from the multi-scale time-domain sub-discriminators' scores of the generated waveform G(s, z)]

[Formula (9), image not reproduced: the generator's adversarial loss loss_gen_WD, computed from the wavelet discriminator's score of G(s, z)]

where TD_1, TD_2, and TD_3 denote the losses produced by the three time-domain sub-discriminators operating at different scales. The mel loss loss_mel uses multi-scale mel spectrograms to quantify the gap between the original speech waveform and the synthesized speech waveform; it is defined as:

loss_mel = ||MEL(x) - MEL(G(s, z))||_F    (10)

where ||·||_F denotes the Frobenius norm and MEL(·) denotes the mel-spectrogram transform of a given speech signal.

Step 4: generated-speech evaluation.

Indicators such as the intelligibility and naturalness of the reconstructed speech signal are assessed through a subjective evaluation system and an objective evaluation system. The subjective evaluation system is borrowed from the field of speech synthesis: since the ultimate consumers of synthesized speech are humans, speech synthesis models generally use the mean opinion score (MOS) as the evaluation metric, and the present evaluation system adopts it for the subjective evaluation of speech quality. The objective evaluation system is borrowed from the fields of human activity recognition and signal processing. Specifically, the present invention uses three objective metrics, namely the peak signal-to-noise ratio (PSNR), the mel-cepstral distortion (MCD), and the root-mean-square error of the fundamental frequency (F0 RMSE), to measure the difference between the synthesized speech and the reference speech signal. In addition, the present invention uses the accuracy of a dictation test (ADT) to measure the intelligibility of the synthesized speech. To quantify the quality of the synthetic speech signal reconstructed from the acceleration-sensor signal, the peak signal-to-noise ratio (PSNR) measures the ratio between the maximum possible power of the signal and the power of the noise that affects its quality:

[Formula (11), image not reproduced: the PSNR, expressed in terms of the peak speech signal, the speech signal, and the noise signal]

To measure the distance between the original speech signal and the synthesized speech signal, the present invention uses the mel-cepstral distortion (MCD) to quantify the gap between the mel-frequency cepstral coefficients (MFCCs) of these two signals. Specifically, the mel-cepstral distortion of the k-th frame can be expressed as:

MCD(k) = (10 / ln10) * sqrt( 2 * sum_{i=1..M} ( MC_s(i, k) - MC_r(i, k) )^2 )    (12)

where s denotes the synthesized speech signal, r denotes the original speech signal, and M is the number of mel filters (M = 10 in the present invention); MC_s(i, k) and MC_r(i, k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r. For simplicity, omitting the subscript of MC, it can be expressed as:

MC(i, k) = sum_{n=1..M} X_{k,n} * cos( i * (n - 0.5) * π / M )    (13)

where X_{k,n} denotes the logarithmic power output of the n-th triangular filter, namely:

X_{k,n} = ln( sum_m |X(k, m)|^2 * w_n(m) )    (14)

where X(k, m) denotes the Fourier transform of the k-th frame of the input speech at frequency index m, and w_n(m) denotes the n-th mel filter. Clearly, the lower the mel-cepstral distortion (MCD), the closer the synthetic audio reconstructed from the built-in acceleration sensor is to the original speech signal. The present invention uses the root-mean-square error of the fundamental frequency (F0 RMSE) to compare the fundamental frequency of the original speech with that of the synthesized speech; the lower this value, the closer the fundamental-frequency contours of the two speech signals and the better the generation. Specifically, the root-mean-square error of the fundamental frequency is expressed as:

F0 RMSE = sqrt( (1/T) * sum_{t=1..T} ( f0(t) - f̂0(t) )^2 )    (15)

where f0 denotes the fundamental-frequency contour of the original speech signal, f̂0 denotes the fundamental-frequency contour of the synthesized speech signal, and T is the number of frames over which the contours are compared.

The present invention uses the dictation-test accuracy and the mean opinion score as subjective metrics, and the peak signal-to-noise ratio, the mel-cepstral distortion, and the root-mean-square error of the fundamental frequency as objective metrics, to build the generated-speech evaluation model and thereby measure the quality of the reconstructed speech.

Claims (3)

1. A deep learning speech reconstruction method based on a smartphone acceleration sensor, characterized by comprising the following steps:
step 1: collecting data;
playing an audio file with a mobile phone;
collecting signals of an acceleration sensor, and recording the signals and the corresponding timestamps;
step 2: data processing; performing linear interpolation, noise processing and feature extraction on the sensor signals;
step 2-1: linear interpolation;
locating, by timestamp, all time points without acceleration data, and filling the missing data using linear interpolation; for acceleration data with the same timestamp, taking the mean, the mean representing the acceleration data of that timestamp, so that each timestamp has one and only one acceleration value;
step 2-2: noise processing;
removing, from the sensor signals, the noise caused by gravity factors and hardware factors by using reference data recorded under a silent condition; filtering with a high-pass filter having a cut-off frequency of a Hz;
step 2-3: feature extraction;
dividing the sensor signal into a plurality of segments with fixed overlap, the segment length and the overlap length being set to 256 and 64 respectively; windowing each segment with a Hamming window; computing the spectrum by the short-time Fourier transform (STFT) to obtain an STFT matrix that records the amplitude and phase at each time and frequency; and converting it into the corresponding spectrogram according to formula (1):
spectrogram{x(n)} = |STFT{x(n)}|^2    (1)
wherein x(n) represents the acceleration-sensor signal, and STFT{x(n)} represents the STFT matrix corresponding to the acceleration-sensor signal;
converting the spectrogram into a mel spectrogram, the conversion between the frequency f on the spectrogram corresponding to the acceleration-sensor signal and the mel scale f_mel being realized through formulas (2) and (3):
f_mel = 2595 * log10(1 + f/700)    (2)
f = 700 * (10^(f_mel/2595) - 1)    (3)
finally, converting the spectrogram into a mel spectrogram by using formula (2) to obtain the feature, the mel spectrogram;
step 3: reconstructing speech;
converting the sensor data processed in step 2 into a synthetic speech signal by adopting WMelGAN, a speech reconstruction model based on a wavelet multi-scale time-frequency-domain generative adversarial network;
step 3-1: the speech reconstruction model of the wavelet-based multi-scale time-frequency-domain generative adversarial network includes three components: a generator, a multi-scale discriminator, and a wavelet discriminator;
the generator converts the mel spectrogram into a synthetic speech signal; the generator is formed by a series of upsampling transposed convolutional layers, and a residual network with dilated convolutional layers is arranged after each transposed convolutional layer;
the different sub-discriminators of the multi-scale discriminator have convolution-network-based model structures and work on different data scales to judge the generator outputs at different scales; the network structure of each sub-discriminator is a downsampling structure consisting in sequence of one 1-D convolutional layer, four grouped convolutional layers, and one further 1-D convolutional layer, and its input comprises the original speech signal and the synthetic speech signal generated by the generator network;
the wavelet discriminator decomposes the input speech signal into four sub-signals in different frequency bands through three levels of wavelet decomposition, and evaluates the generation quality with stacked convolutional neural networks; in the WMelGAN model, the generator and the plurality of discriminators are trained in an adversarial manner, so that the audio generated by the generator reaches the point where the discriminators cannot tell real from fake, and finally the generator is used to produce the final synthetic speech signal;
step 3-2: loss functions;
the adversarial training between the generator and the discriminators is performed by setting a series of loss functions, the objectives being as shown in formulas (4) and (5):
loss_D = loss_disc_TD + loss_disc_WD    (4)
loss_G = loss_gen_TD + loss_gen_WD + loss_mel*45 + (loss_feature_TD + loss_feature_WD)*2    (5)
wherein loss_D represents the overall loss function of the two discriminators, and loss_G represents the loss function of the generator; the generator loss is composed of five parts, namely the multi-scale discriminator loss loss_gen_TD, the wavelet discriminator loss loss_gen_WD, the mel loss loss_mel, the multi-scale time-domain discriminator feature-map loss loss_feature_TD, and the wavelet discriminator feature-map loss loss_feature_WD;
the overall loss of the two discriminators is divided into the multi-scale discriminator loss loss_disc_TD and the wavelet discriminator loss loss_disc_WD, defined as:
[Formula (6), image not reproduced: the adversarial loss loss_disc_TD of the multi-scale sub-discriminators TD_k, evaluated on the real waveform x and the generated waveform G(s, z)]
[Formula (7), image not reproduced: the adversarial loss loss_disc_WD of the wavelet discriminator WD, evaluated on x and G(s, z)]
wherein x represents the original speech waveform, s represents the mel spectrogram, z represents a Gaussian noise vector, TD and WD respectively represent the multi-scale discriminator and the wavelet discriminator, and the subscript k represents the different scales; G(s, z) represents the generated speech signal; E[·] represents the expectation;
the generator's multi-scale discriminator loss loss_gen_TD and the generator's wavelet discriminator loss loss_gen_WD are defined as:
[Formula (8), image not reproduced: the generator's adversarial loss loss_gen_TD, computed from the multi-scale sub-discriminators' scores of the generated waveform G(s, z)]
[Formula (9), image not reproduced: the generator's adversarial loss loss_gen_WD, computed from the wavelet discriminator's score of G(s, z)]
wherein TD_1, TD_2 and TD_3 are the three multi-scale sub-discriminators corresponding respectively to different scales;
the mel loss loss_mel quantifies, using multi-scale mel spectrograms, the gap between the original speech waveform and the synthesized speech waveform, and is defined as:
loss_mel = ||MEL(x) - MEL(G(s, z))||_F    (10)
wherein ||·||_F represents the Frobenius norm, and MEL(·) represents the mel-spectrogram transform of a given speech signal;
step 4: a generated-speech evaluation system; evaluating the intelligibility and naturalness indicators of the reconstructed speech signal through a subjective evaluation system and an objective evaluation system;
step 4-1: the subjective evaluation system uses the mean opinion score (MOS) as the evaluation index;
step 4-2: an objective evaluation system;
step 4-2-1: measuring the difference between the synthesized speech and the original speech signal by adopting three objective indexes, namely the peak signal-to-noise ratio (PSNR), the mel-cepstral distortion (MCD), and the root-mean-square error of the fundamental frequency (F0 RMSE); measuring the intelligibility of the synthesized speech by adopting the dictation-test accuracy (ADT);
the peak signal-to-noise ratio (PSNR) measures the ratio between the maximum possible power of a signal and the power of the noise affecting its quality:
[Formula (11), image not reproduced: the PSNR, expressed in terms of the peak speech signal S_peak, the speech signal S, and the noise signal N]
wherein S_peak represents the peak speech signal, S represents the speech signal, and N represents the noise signal;
step 4-2-2: quantifying the difference between the mel-frequency cepstral coefficients (MFCCs) of the original speech signal and of the synthesized speech signal by using the mel-cepstral distortion (MCD); specifically, the mel-cepstral distortion of the k-th frame is expressed as:
MCD(k) = (10 / ln10) * sqrt( 2 * sum_{i=1..M} ( MC_s(i, k) - MC_r(i, k) )^2 )    (12)
wherein r represents the original speech signal, M is the number of mel filters, and MC_s(i, k) and MC_r(i, k) are the MFCC coefficients of the synthesized speech signal s and the original speech signal r;
omitting the subscript of MC, it is expressed as:
MC(i, k) = sum_{n=1..M} X_{k,n} * cos( i * (n - 0.5) * π / M )    (13)
wherein X_{k,n} represents the logarithmic power output of the n-th triangular filter, namely:
X_{k,n} = ln( sum_m |X(k, m)|^2 * w_n(m) )    (14)
wherein X(k, m) represents the Fourier transform result of the k-th frame of the input speech at frequency index m, and w_n(m) represents the n-th mel filter;
step 4-2-3: comparing the difference between the fundamental frequency of the original speech and that of the synthesized speech using the root-mean-square error of the fundamental frequency (F0 RMSE), expressed as:
F0 RMSE = sqrt( (1/T) * sum_{t=1..T} ( f0(t) - f̂0(t) )^2 )    (15)
wherein f0 represents the fundamental-frequency feature of the original speech signal, f̂0 represents the fundamental-frequency feature of the synthesized speech signal, and T is the number of frames over which they are compared.
2. The deep learning speech reconstruction method based on a smartphone acceleration sensor according to claim 1, wherein a = 20.
3. The deep learning speech reconstruction method based on a smartphone acceleration sensor according to claim 1, wherein the number of mel filters M = 10.
CN202310387588.XA 2023-04-12 2023-04-12 A Deep Learning Speech Reconstruction Method Based on Smartphone Acceleration Sensor Pending CN116386589A (en)

Priority Applications (1)

Application number: CN202310387588.XA; priority date: 2023-04-12; filing date: 2023-04-12; title: A Deep Learning Speech Reconstruction Method Based on Smartphone Acceleration Sensor

Publications (1)

Publication number: CN116386589A; publication date: 2023-07-04

Family

ID=86978564

Country Status (1)

CN: CN116386589A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727329A (en) * 2024-02-07 2024-03-19 深圳市科荣软件股份有限公司 Multi-target monitoring method for intelligent supervision
CN117727329B (en) * 2024-02-07 2024-04-26 深圳市科荣软件股份有限公司 Multi-target monitoring method for intelligent supervision
CN119649813A (en) * 2025-02-17 2025-03-18 苏州大学 Method and system for restoring speech from facial movements on mobile phones based on deep learning
CN119649814A (en) * 2025-02-17 2025-03-18 苏州大学 Mobile phone facial action voice recovery system based on convolution and attention mechanism


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination