CN111739544B

CN111739544B - Voice processing method, device, electronic equipment and storage medium

Info

Publication number: CN111739544B
Application number: CN201910227101.5A
Authority: CN
Inventors: 陈岩
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2019-03-25
Filing date: 2019-03-25
Publication date: 2023-10-20
Anticipated expiration: 2039-03-25
Also published as: CN111739544A

Abstract

The present disclosure provides a voice processing method, a voice processing device, an electronic device, and a computer-readable storage medium, and relates to the technical field of audio processing. The voice processing method includes: receiving a voice signal acquired and sent by an audio collection device; performing, on the time-domain signal corresponding to the voice signal, pitch-modification processing for adjusting the sampling frequency, to obtain a pitch-modified voice signal; and performing playback-time preservation on the time-domain signal corresponding to the pitch-modified voice signal to obtain a target voice signal, wherein the playback time of the pitch-modified voice signal is the same as the playback time of the voice signal. The present disclosure enables fast and accurate voice pitch modification.

Description

Speech processing method, device, electronic equipment and storage medium

Technical field

The present disclosure relates to the field of audio processing technology, and specifically to a speech processing method, a speech processing device, an electronic device, and a computer-readable storage medium.

Background

Pitch modification is a very important function in audio processing. In the related art, pitch modification methods mainly include the following: changing the playback sampling rate to modify the pitch of speech audio; synthesizing pitch-modified speech by combining linear predictive coding with a differential glottal wave model; computing the spectral envelope of the speech signal and applying a pitch-modification algorithm to change the pitch; or applying delay processing with a delay factor to achieve a pitch-modification effect.

In the above approaches, changing the playback sampling rate to modify the pitch affects the playback duration of the speech and may in turn affect its sound quality; moreover, the amount of computation is large, so fast pitch modification of speech cannot be achieved.

It should be noted that the information disclosed in the background section above is only intended to enhance understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to those of ordinary skill in the art.

Summary of the invention

The purpose of the present disclosure is to provide a speech processing method, a speech processing device, an electronic device, and a computer-readable storage medium, thereby overcoming, at least to some extent, the inability to modify the pitch of speech quickly and accurately that results from the limitations and defects of the related art.

Other features and advantages of the present disclosure will become apparent from the following detailed description, or may be learned in part through practice of the present disclosure.

According to one aspect of the present disclosure, a speech processing method is provided, including: receiving a speech signal acquired and sent by an audio collection device; performing, on the time-domain signal corresponding to the speech signal, pitch-modification processing for adjusting the sampling frequency, to obtain a pitch-modified speech signal; and performing playback-time preservation on the time-domain signal corresponding to the pitch-modified speech signal to obtain a target speech signal; wherein the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal.

In an exemplary embodiment of the present disclosure, performing, on the time-domain signal corresponding to the speech signal, pitch-modification processing for adjusting the sampling frequency to obtain the pitch-modified speech signal includes: dividing the time-domain signal corresponding to the speech signal into frames; windowing the time-domain signal corresponding to the framed speech signal to obtain the time-domain signal corresponding to the windowed speech signal; and processing the time-domain signal corresponding to the windowed speech signal according to an interpolation algorithm or a decimation algorithm to obtain the pitch-modified speech signal.

In an exemplary embodiment of the present disclosure, windowing the framed time-domain signal includes: applying a Hamming window to the time-domain signal of the framed speech signal.

In an exemplary embodiment of the present disclosure, processing the time-domain signal corresponding to the windowed speech signal according to an interpolation algorithm or a decimation algorithm to obtain the pitch-modified speech signal includes: determining the pitch-modified speech signal according to the sampling frequency of the speech signal, the sampling frequency of the pitch-modified speech signal, and the length of each frame of the speech signal.

In an exemplary embodiment of the present disclosure, raising the pitch of the speech signal corresponds to an increase in the playback time of the pitch-modified speech signal, and lowering the pitch of the speech signal corresponds to a decrease in the playback time of the pitch-modified speech signal.

In an exemplary embodiment of the present disclosure, performing playback-time preservation on the time-domain signal corresponding to the pitch-modified speech signal to obtain the target speech signal includes: determining a comparison result between a time-sequence variable and the overlap length between two frames of the speech signal obtained by framing; and, in combination with the comparison result, processing the length of each frame of the pitch-modified speech signal according to the length of each frame of the speech signal, and determining the target speech signal when the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal.

In an exemplary embodiment of the present disclosure, processing the length of each frame of the pitch-modified speech signal in combination with the comparison result and determining the target speech signal includes: if the time-sequence variable is less than the overlap length, determining the target speech signal according to the length of each frame of the speech signal, the length of each frame of the pitch-modified speech signal, and the overlap length; and if the time-sequence variable is greater than or equal to the overlap length, using the pitch-modified speech signal as the target speech signal.

According to one aspect of the present disclosure, a speech processing device is provided, including: a speech acquisition module, configured to receive a speech signal acquired and sent by an audio collection device; a pitch modification module, configured to perform, on the time-domain signal corresponding to the speech signal, pitch-modification processing for adjusting the sampling frequency, to obtain a pitch-modified speech signal; and a time preservation module, configured to perform playback-time preservation on the time-domain signal corresponding to the pitch-modified speech signal to obtain a target speech signal; wherein the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal.

According to one aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any one of the speech processing methods described above by executing the executable instructions.

According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, any one of the speech processing methods described above is implemented.

In the speech processing method, device, electronic device, and computer-readable storage medium provided by this exemplary embodiment, on the one hand, pitch-modification processing for adjusting the sampling frequency is performed on the time-domain signal corresponding to the speech signal sent to the audio processor. Because the processing operates on the time-domain signal, it avoids introducing harmonics that would affect sound quality, and it improves audio quality and accuracy. On the other hand, performing playback-time preservation on the time-domain signal of the pitch-modified speech signal to obtain the target speech signal avoids any effect on the playback time, so that the speech can be played normally and accurately. Furthermore, because pitch modification is performed only on the time-domain signal corresponding to the speech signal, complex computation is avoided, the amount of computation is reduced, computational efficiency is improved, and the pitch of speech can be modified quickly.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.

Description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure. Obviously, the drawings described below show only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 schematically shows a speech processing method in an exemplary embodiment of the present disclosure.

FIG. 2 schematically shows a detailed flowchart of pitch-modification processing in an exemplary embodiment of the present disclosure.

FIG. 3 schematically shows a flowchart of playback-time preservation in an exemplary embodiment of the present disclosure.

FIG. 4 schematically shows a block diagram of a speech processing device in an exemplary embodiment of the present disclosure.

FIG. 5 schematically shows a block diagram of a speech processing system in an exemplary embodiment of the present disclosure.

FIG. 6 schematically shows an electronic device in an exemplary embodiment of the present disclosure.

FIG. 7 schematically shows a computer-readable storage medium in an exemplary embodiment of the present disclosure.

Detailed description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known technical solutions are not shown or described in detail, to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated description of them is omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

Pitch modification methods in the related art include the following. The pitch of speech audio can be modified by changing the playback sampling rate: when the audio is played at a higher sampling rate, playback speeds up and the pitch rises, but the playback time becomes shorter; when the audio is played at a lower sampling rate, playback slows down and the pitch falls, but the playback time becomes longer. The pitch can also be changed by interpolation in the frequency domain; for example, to obtain a pitch at twice the frequency, frequency components with half the energy of the original frequency bins are interpolated. Interpolating in the frequency domain introduces harmonics and degrades the sound quality. Pitch-modified speech can also be synthesized by combining linear predictive coding with a differential glottal wave model: the speech signal is passed through the inverse filter of linear predictive coding to obtain a residual signal, which is then modeled in finer detail with the differential glottal wave model to obtain a high-quality glottal excitation signal, from which high-quality pitch-modified speech is synthesized. Alternatively, the spectral envelope is derived from the cepstral sequence of the speech signal, the envelope is used to separate the excitation component of the speech signal, and the pitch of the excitation component is changed by a pitch-modification algorithm. Computing the spectral envelope and similar steps require Fourier and inverse Fourier transforms of the speech signal, which is computationally expensive and not suitable for running on a DSP.
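The trade-off of the first related-art approach can be illustrated with simple arithmetic. The sketch below is only an illustration: the function names and the 16 kHz recording rate are hypothetical and do not come from the disclosure. It shows that playing the same samples at a different rate scales the pitch and the duration in opposite directions.

```python
def playback_duration(num_samples: int, playback_rate_hz: float) -> float:
    """Duration in seconds of num_samples played back at playback_rate_hz."""
    return num_samples / playback_rate_hz


def pitch_ratio(recorded_rate_hz: float, playback_rate_hz: float) -> float:
    """Factor by which every frequency is scaled when the playback rate changes."""
    return playback_rate_hz / recorded_rate_hz


recorded_rate = 16_000           # hypothetical recording rate in Hz
samples = 3 * recorded_rate      # three seconds of recorded speech

for rate in (16_000, 24_000, 8_000):
    # A faster playback rate raises the pitch but shortens the duration,
    # and a slower playback rate does the opposite.
    print(rate, playback_duration(samples, rate), pitch_ratio(recorded_rate, rate))
```

Playing the three-second recording at 24 kHz raises every frequency by a factor of 1.5 but shortens playback to two seconds, which is the side effect the method described below is intended to avoid.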

To solve the above problems, this exemplary embodiment first provides a speech processing method, which can be applied in games or other applications that support voice interaction. The speech processing method in this exemplary embodiment is described in detail below with reference to FIG. 1.

In step S110, a speech signal acquired and sent by an audio collection device is received.

In this exemplary embodiment, the audio collection device may be a microphone on a terminal, and the terminal may be any terminal capable of making calls, such as a smartphone, computer, smart watch, or smart speaker; a smartphone is used as an example here. In addition, this exemplary embodiment can be applied in games or other applications, in scenarios where the collected speech needs special processing to meet confidentiality or other requirements, that is, scenarios in which voice interaction or voice calls have a pitch-modification sound effect.

This exemplary embodiment is described using voice chat in a game as an example. Given that the voice call supports a pitch-modification sound effect, it can first be determined whether the effect is enabled, for example by checking the state of the control or button that represents the effect, or in other ways not described in detail here. If the effect is detected to be enabled, the audio collection device (microphone) can collect the speech signal uttered by the user playing the mobile game. Further, the microphone can send the collected speech signal to the DSP (digital signal processor) in the mobile phone, so that the DSP processes the received speech signal.

In step S120, pitch-modification processing for adjusting the sampling frequency is performed on the time-domain signal corresponding to the speech signal to obtain a pitch-modified speech signal.

In this exemplary embodiment, the speech signal may include a time-domain signal and a frequency-domain signal. A time-domain signal describes the relationship between a mathematical function or physical signal and time; the time-domain waveform of a speech signal expresses how the signal changes over time. A frequency-domain signal is the speech signal represented along the frequency axis. Converting a time-domain signal into a frequency-domain signal requires the Fourier series and the Fourier transform.

The main function of the pitch-modification processing may include, but is not limited to, modifying the pitch of the speech signal in the time domain, that is, processing the time-domain signal corresponding to the speech signal to change its pitch. Pitch modification means raising the pitch of the speech signal (pitch raising) or lowering it (pitch lowering). In addition, the change in pitch of the speech signal can be related to the sampling frequency: for example, if the sampling frequency after pitch modification is higher, the pitch is raised; if it is lower, the pitch is lowered. On this basis, the pitch-modification processing can be regarded as adjusting the sampling frequency. The sampling frequency defines the number of samples per second taken from the continuous speech signal to form a discrete signal. The specific execution of step S120 may be as shown in FIG. 2.

FIG. 2 schematically shows a flowchart of the pitch-modification processing. Referring to FIG. 2, the processing mainly includes steps S210 to S230, where:

In step S210, the time-domain signal corresponding to the speech signal is divided into frames.

In this step, the speech signal can be divided into frames so that it remains stable enough to meet the requirements of signal processing. Framing means splitting the speech signal into segments in order to analyze its characteristic parameters; each segment is called a frame, and the frame length is generally 20 to 50 ms. For the speech signal as a whole, the analysis therefore yields a time series of characteristic parameters made up of the characteristic parameters of each frame. In this exemplary embodiment, the speech signal collected by the microphone can be denoted x(n). After framing, each frame may have length N, and the overlap length (frame shift) between two consecutive frames, which prevents discontinuity between them, may be W. The n in x(n) denotes a point in the time sequence and can be called the time-sequence variable; n is an integer, with n = 0, 1, 2, 3, …, N−1. Framing the speech signal x(n) yields the framed speech signal x_m(n), where m indicates the m-th frame. The length N of each frame may be 512, although other values may also be used, and no special limitation is imposed here. It should be noted that framing in this step is performed on the time-domain signal of the speech signal.
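The framing step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed implementation: the function name `split_into_frames` is hypothetical, consecutive frames are assumed to share exactly W samples (so each frame starts N − W samples after the previous one), and a trailing partial frame is simply dropped.

```python
from typing import List, Sequence


def split_into_frames(x: Sequence[float], frame_len: int, overlap: int) -> List[List[float]]:
    """Split x into frames of frame_len samples; consecutive frames share `overlap` samples.

    Assumption: a trailing segment shorter than frame_len is discarded.
    """
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    hop = frame_len - overlap  # distance between the starts of consecutive frames
    frames = []
    start = 0
    while start + frame_len <= len(x):
        frames.append(list(x[start:start + frame_len]))
        start += hop
    return frames


signal = [float(n) for n in range(10)]
frames = split_into_frames(signal, frame_len=4, overlap=2)
# The last two samples of each frame are repeated at the start of the next frame.
print(frames)
```

With N = 512 and a sampling rate of 16 kHz (a hypothetical value), each frame would cover 32 ms, which falls inside the 20 to 50 ms range mentioned above.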

In step S220, the time-domain signal corresponding to the framed speech signal is windowed to obtain the time-domain signal corresponding to the windowed speech signal.

In this step, the processing is still performed on the time-domain signal of the speech signal. The purpose of windowing is to smooth the places where the speech signal is not continuous (where the last point of a frame meets the first point) and to avoid abrupt changes; in other words, windowing smooths the edges of each frame. In terms of the Fourier integral, windowing multiplies the original integrand by a specific window function, which achieves time-frequency localization. Windowing generally acts as a filter whose system function within the passband is not necessarily constant. Windowing is performed in the time domain, and the frequency-domain shape of the window function is a window that filters out out-of-band components, which is equivalent to a low-pass filter; with a rectangular filter, it is equivalent to low-pass filtering that directly removes the out-of-band high-frequency components.

In this exemplary embodiment, the time-domain signal of the framed speech signal may be windowed with, for example, a Hamming window or a rectangular window; the Hamming window is used as an example here. The main part of the window function corresponding to the Hamming window is shaped like sin(x) on the interval from 0 to π, and the remaining part is 0, so the product of this function with any other function is non-zero only over part of its range. The Hamming window applies a certain correction to the original speech signal sequence, yielding a better speech signal.

The Hamming window can be expressed by formula (1):
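Assuming the conventional definition of the Hamming window (the coefficients 0.54 and 0.46 are the standard textbook values and are an assumption here, not values recovered from the original formula), formula (1) can be written as:

```latex
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{1}
```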

where n is an integer representing a point in the time sequence (the time-sequence parameter), with n = 0, 1, 2, 3, …, N−1.

Windowing the time-domain signal of the framed speech signal with the Hamming window of formula (1) yields the windowed time-domain signal shown in formula (2):
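Assuming that formula (2) is the pointwise product of each frame x_m(n) with the window w(n), the windowing step can be sketched as below. The function names `hamming_window` and `apply_window` are hypothetical, and the window uses the conventional 0.54 and 0.46 coefficients, which are an assumption rather than values taken from the original formula.

```python
import math
from typing import List, Sequence


def hamming_window(length: int) -> List[float]:
    """Conventional Hamming window of the given length (assumed form of formula (1))."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def apply_window(frame: Sequence[float], window: Sequence[float]) -> List[float]:
    """Multiply a frame by a window sample by sample (assumed form of formula (2))."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [sample * weight for sample, weight in zip(frame, window)]


window = hamming_window(512)          # N = 512, as in the example above
windowed = apply_window([1.0] * 512, window)
# The edges of the frame are attenuated while the centre is almost unchanged.
print(windowed[0], windowed[255], windowed[511])
```

Note that the conventional Hamming window tapers to 0.08 at both ends of the frame rather than to exactly zero.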

Through steps S210 and S220, preprocessing operations such as framing and windowing are performed on the collected speech signal. This can eliminate the effects on speech signal quality of factors such as aliasing, higher-order harmonic distortion, and high-frequency components introduced by the human vocal organs themselves and by the equipment that collects the speech signal. It helps ensure that the signal obtained by subsequent speech processing is more uniform and smooth, provides high-quality parameters for signal parameter extraction, and improves the quality of speech processing.

In step S230, the time-domain signal corresponding to the windowed speech signal is processed according to an interpolation algorithm or a decimation algorithm to obtain the pitch-modified speech signal.

In this step, the interpolation algorithm and the decimation algorithm are both pitch-modification algorithms that change the pitch of the speech signal by adjusting its number of sampling points or its sampling frequency, and both operate on the time-domain signal of the speech signal. Specifically, the interpolation algorithm inserts zero values (that is, 0) wherever interpolation is needed, forming a new speech signal sequence. Interpolation algorithms may include, but are not limited to, linear interpolation, cubic interpolation, and so on, and the interpolation algorithm is used to raise the pitch. The specific process may include, for example, zero-padding expansion of the speech signal followed by interpolation filtering. The decimation algorithm takes one point out of every several points of the speech signal and arranges them in order to form a new speech signal sequence; its purpose is to lower the pitch.
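The two sample-count operations described above can be sketched in their simplest form. The function names `insert_zeros` and `decimate` are hypothetical, and the sketch deliberately omits the interpolation filtering mentioned in the text and any anti-aliasing filter, so it shows only how the number of samples changes, not a production-quality resampler.

```python
from typing import List, Sequence


def insert_zeros(frame: Sequence[float], factor: int) -> List[float]:
    """Zero-insertion: place factor - 1 zeros after every sample of the frame."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    out: List[float] = []
    for sample in frame:
        out.append(sample)
        out.extend([0.0] * (factor - 1))
    return out


def decimate(frame: Sequence[float], factor: int) -> List[float]:
    """Decimation (extraction): keep one sample out of every `factor`, starting with the first."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return list(frame[::factor])


print(insert_zeros([1.0, 2.0, 3.0], 2))                      # length grows from 3 to 6
print(decimate([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3))      # length shrinks from 7 to 3
```

In practice the zero-inserted sequence would be passed through an interpolation filter so that the inserted zeros take on values between the neighbouring samples.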

Processing the time-domain signal corresponding to the windowed speech signal according to the interpolation algorithm or the decimation algorithm to obtain the pitch-modified speech signal specifically includes: determining the pitch-modified speech signal according to the sampling frequency of the speech signal, the sampling frequency of the pitch-modified speech signal, and the length of each frame of the speech signal.

For example, if the sampling frequency of the speech signal before pitch modification is f and the sampling frequency of the pitch-modified speech signal is f₀, the speech signal after decimation or interpolation can be expressed by formula (3):

where n = 0, 1, 2, …, (N−1)×L+1; [·] denotes the rounding operation and mod denotes the remainder operation; M and L are both positive integers, and M/L is a fraction in lowest terms.

Further, after interpolation or decimation, the pitch-modified speech signal shown in formula (4) is obtained:

y_m(n) = z_m(Mn)  (4)

where n = 0, 1, 2, …, N×L/M.

It can be seen that when f > f₀, M > L and the pitch of the pitch-modified speech signal is raised; when f < f₀, M < L and the pitch of the pitch-modified speech signal is lowered.
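The relationship between the two sampling frequencies and the factors M and L can be sketched as follows. This assumes that M/L equals f/f₀ reduced to lowest terms, which is inferred from the statement that f > f₀ implies M > L rather than stated explicitly in the text; the function names and the example frequencies are hypothetical.

```python
from fractions import Fraction
from typing import Tuple


def resampling_factors(f: int, f0: int) -> Tuple[int, int]:
    """Return (M, L) such that M / L equals f / f0 in lowest terms (assumed relation)."""
    ratio = Fraction(f, f0)
    return ratio.numerator, ratio.denominator


def pitch_direction(f: int, f0: int) -> str:
    """Direction of the pitch change according to the comparison of M and L."""
    m, l = resampling_factors(f, f0)
    if m > l:
        return "raise"
    if m < l:
        return "lower"
    return "unchanged"


def resampled_frame_length(n: int, f: int, f0: int) -> int:
    """Approximate frame length N * L / M after L-fold interpolation and M-fold decimation."""
    m, l = resampling_factors(f, f0)
    return (n * l) // m


print(resampling_factors(48_000, 32_000))            # M and L for a hypothetical pair of rates
print(pitch_direction(48_000, 32_000))
print(resampled_frame_length(512, 48_000, 32_000))
```

Whenever M differs from L, the resampled frame no longer contains N samples, so its playback time at a fixed output rate differs from that of the original frame; this is the effect that the playback-time preservation step addresses.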

在步骤S130中，将变调后的语音信号对应的时域信号进行播放时间保持，以得到目标语音信号；其中，所述变调后的语音信号的播放时间与所述语音信号的播放时间相同。In step S130, the playback time of the time domain signal corresponding to the modulated speech signal is maintained to obtain the target speech signal; wherein the playback time of the modulated speech signal is the same as the playback time of the speech signal.

本示例性实施例中，若采用内插算法对语音信号进行升调，则变调后的语音信号的播放时间增加；若采用抽取算法对语音信号进行降调，则变调后的语音信号的播放时间减少。为了避免变调处理对播放时间的影响，可对变调后的语音信号执行播放时间保持处理。播放时间保持指的是对变调后的语音信号的时域信号进行处理，使得变调后的语音信号的播放时间与变调后的语音信号的播放时间相同，避免了相关技术中通过采样率实现音调变化时，由于语音播放速度的变化而导致的对语音的播放时间的影响。In this exemplary embodiment, if the interpolation algorithm is used to raise the pitch of the speech signal, the playback time of the changed speech signal will increase; if the extraction algorithm is used to lower the pitch of the speech signal, the playback time of the changed speech signal will be reduce. In order to avoid the impact of the pitch change processing on the playback time, playback time maintenance processing can be performed on the voice signal after the pitch change. Playback time preservation refers to processing the time domain signal of the pitch-modified speech signal so that the playback time of the pitch-modified speech signal is the same as the playback time of the pitch-modified speech signal, which avoids pitch changes through sampling rate in related technologies. When the voice playback speed changes, the impact on the voice playback time is caused.

Further, Figure 3 schematically shows a flow chart of playback time preservation. Referring to Figure 3, performing playback time preservation on the time-domain signal corresponding to the pitch-modified speech signal to obtain the target speech signal includes steps S310 and S320, where:

Step S310: determine the result of comparing the timing variable with the overlap length between two frames of the speech signal obtained by framing. Specifically, this means determining how the timing variable n (a point in the time sequence) compares with the overlap length W between the two frames. For example, for n = 1, 2, ..., W-1, the timing variable n is smaller than the overlap length; for n = W, W+1, ..., N, the timing variable n is greater than or equal to the overlap length.

Step S320: based on the comparison result, process the length of each pitch-modified frame according to the length of each frame of the speech signal, and determine the target speech signal when the playback time of the pitch-modified speech signal is the same as that of the speech signal. In other words, based on how the timing variable n compares with the overlap length W, the playback time of the pitch-modified speech signal is changed to the playback time of the speech signal before the pitch change. Playback time corresponds to frame length: if each frame of the speech signal has the same length, the playback time is the same. On this basis, the pitch-modified frames can be spliced together so that the length of the speech signal stays consistent. Further, when the length of each pitch-modified frame equals the length of the corresponding original frame, that is, when the playback time of the pitch-modified speech signal equals that of the original speech signal, that speech signal can be determined as the target speech signal.

Specifically, processing the length of each pitch-modified frame according to the length of each frame of the speech signal based on the comparison result, and determining the target speech signal when the playback times are the same, covers the following two cases.

Case 1: if the timing variable is less than the overlap length, the target speech signal is determined from the length of each frame of the speech signal, the length of each pitch-modified frame, and the overlap length. For example, suppose each frame of the speech signal has length N before the pitch change and each frame of the pitch-modified signal y_m(n) has length N/α. To keep the playback time unchanged, each pitch-modified frame must still have length N. If the timing variable is less than the overlap length, the target speech signal can be determined by formula (5) from the overlap length between the two frames, the synthesis shift (the difference between the frame length and the overlap length), and the offset (the starting position of the overlap between the two frames).

Case 2: if the timing variable is greater than or equal to the overlap length, the pitch-modified speech signal is used as the target speech signal. If the timing variable is greater than or equal to the overlap length and does not exceed the length N of each frame of the speech signal before the pitch change, then after length alignment the pitch-modified speech signal can be used directly as the final target speech signal. Specifically, the target speech signal can be determined by formula (5).
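For orientation, a conventional synchronized overlap-add (SOLA) cross-fade that matches the two cases above can be written as follows. This is an assumed form offered as a sketch, not the patent's formula (5): the linear cross-fade weights, the output symbol $\hat{y}$, and the placement of the m-th frame at position $ms$ are assumptions.

$$
\hat{y}(ms + n) =
\begin{cases}
\dfrac{W - n}{W}\,\hat{y}(ms + n) + \dfrac{n}{W}\,y_m(n + k_m), & n < W \\[2ex]
y_m(n + k_m), & W \le n \le N
\end{cases}
$$

In this form the first branch blends the already synthesized samples with the new frame across the overlap, and the second branch copies the new frame unchanged.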

Here, W is the overlap length of the two frames, s is the synthesis shift with s = N - W, and k_m is the offset. The offset matters because, when the pitch-modified speech signal is synthesized to restore its playback time, adjacent frames overlap but cannot simply be added together, as that would produce audible noise. To reduce this effect, the starting position of the overlap between the two frames is determined and taken as the offset. The offset changes dynamically, and noise is minimized when it satisfies the definition in formula (6):

Here, the offset represents the distance between the optimal matching point and the m-th window.
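The offset search and the two-case splice can be sketched as below. This is a minimal illustration under stated assumptions: the similarity measure is an unnormalised cross-correlation and the cross-fade is linear, neither of which is confirmed by the text, and the names `best_offset`, `overlap_add`, and `max_shift` are hypothetical.

```python
def best_offset(tail, frame, W, max_shift):
    """Return the shift k whose W-sample window of `frame` best matches `tail`.

    Unnormalised cross-correlation is an assumption; the patent's
    formula (6) may use a different criterion.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(max_shift + 1):
        score = sum(tail[n] * frame[n + k] for n in range(W))
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def overlap_add(output, frame, W, max_shift):
    """Append `frame` to `output`, cross-fading over the last W samples.

    Requires len(frame) >= W + max_shift.
    """
    if len(output) < W:
        output.extend(frame)
        return output
    tail = output[-W:]
    k = best_offset(tail, frame, W, max_shift)
    start = len(output) - W
    for n in range(W):  # n < W: weighted overlap of old and new samples
        fade = n / W
        output[start + n] = (1 - fade) * tail[n] + fade * frame[n + k]
    output.extend(frame[k + W:])  # n >= W: copy the new frame unchanged
    return output


print(best_offset([1, 2], [0, 0, 1, 2, 0], 2, 3))
```

Searching for the best-matching start position before blending is what keeps the overlapped region from producing the audible noise that direct superposition would cause.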

In this exemplary embodiment, steps S110 to S130 interpolate and decimate the time-domain signal corresponding to the speech signal while keeping the playback time unchanged, so the pitch of the speech signal is changed quickly without affecting the playback time. In addition, because the speech signal is interpolated in the time domain, the sound-quality degradation caused by introduced harmonics is avoided and the quality of the speech signal is improved. Furthermore, because interpolation and playback time restoration operate on the time-domain signal, complex operations such as the Fourier transform and its inverse are not needed. This reduces the amount of computation, so the whole pitch modification process can run directly on the DSP without occupying the CPU, which reduces latency and improves game performance and user experience.

This exemplary embodiment also provides a speech processing device. Referring to Figure 4, the speech processing device 400 mainly includes a speech acquisition module 401, a speech pitch modification module 402, and a time preservation module 403, where:

The speech acquisition module 401 may be used to receive a speech signal acquired and sent by an audio collection device;

The speech pitch modification module 402 may be used to perform pitch modification processing, which adjusts the sampling frequency, on the time-domain signal corresponding to the speech signal to obtain a pitch-modified speech signal;

The time preservation module 403 may be used to perform playback time preservation on the time-domain signal corresponding to the pitch-modified speech signal to obtain the target speech signal, where the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal.

In an exemplary embodiment of the present disclosure, the speech pitch modification module includes: a framing module, used to divide the time-domain signal corresponding to the speech signal into frames; a windowing module, used to window the time-domain signal corresponding to the framed speech signal to obtain the time-domain signal corresponding to the windowed speech signal; and a pitch modification control module, used to process the time-domain signal corresponding to the windowed speech signal according to an interpolation algorithm or a decimation algorithm to obtain the pitch-modified speech signal.

In an exemplary embodiment of the present disclosure, the windowing module includes: a windowing control module, used to apply a Hamming window to the time-domain signal of the framed speech signal to perform the windowing.
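The framing and Hamming windowing modules can be sketched as follows. The window uses the standard Hamming definition; the hop size N - W between frames and the function names are assumptions made for this illustration.

```python
import math


def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def split_frames(signal, N, W):
    """Split `signal` into frames of length N that overlap by W samples."""
    hop = N - W  # assumed hop between frame starts
    return [signal[i:i + N] for i in range(0, len(signal) - N + 1, hop)]


def windowed_frames(signal, N, W):
    """Frame the signal, then multiply each frame by the Hamming window."""
    w = hamming(N)
    return [[x * wn for x, wn in zip(frame, w)]
            for frame in split_frames(signal, N, W)]


print(split_frames(list(range(10)), 4, 2))
```

Trailing samples that do not fill a complete frame are dropped in this sketch; a real implementation would typically pad or buffer them.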

In an exemplary embodiment of the present disclosure, the pitch modification control module includes: a speech determination module, used to determine the pitch-modified speech signal according to the sampling frequency of the speech signal, the sampling frequency of the pitch-modified speech signal, and the length of each frame of the speech signal.

In an exemplary embodiment of the present disclosure, the time preservation module includes: a signal comparison module, used to determine the result of comparing the timing variable with the overlap length between two frames of the speech signal obtained by framing; and a target speech determination module, used to process, based on the comparison result, the length of each pitch-modified frame according to the length of each frame of the speech signal, and to determine the target speech signal when the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal.

In an exemplary embodiment of the present disclosure, the target speech determination module includes: a first determination module, used to determine the target speech signal according to the length of each frame of the speech signal, the length of each pitch-modified frame, and the overlap length if the timing variable is less than the overlap length; and a second determination module, used to take the pitch-modified speech signal as the target speech signal if the timing variable is greater than or equal to the overlap length.

It should be noted that the details of each module in the above speech processing device have already been described in the corresponding speech processing method, so they are not repeated here.

In addition, a speech processing system is also provided. Referring to Figure 5, the speech processing system 50 mainly includes a digital signal processor 51 and a central processing unit 52, where:

The digital signal processor 51 may be used to modify the pitch of the speech signal and to perform playback time preservation on the pitch-modified speech signal to obtain the target speech signal. Referring to Figure 5, the digital signal processor 51 mainly includes the following modules: a pitch modification module 511, used to perform pitch modification processing on the time-domain signal corresponding to the speech signal; and a playback time preservation module 512, used to perform playback time preservation on the pitch-modified speech signal so that its playback time is the same as that of the speech signal before the pitch change. Specifically, the pitch modification module 511 mainly includes a framing module 5111 for framing, a windowing module 5112 for windowing, and a pitch modification control module 5113 for pitch modification.

The central processing unit 52 is used to run games or application programs.

In addition, the speech processing system 50 may also include an audio collection device 53, used to collect speech signals and send them to the digital signal processor 51.

The whole process can therefore be as follows. The game runs on the phone's CPU. When the user turns on the pitch modification sound effect for a voice call, the microphone first collects the speech signal and sends it to the DSP. The pitch modification module then raises or lowers the pitch of the time-domain signal corresponding to the speech signal. Because the playback time of the speech signal processed by the pitch modification module becomes longer or shorter, the speech signal is next passed to the playback time preservation module so that the playback time is the same before and after the pitch change. Finally, the speech output by the playback time preservation module is sent by the DSP to the game process running on the CPU. As a result, during in-game chat the pitch of the speech signal changes but its playback time does not, so the pitch modification effect is achieved quickly and accurately. Because the pitch modification and playback time preservation algorithms can run on the DSP, they do not occupy the CPU, do not affect game performance or user experience, and can improve processing efficiency.
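The data flow described above can be sketched as a two-stage pipeline. This is only an architectural illustration: `pitch_shift` and `keep_duration` are hypothetical callables standing in for modules 511 and 512, and the length check is one simple way to express that the playback time must match.

```python
def dsp_process(frames, pitch_shift, keep_duration):
    """Run the two DSP stages on each captured frame.

    `pitch_shift` and `keep_duration` are hypothetical stand-ins for the
    pitch modification module and the playback time preservation module.
    """
    out = []
    for frame in frames:
        shifted = pitch_shift(frame)                    # length may change
        restored = keep_duration(shifted, len(frame))   # back to original length
        if len(restored) != len(frame):
            raise ValueError("playback time not preserved")
        out.append(restored)
    return out


# Toy stand-ins, used only to show the call shape.
halve = lambda f: f[::2]
repeat_to = lambda f, n: (f * 2)[:n]
print(dsp_process([[1, 2, 3, 4]], halve, repeat_to))
```

In the system described, both stages run on the DSP and only the finished frames are handed to the game process on the CPU.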

It should be noted that although several modules or units of the device for performing actions are mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided among multiple modules or units.

Furthermore, although the steps of the methods of the present disclosure are depicted in the drawings in a specific order, this does not require or imply that the steps must be performed in that order, or that all of the illustrated steps must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

Those skilled in the art will understand that aspects of the present invention may be implemented as a system, method, or program product. Accordingly, aspects of the present invention may take the form of an entirely hardware implementation, an entirely software implementation (including firmware, microcode, and the like), or an implementation combining hardware and software, all of which may be referred to herein as a "circuit", "module", or "system".

An electronic device 600 according to this embodiment of the present invention is described below with reference to Figure 6. The electronic device 600 shown in Figure 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.

As shown in Figure 6, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: the above-mentioned at least one processing unit 610, the above-mentioned at least one storage unit 620, and a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610).

The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section above. For example, the processing unit 610 may perform the steps shown in Figure 1.

The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 6201 and/or a cache storage unit 6202, and may further include a read-only memory (ROM) unit 6203.

The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a network environment.

The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.

The display unit 640 may be a display with a display function, used to show the processing results obtained when the processing unit 610 executes the method in this exemplary embodiment. The display includes, but is not limited to, a liquid crystal display or another type of display.

The electronic device 600 may also communicate with one or more external devices 800 (such as a keyboard, pointing device, or Bluetooth device), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router or modem) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 650. The electronic device 600 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 via the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

From the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here may be implemented in software, or in software combined with the necessary hardware. The technical solution according to the embodiments of the present disclosure may therefore be embodied as a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) or on a network, and which includes a number of instructions that cause a computing device (such as a personal computer, server, terminal device, or network device) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which is stored a program product capable of implementing the method described above in this specification. In some possible implementations, aspects of the present invention may also be implemented as a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present invention described in the "Exemplary Method" section above.

Referring to Figure 7, a program product 700 for implementing the above method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.

The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of these. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of these. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.

Program code contained on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of these.

Program code for performing the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device (for example, through the Internet using an Internet service provider).

Furthermore, the above drawings are only schematic illustrations of the processing included in the methods according to exemplary embodiments of the present invention and are not intended to be limiting. It is easy to understand that the processing shown in the drawings does not indicate or limit its chronological order. It is also easy to understand that the processing may be performed, for example, synchronously or asynchronously in multiple modules.

Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed here. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure indicated by the claims.

It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A speech processing method, characterized by comprising:

After turning on the pitch-changing sound effect, receive the voice signal acquired and sent by the audio collection device;

The time domain signal corresponding to the speech signal is divided into frames, the time domain signal corresponding to the framed speech signal is windowed, and pitch modification processing for adjusting the sampling frequency is performed on the time domain signal corresponding to the windowed speech signal according to an interpolation algorithm or a decimation algorithm, to obtain the pitch-modified speech signal;

According to the correspondence between the playback time and the length of each frame of the speech signal, the pitch-modified speech signal is spliced so that the length of the speech signal remains consistent, the playback time of the time domain signal corresponding to the pitch-modified speech signal is maintained, and the target speech signal is obtained when the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal;

Wherein, maintaining the playback time of the time domain signal corresponding to the pitch-modified speech signal, and obtaining the target speech signal when the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal includes:

Determine the comparison result between the time series variable and the overlap length between the two frames of speech signals obtained by dividing the time domain signal corresponding to the speech signal into frames;

If the timing variable is less than the overlap length, then, when the offset is such that the noise in the playback time restoration synthesis of the pitch-modified speech signal is minimized, the target speech signal is determined according to the offset, the difference between the length of each frame of the speech signal and the overlap length, and the overlap length between the two frames of the pitch-modified speech signal; the offset represents the distance between the optimal matching point and the m-th window;

If the timing variable is greater than or equal to the overlap length and does not exceed the length of each frame of the speech signal before pitch modification, then, after length alignment, using the pitch-modified speech signal as the target speech signal.
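The two branches of the playback time maintenance step can be illustrated with a minimal sketch. It is not taken from the patent: it assumes that "minimizing the synthesized noise" can be approximated by picking the offset with the highest normalized cross-correlation (as in synchronized overlap-add methods), and the names `best_offset`, `splice`, `overlap`, and `max_offset` are hypothetical.

```python
import math


def best_offset(tail, candidate, overlap, max_offset):
    """Return the offset k in [0, max_offset] whose segment
    candidate[k:k+overlap] best matches `tail`.

    Assumption: the best match is the one with the highest normalized
    cross-correlation; the claim only says the offset minimizes noise.
    """
    best_k, best_score = 0, float("-inf")
    energy_tail = sum(a * a for a in tail)
    for k in range(max_offset + 1):
        segment = candidate[k:k + overlap]
        if len(segment) < overlap:
            break  # not enough samples left to compare
        num = sum(a * b for a, b in zip(tail, segment))
        den = math.sqrt(energy_tail * sum(b * b for b in segment)) or 1.0
        score = num / den
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def splice(output, frame, overlap, max_offset):
    """Append a pitch-modified frame to the output signal.

    For a timing variable n < overlap, the samples are cross-faded at the
    best-matching offset; for n >= overlap, the frame is copied unchanged.
    """
    if len(output) < overlap:
        return output + frame  # nothing to align against yet
    tail = output[-overlap:]
    k = best_offset(tail, frame, overlap, max_offset)
    merged = []
    for n in range(overlap):  # branch 1: n < overlap
        fade = n / overlap
        merged.append((1.0 - fade) * tail[n] + fade * frame[k + n])
    # branch 2: n >= overlap, copy the rest of the frame directly
    return output[:-overlap] + merged + frame[k + overlap:]
```

In this sketch the offset `k` plays the role of the distance between the best matching point and the current window, and the linear cross-fade is only one possible way to combine the overlapping samples.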

2. The speech processing method according to claim 1, characterized in that windowing the framed time domain signal includes:

A Hamming window is used to perform the windowing process on the time domain signal of the framed speech signal.
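As an illustration of framing followed by Hamming windowing, the following sketch uses the standard Hamming definition w[k] = 0.54 - 0.46·cos(2πk/(N-1)). The function names and the choice of a fixed hop of `frame_len - overlap` samples are assumptions made for the example, not details stated in the claims.

```python
import math


def hamming_window(n):
    """Return an n-point Hamming window."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1))
            for k in range(n)]


def split_into_frames(signal, frame_len, overlap):
    """Split `signal` into frames of `frame_len` samples in which
    consecutive frames share `overlap` samples (assumed framing scheme)."""
    hop = frame_len - overlap
    if hop <= 0:
        raise ValueError("overlap must be smaller than frame_len")
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def window_frames(frames):
    """Multiply every frame sample-by-sample with a Hamming window."""
    windowed = []
    for frame in frames:
        w = hamming_window(len(frame))
        windowed.append([x * wk for x, wk in zip(frame, w)])
    return windowed
```

The window tapers each frame toward its edges, which reduces discontinuities when overlapping frames are later spliced together.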

3. The speech processing method according to claim 1, characterized in that performing pitch modification processing on the time domain signal corresponding to the windowed speech signal according to an interpolation algorithm or an extraction algorithm to obtain the pitch-modified speech signal includes:

The pitch-modified speech signal is determined according to the sampling frequency of the speech signal, the sampling frequency of the pitch-modified speech signal, and the length of each frame of the speech signal.
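One way to picture this dependence on the two sampling frequencies and the frame length is the sketch below. It assumes that the resampled frame length equals the original frame length scaled by `target_rate / source_rate`, and it uses plain linear interpolation as a stand-in for both the interpolation case (more samples) and the extraction, or decimation, case (fewer samples). Neither the scaling direction nor the interpolation kernel is specified by the claim, and `resample_frame` is a hypothetical name.

```python
def resample_frame(frame, source_rate, target_rate):
    """Resample one frame by linear interpolation.

    Assumption: the output length is len(frame) * target_rate / source_rate,
    rounded to the nearest integer (at least one sample).
    """
    n_in = len(frame)
    n_out = max(1, round(n_in * target_rate / source_rate))
    if n_in == 1 or n_out == 1:
        return [frame[0]] * n_out
    step = (n_in - 1) / (n_out - 1)  # input samples per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), n_in - 2)  # left neighbour index
        frac = pos - lo               # distance from the left neighbour
        out.append(frame[lo] * (1.0 - frac) + frame[lo + 1] * frac)
    return out
```

For example, `resample_frame(frame, 8000, 16000)` doubles the number of samples in a frame, while `resample_frame(frame, 16000, 8000)` halves it. A production implementation would normally add anti-aliasing filtering before reducing the sample count.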

4. The speech processing method according to claim 1, characterized in that raising the pitch of the speech signal corresponds to an increase in the playback time of the pitch-modified speech signal, and lowering the pitch of the speech signal corresponds to a decrease in the playback time of the pitch-modified speech signal.

5. A voice processing device, characterized by comprising:

A speech acquisition module, used to receive, after a pitch-changing sound effect is turned on, a speech signal acquired and sent by an audio collection device;

A speech pitch modification module, used to divide the time domain signal corresponding to the speech signal into frames, window the time domain signal corresponding to the framed speech signal, and perform, according to an interpolation algorithm or an extraction algorithm, pitch modification processing that adjusts the sampling frequency of the time domain signal corresponding to the windowed speech signal, to obtain the pitch-modified speech signal;

A time maintenance module, used to splice the pitch-modified speech signal according to the correspondence between the playback time and the length of each frame of the speech signal so that the length of the speech signal remains consistent, and to perform playback time maintenance on the time domain signal corresponding to the pitch-modified speech signal, so as to obtain the target speech signal when the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal; wherein the playback time of the pitch-modified speech signal is the same as the playback time of the speech signal.

6. An electronic device, characterized in that it includes:

a processor; and

a memory for storing executable instructions of the processor;

Wherein, the processor is configured to execute the speech processing method according to any one of claims 1-4 by executing the executable instructions.

7. A computer-readable storage medium on which a computer program is stored, characterized in that when the computer program is executed by a processor, the speech processing method according to any one of claims 1-4 is implemented.