CN112908351A - Audio tone changing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112908351A
CN112908351A (application number CN202110083776.4A)
Authority
CN
China
Prior art keywords
frequency
audio
signal
processing
frequency conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110083776.4A
Other languages
Chinese (zh)
Inventor
张超鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority claimed from application CN202110083776.4A
Publication of CN112908351A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L21/007 — Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 — Adapting to target pitch

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application discloses an audio tone changing method, an audio tone changing apparatus, an audio tone changing device, and a computer-readable storage medium, wherein the method includes: carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal; extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal, and generating spectral coefficients using the two envelopes; and carrying out weighted tone-changing output processing on the spectral distribution of the frequency conversion signal using the spectral coefficients to obtain the output audio. Because the spectral coefficients weight the output, the frequency conversion signal is corrected on the basis of the input audio signal, so the obtained output audio keeps a timbre consistent with that of the input audio signal, avoiding the poor timbre retention and degraded audio quality that otherwise follow tone changing.

Description

Audio tone changing method, device, equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio tone modification method, an audio tone modification apparatus, an audio tone modification device, and a computer-readable storage medium.
Background
Audio generally refers to sound waves audible to the human ear, with frequencies between 20 Hz and 20 kHz. For some audio (music, songs, etc.), the pitch may be turned up or down as needed to better suit listening preferences. In the related art, after such tone changing, the obtained audio has a poorer timbre than the input audio; that is, the timbre retention is poor.
Disclosure of Invention
In view of the above, an object of the present application is to provide an audio tone modifying method, an audio tone modifying apparatus, an audio tone modifying device, and a computer-readable storage medium, which correct the frequency conversion signal on the basis of the input audio signal, so that the obtained output audio can maintain a timbre consistent with that of the input audio signal, preventing the timbre of the output audio from being degraded.
In order to solve the above technical problem, in a first aspect, the present application provides an audio tonal modification method, including:
carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal;
extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating a spectral coefficient by using the first formant envelope and the second formant envelope;
and carrying out weighted tonal modification output processing on the frequency conversion frequency spectrum distribution of the frequency conversion signal by using the frequency spectrum coefficient to obtain an output audio.
Optionally, the generating spectral coefficients using the first formant envelope and the second formant envelope includes:
and obtaining the spectral coefficient by using the difference between the first formant envelope and the second formant envelope.
Optionally, the generating spectral coefficients using the first formant envelope and the second formant envelope includes:
generating initial spectral coefficients using the formant envelopes;
and performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
Optionally, the performing, by using the spectral coefficient, weighted tonal modification output processing on the frequency conversion spectral distribution of the frequency conversion signal to obtain an output audio includes:
multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum;
performing time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio;
and carrying out overlap-add processing on the time domain output audio to obtain the output audio.
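The optional steps above (weighting the frequency conversion spectrum, window-based time-domain conversion, overlap-add) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the hop size, the periodic Hann synthesis window, and the function names are all assumptions.

```python
import numpy as np

def weighted_transform_overlap_add(specs, coeffs, hop):
    """Weight each frame spectrum by its spectral coefficients, convert
    it back to the time domain through a synthesis window, and
    overlap-add the windowed frames into the output signal."""
    n_fft = 2 * (specs.shape[1] - 1)
    # periodic Hann window: with hop = n_fft // 2 it satisfies the
    # constant-overlap-add condition (adjacent windows sum to 1)
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    out = np.zeros(hop * (len(specs) - 1) + n_fft)
    for i, spec in enumerate(specs):
        frame = np.fft.irfft(spec * coeffs[i]) * win  # weighted spectrum -> windowed frame
        out[i * hop:i * hop + n_fft] += frame         # overlap-add
    return out
```

With all coefficients equal to 1, the interior of the output reproduces the framed input exactly, which is a convenient sanity check on the window and hop choice.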
Optionally, the extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal includes:
respectively carrying out frame processing and frequency domain conversion processing based on fundamental frequency on the input audio signal and the frequency conversion signal to obtain frequency conversion spectrum distribution and input spectrum distribution corresponding to each frame;
respectively carrying out power spectrum calculation and smoothing processing on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum;
respectively carrying out cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum;
and respectively carrying out cepstrum windowing and frequency spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by utilizing a cepstrum lifting window to obtain the first formant envelope and the second formant envelope.
Optionally, the frequency conversion processing the input audio signal to obtain a frequency-converted signal includes:
performing frame division processing on the input audio signal, and acquiring a tonal modification coefficient corresponding to each input frame;
determining an execution sequence by using the pitch-changing coefficient, and sequentially carrying out sampling processing and speed changing processing on the input frame according to the execution sequence to obtain a frequency-changing frame;
and splicing the frequency conversion frames to obtain the frequency conversion signal.
Optionally, the determining an execution order by using the pitch change coefficients includes:
acquiring a plurality of tonal modification coefficients in a current processing period, and determining the median of the tonal modification coefficients;
and determining the execution sequence corresponding to each input frame in the processing period according to the magnitude relationship between the median and a preset threshold.
Optionally, the splicing the frequency conversion frames to obtain the frequency conversion signal includes:
splicing the frequency conversion frames to obtain an initial frequency conversion signal;
and smoothing the initial frequency conversion signal by using a fade-in/fade-out weighting window to obtain the frequency conversion signal.
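The splicing-and-smoothing step above can be sketched as follows. This is a minimal illustration only: the linear fade shape and the function name are assumptions, since the description specifies only a fade-in/fade-out weighting window.

```python
import numpy as np

def splice_with_fade(frames, overlap):
    """Join consecutive frequency-converted frames, smoothing each seam
    with a fade-out/fade-in weighting window over `overlap` samples.
    The linear fade shape is an illustrative assumption."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.asarray(frames[0], dtype=float)
    for nxt in frames[1:]:
        nxt = np.asarray(nxt, dtype=float)
        # weighted cross-fade across the seam between the two frames
        seam = out[-overlap:] * (1.0 - fade_in) + nxt[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], seam, nxt[overlap:]])
    return out
```

Because the fade weights sum to one at every sample, a constant signal passes through the seams unchanged, which is the point of the smoothing.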
In a second aspect, the present application provides an audio tonal modification apparatus comprising:
the frequency conversion module is used for carrying out frequency conversion processing on the input audio signal to obtain a frequency conversion signal;
the frequency spectrum coefficient generating module is used for respectively extracting formant envelopes corresponding to the input audio signal and the variable frequency signal and generating frequency spectrum coefficients by utilizing the formant envelopes;
and the output audio generation module is used for performing weighted tonal modification output processing on the frequency conversion frequency spectrum distribution of the frequency conversion signal by using the frequency spectrum coefficient to obtain output audio.
In a third aspect, the present application provides an audio tonal modification apparatus comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the audio tonal modification method.
In a fourth aspect, the present application provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the audio transposition method described above.
According to the audio tone changing method, frequency conversion processing is carried out on an input audio signal to obtain a frequency conversion signal; a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal are extracted, and spectral coefficients are generated using the two envelopes; and weighted tone-changing output processing is carried out on the spectral distribution of the frequency conversion signal using the spectral coefficients to obtain the output audio.
Therefore, after the frequency conversion signal is obtained by carrying out frequency conversion processing on the input audio signal, the method extracts the formant envelopes of the input audio signal and the frequency conversion signal. Formants are regions of the sound spectrum where energy is relatively concentrated; they are a determining factor of timbre, and the formant envelope is information that characterizes the formants. The formant envelopes are used to generate the spectral coefficients, and the spectral coefficients are used for weighted tone-changing output; that is, in a weighted manner, the frequency conversion signal is corrected on the basis of the input audio signal, so that the obtained output audio keeps a timbre consistent with that of the input audio signal, avoiding the situation in which the timbre cannot be preserved and the audio quality is poor after tone changing, and thus solving the problems of poor audio quality and poor timbre retention after tone changing.
In addition, the application also provides an audio frequency tone changing device, an audio frequency tone changing device and a computer readable storage medium, and the beneficial effects are also achieved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic diagram of a hardware component framework for an audio pitch modification method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a hardware component framework for another audio pitch modification method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an audio pitch modification method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a variable frequency signal obtaining process according to an embodiment of the present disclosure;
fig. 5 is a logic diagram of a variable frequency signal acquisition process according to an embodiment of the present application;
fig. 6 is a schematic flow chart of a formant envelope extraction process provided in the embodiments of the present application;
fig. 7 is a graph illustrating a signal curve and a corresponding formant envelope curve according to an embodiment of the present disclosure;
FIG. 8 is a schematic illustration of a cepstral function curve and a cepstral curve provided in an embodiment of the present application;
fig. 9 is a schematic flowchart of a spectral coefficient obtaining process according to an embodiment of the present application;
fig. 10 is a flowchart illustrating an output audio obtaining process according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio pitch-changing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, when a user changes the pitch of audio (e.g., voice, pure music, songs), tools such as the WORLD vocoder or the TSM Toolbox are commonly used. WORLD performs tone changing by analyzing and re-synthesizing the audio. However, the quality of the signal obtained after tone changing is strongly affected by the fundamental-frequency parameters, the spectral envelope, and the aperiodicity parameters, and the timbre retention is poor. The TSM Toolbox uses HPS (Harmonic-Percussive Separation) and applies time-scale modification separately to the percussive and harmonic components (WSOLA, an overlap-add technique based on waveform similarity, for the former; PV, the phase vocoder, for the latter), then uses a resampling module to realize frequency conversion; it extracts the formant envelope of the frequency conversion signal by multiple cepstral iterations and uses that envelope to obtain the output audio. The whole process is computationally expensive, the timbre of the output audio is poor, and the timbre retention is weak. In the audio tone changing method provided by the embodiments of the present application, after the input audio signal is subjected to frequency conversion processing to obtain the frequency conversion signal, the formant envelopes of both the input audio signal and the frequency conversion signal are extracted. Formants are regions of the sound spectrum where energy is relatively concentrated; they are a determining factor of timbre, and the formant envelope is information that characterizes the formants.
The formant envelopes are used to generate spectral coefficients, and the spectral coefficients are used for weighted tone-changing output; that is, in a weighted manner, the frequency conversion signal is corrected on the basis of the input audio signal, so that the obtained output audio keeps a timbre consistent with that of the input audio signal, avoiding the situation in which the timbre cannot be preserved and the audio quality is poor after tone changing.
For ease of understanding, the hardware composition framework used by the scheme corresponding to the audio tonal modification method provided in the embodiments of the present application is described first. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition framework for an audio pitch modification method according to an embodiment of the present disclosure. The audio tonal modification device 100 may comprise a processor 101 and a memory 102, and may further comprise one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
Wherein, the processor 101 is configured to control the overall operation of the audio tonal modification apparatus 100 to complete all or part of the steps in the audio tonal modification method; the memory 102 is used to store various types of data to support the operation at the audio tonal modification device 100, which data may include, for example, instructions for any application or method operating on the audio tonal modification device 100, as well as application related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk. In the present embodiment, the memory 102 stores therein at least programs and/or data for realizing the following functions:
carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal;
extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating a spectral coefficient by using the first formant envelope and the second formant envelope;
and carrying out weighting modulation output processing on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectrum coefficient to obtain an output audio.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, mouse, or buttons; these buttons may be virtual or physical. The communication component 105 is used for wired or wireless communication between the audio tonal modification device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
The audio pitch changing apparatus 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the audio pitch changing method.
Of course, the structure of the audio tonal modification device 100 shown in fig. 1 does not constitute a limitation of the audio tonal modification device in the embodiments of the present application; in practical applications, the audio tonal modification device 100 may comprise more or fewer components than those shown in fig. 1, or combine certain components.
The audio tonal modification apparatus 100 in fig. 1 may be a terminal (e.g., a mobile terminal such as a mobile phone or tablet computer, or a fixed terminal such as a PC) or a server. In a specific embodiment, the audio transposition device 100 can utilize the communication component 105 to receive audio transmitted by other devices or terminals through a network; in another specific embodiment, the audio transposition device 100 can acquire input audio using the multimedia component 103; in yet another embodiment, the audio transposition device 100 can retrieve the input audio from the memory 102.
It is to be understood that, in the embodiment of the present application, the number of the audio tonal modification apparatuses is not limited, and it may be that a plurality of audio tonal modification apparatuses cooperate together to complete the audio tonal modification method. In a possible implementation manner, please refer to fig. 2, and fig. 2 is a schematic diagram of a hardware composition framework to which another audio pitch modification method provided in the embodiment of the present application is applied. As can be seen from fig. 2, the hardware composition framework may include: the first audio transposition device 11 and the second audio transposition device 12 are connected by a network 13.
In the embodiment of the present application, the hardware structures of the first audio transposition device 11 and the second audio transposition device 12 may refer to the audio transposition device 100 in fig. 1. That is, it can be understood that there are two audio tonal modification devices 100 in the present embodiment, and the two devices perform data interaction, so as to achieve the effect of tonal modification of audio. Further, in this embodiment of the application, the form of the network 13 is not limited, that is, the network 13 may be a wireless network (e.g., WIFI, bluetooth, etc.), or may be a wired network.
The first audio tonal modification device 11 and the second audio tonal modification device 12 may be the same kind of electronic device; for example, both may be servers. They may also be different kinds of electronic devices; for example, the first audio tonal modification device 11 may be a terminal or an intelligent electronic device, and the second audio tonal modification device 12 may be a server. In a possible embodiment, a server with high computing power can be used as the second audio tonal modification device 12 to improve data processing efficiency and reliability, and thus the processing efficiency of audio tonal modification, while a low-cost, widely applicable terminal or intelligent electronic device is used as the first audio tonal modification device 11 to realize the interaction between the second audio tonal modification device 12 and the user. It can be understood that tonal modification of an input audio signal necessarily begins with obtaining that signal, so the interaction process may be as follows: after acquiring the input audio signal, the terminal sends it to the server, and the server, upon receiving it, processes it to obtain the output audio. In another embodiment, the user may set the relevant parameters of the frequency conversion processing at the terminal, in which case the interaction process may be: after the terminal frequency-converts the input audio signal using those parameters, it sends the frequency conversion signal and the input audio signal to the server so that the server can execute the remaining steps.
Based on the above description, please refer to fig. 3, and fig. 3 is a flowchart illustrating an audio pitch modification method according to an embodiment of the present application. The method in this embodiment comprises:
s101: and carrying out frequency conversion processing on the input audio signal to obtain a frequency conversion signal.
The input audio signal may be audio in any format, such as WAV, MP3, or MPEG-4, which is not limited in this embodiment. A single input audio signal may be processed at a time, i.e., only one input audio signal undergoes frequency conversion processing at any moment, and a new input audio signal is acquired only after the frequency conversion signal of the current one has been obtained. In another embodiment, a plurality of input audio signals may be acquired and frequency-converted simultaneously. For example, a plurality of parallel channels may be provided, each performing frequency conversion processing on one input audio signal; when the input audio signal in a channel has been processed, a new input audio signal may be acquired and frequency-converted in that channel until no input audio signal remains. The channels are independent and do not influence each other. After the input audio signal is obtained, the frequency conversion signal may be obtained by directly performing frequency conversion processing on it; alternatively, any necessary pre-processing may be performed first, with the frequency conversion processing then applied to the processed input audio signal to obtain the corresponding frequency conversion signal.
Since the input audio signal is to be frequency-converted, it must first be acquired. This embodiment does not limit the specific acquisition manner; for example, the input audio signal may be captured by an audio acquisition device, which may be a microphone on an intelligent terminal. In another specific embodiment, a specified piece of audio may be read from a storage medium and used as the input audio signal; the storage medium may be local, or a portable medium such as a USB flash drive or a removable hard disk. In yet another embodiment, the input audio signal may be transmitted, in a wired or wireless manner, by another device or terminal, such as a smartphone or a server.
This embodiment does not limit the specific manner of the frequency conversion processing. It can be understood that frequency conversion compresses or expands the signal in the time domain, so in one specific implementation the frequency conversion processing may perform variable-speed processing (i.e., time-domain compression or expansion) first and resampling afterwards; in another, resampling may be performed first, followed by variable-speed processing. Further, this embodiment does not limit the specific sampling manner of the resampling: it may be up-sampling or down-sampling, selected according to the requirements of the frequency conversion processing. Nor does this embodiment limit the number of frequency conversion passes or their processing range. The number of passes counts the total processing applied to the entire input audio signal, and the processing range specifies which part of the input audio signal is to be frequency-converted; the two are correlated, since the larger the range of each pass, the fewer the passes, and the smaller the range, the more passes. In this embodiment, the range of each pass may be referred to as a frame: when the input audio signal is processed, it may first be divided into frames, with frequency conversion processing applied to each frame.
Further, in a possible implementation, the processing manner may differ from frame to frame, i.e., different frequency conversion processing may be applied to different frames, so that each part of the input audio signal is frequency-converted as actually needed; finally, the frames are spliced to obtain the corresponding frequency conversion signal.
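As a concrete illustration of the two-step frequency conversion described above (variable-speed processing plus resampling), the following sketch uses a deliberately naive overlap-add time stretch and a linear-interpolation resampler. A real implementation would use WSOLA-style similarity search and a band-limited resampler; the frame length, hop sizes, and function names here are all assumptions.

```python
import numpy as np

def time_stretch_ola(x, rate, frame_len=64, hop_out=32):
    """Naive overlap-add time stretch: read frames at hop_out * rate,
    write them at hop_out (rate > 1 shortens, rate < 1 lengthens)."""
    hop_in = int(round(hop_out * rate))
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // hop_in + 1
    out = np.zeros(hop_out * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop_out:i * hop_out + frame_len] += x[i * hop_in:i * hop_in + frame_len] * win
        norm[i * hop_out:i * hop_out + frame_len] += win
    return out / np.maximum(norm, 1e-8)  # normalize by the summed window

def resample_linear(x, ratio):
    """Linear-interpolation resampler: ratio > 1 shortens the signal."""
    idx = np.arange(0.0, len(x) - 1.0, ratio)
    lo = idx.astype(int)
    frac = idx - lo
    return x[lo] * (1.0 - frac) + x[lo + 1] * frac

def pitch_shift(x, factor):
    """Raise (factor > 1) or lower (factor < 1) the pitch: slow the
    signal down by `factor` first, then resample back toward the
    original duration."""
    return resample_linear(time_stretch_ola(x, 1.0 / factor), factor)
```

Swapping the order of the two steps, as the text notes, is equally valid: resampling first changes both pitch and duration, and the variable-speed step then restores the duration.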
S102: and extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating spectral coefficients by using the first formant envelope and the second formant envelope.
Formants are regions of the sound spectrum where energy is relatively concentrated; they are a determining factor of timbre, and the formant envelope is information that characterizes the formants. It should be noted that the formant envelope in this embodiment is not the envelope curve itself but the curve information corresponding to it; specifically, the formant envelope may take the form of a log power spectrum. After the input signal and the frequency conversion signal are obtained, their formant envelopes are extracted respectively, i.e., each of the two signals in turn serves as the signal from which a formant envelope is extracted, yielding the first formant envelope corresponding to the input audio signal and the second formant envelope corresponding to the frequency conversion signal. This embodiment does not limit the specific extraction manner; for example, the formant envelope may be extracted by multiple cepstral iterations, or in other ways. After the first formant envelope and the second formant envelope are obtained, they are used to generate the spectral coefficients.
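A single-pass version of the cepstral envelope extraction mentioned above can be sketched as follows. The patent describes multi-iteration cepstral extraction as one option; this one-shot lifter, and the lifting-window length, are simplifying assumptions for illustration.

```python
import numpy as np

def formant_envelope(frame, lifter_cut=20):
    """Cepstral-liftering estimate of the formant envelope, returned as
    a log-power spectrum. `lifter_cut` is an assumed cepstral lifting
    window length; larger values follow the spectrum more closely."""
    n = len(frame)
    spec = np.fft.rfft(frame * np.hanning(n))
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)  # log power spectrum
    cep = np.fft.irfft(log_power)                  # real cepstrum
    cep[lifter_cut:n - lifter_cut] = 0.0           # keep low quefrencies only (the envelope)
    return np.fft.rfft(cep).real                   # spectrum recovery: smoothed log envelope
```

Applying this once to each frame of the input audio signal and once to each frame of the frequency conversion signal yields the first and second formant envelopes discussed here.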
The spectral coefficients reflect the difference between the input audio signal and the frequency conversion signal. Because they are obtained from the formant envelopes, they reflect the envelope difference between the two signals, and the envelope difference is the timbre difference. By using the spectral coefficients in generating the output audio, the frequency conversion signal can be corrected based on the input audio signal, so that the output audio keeps a timbre consistent with the input audio signal, avoiding the situation in which the timbre cannot be kept and the audio quality is poor after pitch shifting. This embodiment does not limit the number of formant envelopes and spectral coefficients corresponding to the input audio signal and the frequency conversion signal. In a specific implementation, the input audio signal and the frequency conversion signal may each correspond to one formant envelope, in which case there is also one spectral coefficient. For example, when the input audio signal is short, the corresponding first formant envelope and the second formant envelope of the frequency conversion signal can be extracted directly, and the two envelopes used to obtain the corresponding spectral coefficient. In another specific embodiment, in order to shift the pitch of different portions of the input audio signal by different amounts, the formant envelope of each frame of the input audio signal may be extracted separately, so that there are a plurality of first formant envelopes, each corresponding to a different frame of the input audio signal.
Accordingly, when the second formant envelope corresponding to the frequency-converted signal is extracted, the second formant envelope may be extracted for each frame of the frequency-converted signal. It can be understood that the frequency conversion signal and the input audio signal should adopt the same framing manner, so that the first formant envelope of each frame of the input audio signal and the second formant envelope of each frame of the frequency conversion signal can be in one-to-one correspondence, and the corresponding first formant envelope and the second formant envelope can be used to obtain the correctly corresponding spectral coefficients.
In one embodiment, coefficients obtained by directly using the first formant envelope and the second formant envelope may be determined as spectral coefficients. In another embodiment, the directly obtained coefficients may be optimized, and the corresponding spectral coefficients are obtained after the optimization. The specific way of the optimization process is not limited, and for example, the optimization process may be a convolution smoothing process, or may be a linear suppression process, or may be an adjacent coefficient smoothing process.
S103: performing weighted tonal modification output processing on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectral coefficients to obtain output audio.
After the spectral coefficients are obtained, they may be used to perform weighted tonal modification output processing on the frequency conversion spectrum distribution of the frequency conversion signal. The frequency conversion spectrum distribution is the distribution of the frequency conversion signal in the frequency domain. Through the weighted tonal modification output, the frequency conversion spectrum distribution is weighted and adjusted in the frequency domain, i.e., the frequency conversion signal is corrected based on the input audio signal by means of weighting; after the correction is finished, the result is converted from the frequency domain to the time domain to finish the tonal modification output. It should be noted that, corresponding to step S102, if there are a plurality of spectral coefficients, the frequency spectrum corresponding to each frame of the frequency conversion signal is subjected to weighted tonal modification output processing using the respective spectral coefficient, and signal reconstruction is performed after the processing, so that the corresponding output audio is obtained. This embodiment does not limit the specific signal reconstruction method; for example, overlap-add may be used to obtain the output audio.
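As a minimal NumPy sketch of this weighted output step (the Hanning synthesis window, half-window hop, and the interpretation of the spectral coefficients as a log-power gain are assumptions, not details fixed by the text):

```python
import numpy as np

def weighted_pitch_shift_output(frame_specs, masks, frame_len, hop):
    """Weight each frame's frequency-converted spectrum by its spectral
    coefficients, return to the time domain, and rebuild the output
    audio by overlap-add."""
    out = np.zeros(hop * (len(frame_specs) - 1) + frame_len)
    window = np.hanning(frame_len)  # synthesis window (assumed)
    for i, (spec, mask) in enumerate(zip(frame_specs, masks)):
        # Mask(w) is a log-power difference, so the amplitude gain is exp(mask/2)
        weighted = spec * np.exp(mask / 2.0)
        frame = np.fft.irfft(weighted, n=frame_len)
        out[i * hop:i * hop + frame_len] += frame * window
    return out
```

With all-zero masks the gain is 1 everywhere, so the routine reduces to plain windowed overlap-add of the frequency conversion frames.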
By applying the audio pitch-shifting method provided by the embodiment of the application, after the input audio signal is frequency-converted to obtain the frequency conversion signal, the formant envelopes of the input audio signal and of the frequency conversion signal are extracted. Formants are regions where energy is relatively concentrated in the frequency spectrum of a sound and are a determining factor of timbre; the formant envelope is information representing the formants. The formant envelopes are used to generate the spectral coefficients, and the spectral coefficients are used for the weighted tonal modification output, i.e., the frequency conversion signal is corrected based on the input audio signal by means of weighting. The output audio therefore keeps a timbre consistent with the input audio signal, avoiding the situation in which the timbre cannot be kept and the audio quality is poor after pitch shifting, and solving the problems of poor audio quality and poor timbre preservation after pitch shifting.
In a specific implementation manner, an embodiment of the present application provides a specific frequency conversion processing procedure. Referring to fig. 4, fig. 4 is a schematic flow chart of a variable frequency signal acquiring process according to an embodiment of the present application, where the schematic flow chart includes:
S1011: the input audio signal is subjected to framing processing, and a pitch change coefficient corresponding to each input frame is acquired.
In this embodiment, the input audio signal may be subjected to framing processing to obtain a plurality of input frames, and each input frame may be subjected to frequency conversion processing of a different degree. The specific manner of the framing processing is not limited: for example, the length of each frame may be preset and the input audio signal framed according to that length, or the number of frames corresponding to the input audio signal may be preset and the signal framed according to that number. The degree of frequency conversion processing to be performed on each input frame can be represented by a pitch change coefficient, which may be denoted by γ. The pitch change coefficient may be input manually by a user, transmitted by another device or terminal, or pre-stored locally. The pitch change coefficient is compared with a preset threshold value to determine how to shift the pitch, i.e., whether to perform pitch-raising processing or pitch-lowering processing; this embodiment does not limit the specific size of the preset threshold, which may be 1, for example. When the pitch change coefficient is greater than 1, it is determined that pitch-raising processing needs to be performed on the input frame; when the pitch change coefficient is less than 1, it is determined that pitch-lowering processing needs to be performed. In another embodiment, the magnitude of the pitch change coefficient may also be related to the degree of pitch raising or lowering; for example, the range of the pitch change coefficient may be limited to [0, 2], and within this range, the further the coefficient is above 1, the greater the degree of pitch raising; correspondingly, the further it is below 1, the greater the degree of pitch lowering.
S1012: and determining an execution sequence by using the pitch-changing coefficient, and sequentially carrying out sampling processing and speed-changing processing on the input frame according to the execution sequence to obtain a variable frequency frame.
In the present embodiment, the frequency conversion processing of an input frame is realized by sampling processing and variable speed processing: the sampling processing resamples the input frame, and the variable speed processing compresses or expands it in the time domain. In order to ensure that the sound quality of the frequency conversion signal is maintained, different frequency conversion processing orders may be provided for different pitch-shift directions. Specifically, after the pitch change coefficient is obtained, an execution sequence can be determined from it, and the sampling processing and the variable speed processing are performed on the input frame in that sequence to obtain the corresponding frequency conversion frame. For example, when the pitch change coefficient is smaller than 1, pitch-lowering processing is needed: the input frame may first be up-sampled and then time-domain compressed (i.e., variable speed processing) to obtain the corresponding frequency conversion frame. When the pitch change coefficient is larger than 1, pitch-raising processing is needed: the input frame is first subjected to time domain expansion processing and then down-sampled after the expansion to obtain the corresponding frequency conversion frame.
Referring to fig. 5, fig. 5 is a logic diagram of a frequency conversion signal obtaining process according to an embodiment of the present disclosure. Case (a) applies when the pitch change coefficient is less than 1: first, the signal x_in(n) corresponding to the input frame (or the entire input audio signal) is sampled (i.e., resampled) to obtain an intermediate signal x_r(k, l); the intermediate signal is then subjected to variable speed processing (i.e., PV/PSOLA, Phase Vocoder / Pitch Synchronous Overlap-Add processing) to obtain the frequency conversion frame (or the whole frequency conversion signal) x_fs(n). Case (b) applies when the pitch change coefficient is greater than 1: first, the signal corresponding to the input audio signal is subjected to variable speed processing to obtain an intermediate signal x_ts(k, l), and the intermediate signal is then sampled to obtain the frequency conversion frame x_fs(n). Here n denotes the frame number.
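The two orderings can be sketched as follows. The linear-interpolation resampler and the trivial time-scaling stand-in are placeholders (a real implementation would use a phase vocoder or PSOLA for the variable speed step), so only the ordering logic reflects the text:

```python
import numpy as np

def resample(x, ratio):
    # Linear-interpolation resampling: ratio > 1 produces more samples.
    n_out = int(round(len(x) * ratio))
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def time_scale(x, ratio):
    # Placeholder for PV/PSOLA: a real variable speed step would preserve
    # pitch while changing duration; here we only change the length.
    return resample(x, ratio)

def pitch_shift_frame(frame, gamma):
    """gamma < 1 (pitch down): up-sample first, then time-domain compression.
       gamma > 1 (pitch up): time-domain expansion first, then down-sampling."""
    if gamma < 1:
        mid = resample(frame, 1.0 / gamma)   # sampling processing first
        return time_scale(mid, gamma)        # then compression to original length
    mid = time_scale(frame, 1.0 / gamma)     # time domain expansion first
    return resample(mid, gamma)              # then down-sampling
```

Either ordering returns a frame of (approximately) the original length, which is what allows the frequency conversion frames to be spliced back together in S1013.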
S1013: and splicing the frequency conversion frames to obtain frequency conversion signals.
After all the frequency conversion frames corresponding to the input audio signal are obtained, all the frequency conversion frames are spliced to obtain the frequency conversion signal. The embodiment does not limit the specific splicing manner, for example, each frequency conversion frame may be directly subjected to time domain splicing, or may be subjected to certain processing after being directly spliced, so as to obtain a frequency conversion signal.
By applying the audio pitch-shifting method provided by the embodiment of the application, different parts of the input audio signal can be pitch-shifted to different degrees by framing the input audio signal and setting a pitch change coefficient for each input frame. Meanwhile, a suitable frequency conversion order can be selected for each part according to its pitch change coefficient, so as to maintain the sound quality to the maximum extent and thus prepare for preserving the timbre after the subsequent pitch shift.
Based on the above embodiment, in a possible implementation, the pitch change coefficients of adjacent input frames may lie in different intervals, i.e., switch back and forth between states greater than 1 and less than 1, which may cause an audible artifact, specifically a "click" at the switching point of two frequency conversion frames. In order to solve this problem, step S1013 may include:
step 11: and splicing the frequency conversion frames to obtain an initial frequency conversion signal.
The specific splicing manner is not limited in this embodiment, and for example, the splicing may be performed directly, that is, a plurality of frequency conversion frames are connected according to the sequence of the corresponding input frames, so that the initial frequency conversion signal may be obtained.
Step 12: smoothing the initial frequency conversion signal by using a fade-in/fade-out weighting window to obtain the frequency conversion signal.
The fade-in/fade-out weighting window performs fade-in and fade-out processing on the signal, avoiding the "click" caused by signal discontinuity at the frame switching position. This embodiment does not limit the specific manner of the smoothing; for example, only part of the initial frequency conversion signal may be smoothed. Since the discontinuity problem only occurs at the switching position of two frames, the splice position of the signal, i.e., the switching position of two frames, may be suppressed with the fade-in/fade-out weighting window so as to smooth the initial frequency conversion signal.
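One common realization of such smoothing is a short crossfade at each splice point, sketched below; the overlap-crossfade form and ramp length are assumptions, since the text only requires fade-in/fade-out weighting at the switching positions:

```python
import numpy as np

def crossfade_splice(frames, fade_len):
    """Splice frequency conversion frames, crossfading `fade_len` samples at
    each junction with complementary fade-out/fade-in ramps to suppress the
    'click' caused by a discontinuity at the switching point."""
    out = frames[0].astype(float).copy()
    fade_out = np.linspace(1.0, 0.0, fade_len)
    fade_in = 1.0 - fade_out
    for nxt in frames[1:]:
        nxt = nxt.astype(float)
        # blend the tail of the signal so far with the head of the next frame
        out[-fade_len:] = out[-fade_len:] * fade_out + nxt[:fade_len] * fade_in
        out = np.concatenate([out, nxt[fade_len:]])
    return out
```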
By applying the audio pitch-shifting method provided by the embodiment of the application, the initial frequency conversion signal obtained by direct splicing is smoothed with the fade-in/fade-out weighting window, so that the audible discontinuity caused by signal discontinuity can be avoided.
In another embodiment, discontinuity of signals can be avoided by determining an execution sequence corresponding to each input frame in the processing cycle, so as to avoid a discontinuity problem in hearing, and the step S1013 may include:
step 21: and acquiring and processing a plurality of pitch coefficients in the current period, and determining the median of each pitch coefficient.
In this embodiment, processing cycles are provided; the processing cycles are consecutive, and each processing cycle includes a plurality of consecutive frequency conversion frames (with their corresponding input frames). When determining the execution sequence corresponding to a certain input frame, the pitch change coefficients corresponding to all input frames in the processing cycle in which it is located (i.e., the current processing cycle) may be obtained, and their median determined. The median represents the main range of the pitch change coefficients in the current processing cycle. This embodiment does not limit the specific arrangement of the processing cycles; for example, a preset number of input frames may be set for each processing cycle, or a preset number of processing cycles may be set for each input audio signal. In another embodiment, it may also be detected whether there are multiple consecutive input frames whose pitch change coefficients switch back and forth between states greater than 1 and less than 1, and if so, those input frames are assigned to the same processing cycle.
Step 22: and determining the execution sequence corresponding to each input frame in the processing period according to the size relation between the median and a preset threshold.
The median is compared with the preset threshold to determine their size relationship. This relationship represents the execution sequence appropriate for most input frames in the current processing cycle, and it is determined as the execution sequence for every input frame in the current processing cycle. Since each input frame in the cycle is frequency-converted with the same execution sequence, the frequency conversion frames obtained after processing are continuous as a signal, and the signal is therefore continuous to the ear.
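The median-based decision might look like the following sketch (the returned labels are illustrative names only, mapping to the two orderings of S1012):

```python
import numpy as np

def cycle_execution_order(gammas, threshold=1.0):
    """Choose one execution sequence for all input frames of a processing
    cycle by comparing the median of the cycle's pitch change coefficients
    with the preset threshold."""
    med = float(np.median(gammas))
    # median below threshold -> treat the cycle as pitch-down (resample first);
    # otherwise treat it as pitch-up (time-scale first)
    return 'resample_first' if med < threshold else 'timescale_first'
```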
By applying the audio frequency tone changing method provided by the embodiment of the application, the discontinuity on the signal and the discontinuity on the auditory sense can be avoided without performing other processing on the signal directly obtained after frequency conversion processing.
In another possible embodiment, the above two modes may be combined, that is, after the input frames in the current processing cycle are processed in the same execution order, the switching positions of the processing cycles may be processed by using the fade-in and fade-out weighting window.
Based on the foregoing embodiments, in a specific implementation manner, the present application provides a specific formant envelope extraction process. Referring to fig. 6, fig. 6 is a schematic flow chart of a formant envelope extraction process according to an embodiment of the present application, which includes:
S1021: respectively performing framing processing and fundamental-frequency-based frequency domain conversion processing on the input audio signal and the frequency conversion signal to obtain the frequency conversion spectrum distribution and the input spectrum distribution corresponding to each frame.
Specifically, the fundamental-frequency-based framing determines a framing window using the fundamental frequency and frames the input audio signal and the frequency conversion signal according to that window; after framing, frequency domain conversion processing converts the signals into frequency domain signals, yielding the frequency conversion spectrum distribution and the input spectrum distribution corresponding to each frame. The fundamental frequency is that of the input signal, and its specific obtaining manner is not limited: a fundamental frequency extraction tool may be used, or fundamental frequency information specifying the value of the fundamental frequency may be obtained. The specific form of the extraction tool is not limited either; it may be, for example, Harvest, pYIN, or CREPE.
Specifically, let the length of the fundamental period be T0. When framing, a window of length 3T0 (1.5T0 on each side of the frame center) may be used to frame the input audio signal and the frequency conversion signal. A Hanning window is taken as the window function to obtain the framed frame signal sequence, and frequency domain conversion is performed by STFT (short-time Fourier transform) to obtain the frequency conversion spectrum distribution and the input spectrum distribution of the signals. This embodiment does not limit the window function: a Hanning window is used here as an example, and other window functions may be used in other embodiments. Specifically, when the input audio signal is x_in(t) and the frequency conversion signal is x_fs(t), the corresponding input spectrum distribution and frequency conversion spectrum distribution are:

X_in(ω, n) = F{w_hann(t)·x_in(t)}

X_fs(ω, n) = F{w_hann(t)·x_fs(t)}

where n represents the frame number after framing, F is the short-time Fourier transform, w_hann is the Hanning window, ω represents the digital angular frequency variable of the frame, X_in(ω, n) is the input spectrum distribution, X_fs(ω, n) is the frequency conversion spectrum distribution, and t is time.
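A sketch of the f0-based framing and STFT (the half-window hop is an assumption; the text fixes only the 3·T0 window length and the Hanning window):

```python
import numpy as np

def f0_framed_spectra(x, f0, fs):
    """Frame the signal with a Hanning window of length 3*T0, where
    T0 = fs/f0 samples, hop half a window, and FFT each frame."""
    t0 = int(round(fs / f0))
    win_len = 3 * t0
    hop = win_len // 2            # hop size is an assumption
    window = np.hanning(win_len)
    spectra = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        spectra.append(np.fft.rfft(frame))
    return np.array(spectra)
```

For the 586 Hz voiced example used later in the text (fs = 48000 Hz), T0 ≈ 82 samples, so the window is 246 samples long.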
S1022: and respectively carrying out power spectrum calculation and smoothing treatment on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum.
Power spectrum calculation is performed on the input spectrum distribution and the frequency conversion spectrum distribution respectively using the power spectrum formula, and smoothing is performed over a preset length after the calculation. This embodiment does not limit the specific size of the preset length; for example, 2/3 of the digital angular frequency corresponding to the fundamental frequency may be used as the preset length to linearly smooth the power spectrum. Specifically, the power spectrum formula is:

P(ω) = |F_T(ω)|²

where P(ω) is the power spectrum obtained directly by power spectrum calculation on the signal F_T(ω), which may be called the initial power spectrum; the power spectrum is obtained by smoothing the initial power spectrum. F_T(ω) may specifically be X_in(ω, n) or X_fs(ω, n), i.e., the input spectrum distribution corresponding to any frame obtained after the input audio signal is framed, or the frequency conversion spectrum distribution corresponding to any frame obtained after the frequency conversion signal is framed.
Specifically, the smoothing may be performed using the following formula:

P_s(ω) = conv(P(ω), W_rect(ω))

where P_s(ω) is the power spectrum; specifically, P_s_in(ω) denotes the input power spectrum and P_s_fs(ω) the frequency conversion power spectrum. With ω_0 the length of the fundamental frequency in the frequency domain, W_rect is a normalized rectangular window of length 2ω_0/3 that linearly smooths the power spectrum. In other embodiments, rectangular windows of other lengths may also be used for the smoothing.
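The power spectrum plus rectangular-window smoothing might be sketched as follows; the smoothing window length is passed in directly as a bin count, which would correspond to 2ω_0/3 expressed in bins:

```python
import numpy as np

def smoothed_power_spectrum(spec, smooth_bins):
    """Power spectrum |F(w)|^2 followed by linear smoothing with a
    normalized rectangular window of `smooth_bins` bins."""
    p = np.abs(spec) ** 2
    kernel = np.ones(smooth_bins) / smooth_bins  # rectangular window
    return np.convolve(p, kernel, mode='same')
```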
S1023: and respectively carrying out cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum.
The cepstrum processing first takes the logarithm and then transforms the logarithm back. Specifically, the logarithmic step may be performed as:

P̂_s(ω) = log(P_s(ω))

where log is the logarithmic function, P_s(ω) may specifically be P_s_in(ω) or P_s_fs(ω), and P̂_s(ω) is the intermediate logarithmic data. Then:

ĉ(τ) = F⁻¹{P̂_s(ω)}

performs the cepstrum processing, where ĉ(τ) is the cepstrum. Depending on the input signal, this embodiment uses ĉ_in(τ) to represent the input cepstrum and ĉ_fs(τ) to represent the frequency conversion cepstrum. F⁻¹ is the inverse Fourier transform, and τ is the quefrency argument.
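The log-then-inverse-transform step can be sketched as follows (the small eps guard against log(0) is an implementation assumption):

```python
import numpy as np

def power_cepstrum(power_spec, eps=1e-12):
    """Cepstrum of a (smoothed) one-sided power spectrum: take the log,
    then an inverse real FFT back to the quefrency domain."""
    log_p = np.log(power_spec + eps)   # eps guards against log(0)
    return np.fft.irfft(log_p)
```

A flat power spectrum has a constant log spectrum, so its cepstrum is an impulse at quefrency zero.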
S1024: respectively performing cepstrum windowing and spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by using the cepstrum lifting window to obtain the first formant envelope and the second formant envelope.
After the input cepstrum and the frequency conversion cepstrum are obtained, a cepstrum lifting window can be constructed; cepstrum windowing processing is performed on the input cepstrum and the frequency conversion cepstrum using the cepstrum lifting window, and spectrum recovery processing then restores the data to the frequency domain for the subsequent steps. Specifically, the cepstrum lifting window may be constructed as:

l_s(τ) = sinc(f_0·τ)

l_q(τ) = (1 − 2q) + 2q·cos(2π·f_0·τ)

w(τ) = l_s(τ)·l_q(τ)

where sinc represents the sampling function and q is a constant; as a rule of thumb, q = −0.09. l_s is the sampling function, l_q is a raised cosine function, w(τ) is the cepstrum lifting window function, and cos is the cosine function.

After obtaining the cepstrum lifting window,

ĉ_w(τ) = w(τ)·ĉ(τ)

performs the cepstrum windowing, and

L̂En(ω) = F{ĉ_w(τ)}

performs the spectrum recovery processing to obtain the signal logarithmic formant envelope L̂En(ω). In this embodiment, L̂En_in(ω) represents the first-signal log formant envelope corresponding to the input audio signal, and L̂En_fs(ω) represents the second-signal log formant envelope corresponding to the frequency conversion signal. After the signal log formant envelopes are obtained, it is also necessary to take:

LEn_in(ω) = Re{L̂En_in(ω)}

LEn_fs(ω) = Re{L̂En_fs(ω)}

to obtain the corresponding first formant envelope and second formant envelope, where LEn_in(ω) is the first formant envelope and LEn_fs(ω) is the second formant envelope. It should be noted that, since the input audio signal and the frequency conversion signal are framed, the obtained first formant envelope corresponds to one frame of the input audio signal, and the obtained second formant envelope corresponds to the matching frame of the frequency conversion signal. After every frame is processed, the first formant envelope corresponding to the whole input audio signal and the second formant envelope corresponding to the whole frequency conversion signal are obtained.
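A sketch of the liftering and recovery; the exact raised-cosine form of l_q and taking the real part after the forward transform are assumptions consistent with the description above:

```python
import numpy as np

def lifter_envelope(cepstrum, f0_norm, q=-0.09):
    """Recover a log formant envelope: weight the cepstrum with the lifter
    w(tau) = l_s(tau) * l_q(tau) (sinc times raised cosine, zero at the
    fundamental quefrency and its multiples), then transform back to the
    frequency domain. f0_norm is f0/fs; q = -0.09 as in the text."""
    tau = np.arange(len(cepstrum))
    ls = np.sinc(f0_norm * tau)                                   # sampling function l_s
    lq = (1 - 2 * q) + 2 * q * np.cos(2 * np.pi * f0_norm * tau)  # raised cosine l_q
    w = ls * lq                                                   # cepstrum lifting window
    return np.fft.rfft(cepstrum * w).real                         # back to log-frequency domain
```

Because sinc vanishes at integer arguments, w(τ) is zero at the fundamental quefrency and its multiples, which is what removes the excitation pulse sequence from the envelope.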
Referring to fig. 7, fig. 7 is a graph illustrating a signal curve and the corresponding formant envelope curves according to an embodiment of the present disclosure. It compares the power spectrum curves before and after smoothing: the abscissa is the frequency bin distribution (bins) and the ordinate is the spectral power in decibels. Taking a voiced signal with a sampling rate f_s of 48000 Hz and a fundamental frequency f_0 of 586 Hz as an example, the corresponding fundamental bin is bin = f_s/f_0 ≈ 82. The three curves in fig. 7 represent: the log power spectrum curve of the original signal (power spectrum), the smoothed log spectrum curve obtained by linear smoothing (smoothed spectrum), and the signal log spectrum envelope curve obtained by spectrum lifting, i.e., cepstrum windowing and spectrum recovery (spectral envelope).
Referring to fig. 8, fig. 8 is a schematic diagram of cepstrum lifting function curves and cepstrum curves according to an embodiment of the present disclosure, comparing the curves before and after cepstrum lifting. In the upper waveform diagram of fig. 8, the abscissa is the frequency bin and the ordinate is the function amplitude; in the lower waveform diagram, the abscissa is the quefrency point (quefrency) and the ordinate is the log-cepstrum value. The curves in the upper diagram are the sampling function curve l_s(τ), the raised cosine curve l_q(τ), and the cepstrum lifting window curve l_s(τ)·l_q(τ), i.e., w(τ). Again taking the voiced signal with sampling rate f_s = 48000 Hz and fundamental frequency f_0 = 586 Hz as an example, the corresponding fundamental bin is bin = f_s/f_0 ≈ 82. The w(τ) curve ideally takes the value zero at the fundamental quefrency (bin 82) and its multiples, so the influence of the excitation pulse sequence on the signal envelope extraction can be removed. The lower waveform diagram shows the cepstrum curves of the 586 Hz voiced signal, including the raw cepstrum curve and the weighted cepstrum curve after cepstrum lifting.
By applying the audio pitch-shifting method provided by the embodiment of the application, the formant envelopes corresponding to the input audio signal and the frequency conversion signal can be obtained with only one round of calculation in the manner above, without repeated iterative extraction, which improves the calculation speed and efficiency and reduces the consumption of computing resources and time.
Based on the foregoing embodiments, in a specific implementation manner, in order to obtain accurate spectral coefficients, the step of generating the spectral coefficients using formant envelopes may include:
step 31: and obtaining a spectral coefficient by using the difference between the first formant envelope and the second formant envelope.
In this embodiment, the spectral coefficients may be represented by Mask(ω). In order to obtain accurate spectral coefficients, i.e., to accurately represent the timbre difference between the frequency conversion signal and the input signal, the spectral coefficients may be calculated by subtraction. Specifically:

Mask(ω) = LEn_in(ω) − LEn_fs(ω)

yields the spectral coefficients.
It should be noted that, step 31 is not the only way to calculate the spectral coefficients, and the audio pitch modification method provided by the present application may calculate the spectral coefficients by using other calculation ways, for example, may calculate the spectral coefficients by subtracting, dividing, and the like after weighting, as long as the difference between the timbres of the frequency-converted signal and the input signal can be represented.
In another embodiment, initial spectral coefficients are first calculated directly from the formant envelopes, and the corresponding spectral coefficients are obtained after the initial spectral coefficients are optimized. Specifically, please refer to fig. 9, where fig. 9 is a schematic flow chart of a spectral coefficient obtaining process according to an embodiment of the present disclosure. The step of generating the spectral coefficients using the formant envelopes may comprise:
S1025: generating initial spectral coefficients by using the first formant envelope and the second formant envelope.

In this embodiment, what the first formant envelope and the second formant envelope directly generate are not the spectral coefficients but initial spectral coefficients, which need to be optimized to obtain the spectral coefficients.
S1026: and performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
In this embodiment, the optimization processing may include one or more of convolution smoothing processing, fundamental-frequency-based linear suppression processing, and adjacent coefficient smoothing processing; any one or combination of them may be selected as the specific optimization applied to the initial spectral coefficients to obtain the spectral coefficients. The convolution smoothing processing prevents the spectral coefficients from jittering excessively, and may smooth the initial spectral coefficients by convolution with a triangular window. Specifically:

M(ω) = conv(Mask(ω), W_tri(ω))

performs the convolution smoothing, where conv is the convolution calculation and M(ω) is the convolution-smoothed initial spectral coefficient, which may be used directly as the spectral coefficient or as intermediate data for subsequent processing. W_tri(ω) represents a triangular window function constructed from three data points, which may be, for example, W_tri(ω) = [0.25, 0.50, 0.25].
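The three-point triangular-window convolution can be sketched as:

```python
import numpy as np

def smooth_mask(mask, w_tri=(0.25, 0.5, 0.25)):
    """Convolve the initial spectral coefficients Mask(w) with the
    three-point triangular window W_tri = [0.25, 0.50, 0.25] to tame
    bin-to-bin jitter."""
    return np.convolve(mask, np.asarray(w_tri), mode='same')
```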
The fundamental-frequency-based linear suppression is used to suppress (i.e., fade in) the frequency segments of the initial spectral coefficients (or of intermediate data produced by other optimization steps) whose frequencies are below the fundamental frequency, which prevents the weighting step (i.e., the weighting portion of the weighted tonal modification output processing) from destabilizing the low-frequency signal energy. For example, applying the fundamental-frequency-based linear suppression to the convolution-smoothed initial spectral coefficients may proceed as follows:
Ma(ω) = (ω/ω0)·M(ω), ω < ω0;  Ma(ω) = M(ω), ω ≥ ω0
performs the fundamental-frequency-based linear suppression, where ω0 is the fundamental frequency and the second case applies when ω ≥ ω0. The input in the formula above is the convolution-smoothed initial spectral coefficient M(ω); in other embodiments, the initial spectral coefficient itself, or intermediate data produced by other optimization steps, may be used as the input. The output Ma(ω) may be used directly as the spectral coefficient, or as intermediate data for further optimization processing.
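A minimal sketch of the fundamental-frequency-based linear suppression, assuming the linear ramp ω/ω0 reconstructed from the description above (the function name and the test frequencies are illustrative):

```python
import numpy as np

def suppress_below_f0(m, omega, omega0):
    """Fade in coefficients below the fundamental frequency omega0:
    Ma(w) = (w/omega0) * M(w) for w < omega0, Ma(w) = M(w) otherwise.
    The linear ramp w/omega0 is an assumed form of the suppression."""
    omega = np.asarray(omega, dtype=float)
    gain = np.where(omega < omega0, omega / omega0, 1.0)
    return gain * np.asarray(m, dtype=float)

# coefficients below the fundamental (100) are faded in linearly
ma = suppress_below_f0([2.0, 2.0, 2.0, 2.0],
                       omega=[0.0, 50.0, 100.0, 200.0], omega0=100.0)
```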
To ensure that the spectral coefficients transition smoothly in the time domain, the spectral coefficients of each frame may be further smoothed using the spectral coefficients of its adjacent frames. Specifically:
Mak(ω,n)=α·Ma(ω,n-1)+β·Ma(ω,n)+ξ·Ma(ω,n+1)
performs the adjacent-coefficient smoothing, where α, β and ξ are real numbers in [0,1] and α + β + ξ = 1. Their specific values are not limited; for example, all three may be 1/3, or they may be 0.25, 0.5 and 0.25 in order. The formula above smooths over one frame on each side; in other embodiments, two, three, or more frames on each side may be used, which this embodiment does not limit. In this embodiment, the adjacent-coefficient smoothing is the last optimization step; in other embodiments, the three (or two) optimization steps may be executed in a different order to optimize the initial spectral coefficients and obtain the spectral coefficients.
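The adjacent-coefficient smoothing can be sketched as follows; the boundary handling (edge frames reusing themselves as the missing neighbour) is an assumption not specified in the patent:

```python
import numpy as np

def smooth_over_frames(ma, alpha=0.25, beta=0.5, xi=0.25):
    """Adjacent-coefficient smoothing across frames:
    Ma_k(w, n) = alpha*Ma(w, n-1) + beta*Ma(w, n) + xi*Ma(w, n+1).
    ma has shape (frames, bins)."""
    ma = np.asarray(ma, dtype=float)
    prev = np.vstack([ma[:1], ma[:-1]])  # Ma(w, n-1), clamped at n = 0
    nxt = np.vstack([ma[1:], ma[-1:]])   # Ma(w, n+1), clamped at the end
    return alpha * prev + beta * ma + xi * nxt

# a single-frame spike (4.0) is spread over its neighbours: 1.0, 2.0, 1.0
mak = smooth_over_frames([[0.0], [4.0], [0.0]])
```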
By applying the audio tonal modification method provided by this embodiment of the present application, the spectral coefficients can be optimized, so that the timbre of the output audio subsequently obtained with these spectral coefficients is better preserved.
Based on the above embodiments, the present embodiment will explain a specific weighted tonal output processing procedure. Referring to fig. 10, fig. 10 is a schematic flowchart of an output audio obtaining process according to an embodiment of the present application, and the step S103 includes:
S1031: and multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum.
In this embodiment, Ma(ω) may be used as the spectral coefficient. Specifically, the spectral coefficient is applied to the spectrum of the frequency conversion signal, i.e., the frequency conversion spectrum distribution, so that the frequency conversion signal is corrected in weighted form based on the input audio signal. It can be understood that, since the spectral coefficient corresponds to a certain frequency conversion frame in the frequency conversion signal, the frequency conversion spectrum distribution it multiplies is likewise the spectrum corresponding to that frame. Specifically, the weighted spectrum may be obtained according to
Xps(ω,n)=Ma(ω)·Xfs(ω,n)
obtaining the weighted spectrum, where Xps(ω, n) is the weighted spectrum and Xfs(ω, n) is the frequency conversion spectrum distribution.
S1032: and carrying out time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio.
Because the weighted spectrum is still a frequency-domain signal, it must be converted back into a time-domain signal; a window function (for example, a Hanning window) is applied as a coefficient during the conversion so that the tonal modification can then be output, yielding the time-domain output audio. Specifically, the following may be used:
xps(t, n) = w(t)·IDFT{Xps(ω, n)}
to obtain the time-domain output audio, where t denotes the time-domain sequence, n denotes the frame sequence, i.e., the frame number, w(t) denotes the window function, and IDFT denotes the inverse discrete Fourier transform. The weighted spectrum corresponding to each input frame is converted according to this procedure, yielding the time-domain output audio xps(t, n) corresponding to each input frame.
S1033: and carrying out overlap-add processing on the time domain output audio to obtain the output audio.
After all the time-domain output audio frames are obtained, they need to be spliced together to obtain the output audio. Specifically, the following may be used:
xps(t)=OLA{xps(t,n)}
resulting in the output audio, where OLA denotes overlap-add processing.
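Steps S1031 to S1033 can be sketched together as follows; the frame length, hop size, and the use of NumPy's real FFT pair are illustrative assumptions, not the patent's prescribed implementation:

```python
import numpy as np

def weighted_tonal_output(Xfs_frames, Ma, frame_len, hop):
    """Weight each frequency conversion frame's spectrum by Ma(w)
    (S1031), convert it back to the time domain through a Hanning
    window (S1032), then overlap-add the frames (S1033)."""
    w = np.hanning(frame_len)
    out = np.zeros(hop * (len(Xfs_frames) - 1) + frame_len)
    for n, Xfs in enumerate(Xfs_frames):
        Xps = Ma * Xfs                              # S1031: weighted spectrum
        xps = w * np.fft.irfft(Xps, frame_len)      # S1032: windowed time-domain frame
        out[n * hop : n * hop + frame_len] += xps   # S1033: overlap-add
    return out

# two toy frames of a constant signal, unity spectral coefficients
frames = [np.fft.rfft(np.ones(8)), np.fft.rfft(np.ones(8))]
audio = weighted_tonal_output(frames, Ma=np.ones(5), frame_len=8, hop=4)
```

With a 50% hop the windowed frames overlap in the middle of the output buffer, which is what smooths the frame boundaries away.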
By applying the audio tonal modification method provided by this embodiment of the present application, when the input audio signal is divided into several input frames and the frequency conversion signal is correspondingly divided into several frequency conversion frames, a time-domain output audio is obtained for each frequency conversion frame, and these are spliced into the output audio by overlap-add processing. This allows the user to apply different degrees of tonal modification to different parts of the input audio signal, improving the flexibility of the tonal modification processing.
The following describes an audio tonal modification apparatus provided by an embodiment of the present application, and the audio tonal modification apparatus described below and the audio tonal modification method described above may be referred to correspondingly.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an audio frequency tone modifying apparatus according to an embodiment of the present application, including:
the frequency conversion module 110 is configured to perform frequency conversion processing on an input audio signal to obtain a frequency-converted signal;
a spectral coefficient generating module 120, configured to extract a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal, and generate a spectral coefficient by using the first formant envelope and the second formant envelope;
and an output audio generating module 130, configured to perform weighted tonal modification output processing on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectrum coefficient, so as to obtain an output audio.
In one embodiment, the spectral coefficient generation module 120 includes:
and the difference processing unit is used for obtaining the frequency spectrum coefficient by utilizing the difference between the first formant envelope and the second formant envelope.
In one embodiment, the spectral coefficient generation module 120 includes:
an initial spectral coefficient generating unit for generating an initial spectral coefficient using the formant envelope;
and the optimization unit is used for performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
In one embodiment, the output audio generating module 130 includes:
the weighting unit is used for multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum;
the time domain conversion processing unit is used for carrying out time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio;
and the overlap-add processing unit is used for performing overlap-add processing on the time domain output audio to obtain the output audio.
In one embodiment, the spectral coefficient generation module 120 includes:
a frequency domain generating unit, configured to perform frame division processing and frequency domain conversion processing based on a fundamental frequency on the input audio signal and the frequency-converted signal, respectively, to obtain frequency-converted frequency spectrum distribution and input frequency spectrum distribution corresponding to each frame;
the frequency spectrum smoothing processing unit is used for respectively carrying out power spectrum calculation and smoothing processing on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum;
the cepstrum processing unit is used for respectively carrying out logarithmic cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum;
and the lifting processing unit is used for performing cepstrum windowing and frequency spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by using a cepstrum lifting window respectively to obtain a first formant envelope and a second formant envelope.
In one embodiment, the frequency conversion module 110 includes:
the framing unit is used for framing the input audio signal and acquiring a tonal modification coefficient corresponding to each input frame;
the variable frequency processing unit is used for determining an execution sequence by using the tonal modification coefficient, and sequentially carrying out sampling processing and variable speed processing on the input frame according to the execution sequence to obtain a variable frequency frame;
and the splicing processing unit is used for splicing the frequency conversion frames to obtain the frequency conversion signals.
In one embodiment, a variable frequency processing unit includes:
a median determining subunit, configured to acquire a plurality of tonal modification coefficients in a current processing period and determine the median of the tonal modification coefficients;
and the sequence determining subunit is configured to determine the execution sequence corresponding to each input frame in the processing cycle according to a size relationship between the median and a preset threshold.
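The median-based selection of the execution order can be sketched as follows; the concrete mapping from the comparison result to an order, and the threshold value of 1.0 (i.e., no pitch change), are hypothetical illustrations — the patent only states that the order follows the relationship between the median and a preset threshold:

```python
import statistics

def choose_execution_order(coeffs, threshold=1.0):
    """Pick the sampling/speed-change execution order for a processing
    period from the median tonal modification coefficient."""
    med = statistics.median(coeffs)
    if med >= threshold:
        return "resample_then_speed_change"
    return "speed_change_then_resample"
```

For example, a period whose coefficients are mostly above 1 would take one order, and a period mostly below 1 the other.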
In one embodiment, a stitching processing unit includes:
the initial frequency conversion signal acquisition subunit is used for splicing the frequency conversion frames to obtain an initial frequency conversion signal;
and the smoothing processing subunit is used for smoothing the initial frequency conversion signal by using a gradual-in and gradual-out weighting window to obtain the frequency conversion signal.
The following describes a computer-readable storage medium provided in an embodiment of the present application, and the computer-readable storage medium described below and the audio tonal modification method described above may be referred to correspondingly.
The present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the audio tonal modification method described above.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprise", "include", or any variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. An audio tonal modification method, comprising:
carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal;
extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating a spectral coefficient by using the first formant envelope and the second formant envelope;
and carrying out weighted tonal modification output processing on the frequency conversion frequency spectrum distribution of the frequency conversion signal by using the frequency spectrum coefficient to obtain an output audio.
2. The audio pitch modification method of claim 1, wherein the generating spectral coefficients using the first formant envelope and the second formant envelope comprises:
and obtaining the spectral coefficient by using the difference between the first formant envelope and the second formant envelope.
3. The audio pitch modification method of claim 1, wherein the generating spectral coefficients using the first formant envelope and the second formant envelope comprises:
generating initial spectral coefficients using the first and second formant envelopes;
and performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
4. The audio frequency tonal modification method of claim 1, wherein the performing a weighted tonal modification output process on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectrum coefficient to obtain an output audio frequency comprises:
multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum;
performing time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio;
and carrying out overlap-add processing on the time domain output audio to obtain the output audio.
5. The audio pitch shifting method of claim 1, wherein said extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency-converted signal comprises:
respectively carrying out frame processing and frequency domain conversion processing based on fundamental frequency on the input audio signal and the frequency conversion signal to obtain frequency conversion spectrum distribution and input spectrum distribution corresponding to each frame;
respectively carrying out power spectrum calculation and smoothing processing on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum;
respectively carrying out cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum;
and respectively carrying out cepstrum windowing and frequency spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by utilizing a cepstrum lifting window to obtain the first formant envelope and the second formant envelope.
6. The audio tonal modification method according to any of claims 1 to 5, wherein the frequency conversion processing the input audio signal to obtain a frequency-converted signal comprises:
performing frame division processing on the input audio signal, and acquiring a tonal modification coefficient corresponding to each input frame;
determining an execution sequence by using the pitch-changing coefficient, and sequentially carrying out sampling processing and speed changing processing on the input frame according to the execution sequence to obtain a frequency-changing frame;
and splicing the frequency conversion frames to obtain the frequency conversion signal.
7. The audio transposition method of claim 6, wherein the determining the execution order using the transposition coefficients comprises:
acquiring a plurality of tonal modification coefficients in a current processing period, and determining a median of each tonal modification coefficient;
and determining the execution sequence corresponding to each input frame in the processing period according to the size relation between the median and a preset threshold.
8. The audio pitch shifting method of claim 6, wherein the splicing the frequency-converted frames to obtain the frequency-converted signal comprises:
splicing the frequency conversion frames to obtain an initial frequency conversion signal;
and smoothing the initial frequency conversion signal by using a gradually-in and gradually-out weighting window to obtain the frequency conversion signal.
9. An audio tonal modification device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor for executing the computer program to implement the audio tonal modification method as claimed in any of claims 1 to 8.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the audio tonal modification method as claimed in any of claims 1 to 8.
CN202110083776.4A 2021-01-21 2021-01-21 Audio tone changing method, device, equipment and storage medium Pending CN112908351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110083776.4A CN112908351A (en) 2021-01-21 2021-01-21 Audio tone changing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112908351A true CN112908351A (en) 2021-06-04

Family

ID=76118172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110083776.4A Pending CN112908351A (en) 2021-01-21 2021-01-21 Audio tone changing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112908351A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782050A (en) * 2021-09-08 2021-12-10 浙江大华技术股份有限公司 Sound tone changing method, electronic device and storage medium
CN114067784A (en) * 2021-11-24 2022-02-18 云知声智能科技股份有限公司 Training method and device of fundamental frequency extraction model and fundamental frequency extraction method and device
CN114121029A (en) * 2021-12-23 2022-03-01 北京达佳互联信息技术有限公司 Training method and device of speech enhancement model and speech enhancement method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101354889A (en) * 2008-09-18 2009-01-28 北京中星微电子有限公司 Method and apparatus for tonal modification of voice
US20130044889A1 (en) * 2011-08-15 2013-02-21 Oticon A/S Control of output modulation in a hearing instrument
CN109410973A (en) * 2018-11-07 2019-03-01 北京达佳互联信息技术有限公司 Voice change process method, apparatus and computer readable storage medium
CN111383646A (en) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 Voice signal transformation method, device, equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
徐欣; 李枚亭: "Research on voice conversion based on the spectral envelope algorithm", Digital Technology and Application (数字技术与应用), no. 09 *
潘涛 et al.: "Research and implementation of formant extraction from speech signals based on different algorithms", Gansu Science and Technology (甘肃科技), vol. 35, no. 22, 30 November 2019 (2019-11-30) *
赵力: "Speech Signal Processing" (《语音信号处理》), 30 April 2003, China Machine Press (机械工业出版社) *


Similar Documents

Publication Publication Date Title
US9294060B2 (en) Bandwidth extender
CN112908351A (en) Audio tone changing method, device, equipment and storage medium
Le Roux et al. Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction.
JP4945586B2 (en) Signal band expander
JP2018510374A (en) Apparatus and method for processing an audio signal to obtain a processed audio signal using a target time domain envelope
CN111508508A (en) Super-resolution audio generation method and equipment
US20140019125A1 (en) Low band bandwidth extended
US20230343348A1 (en) Machine-Learned Differentiable Digital Signal Processing
Marafioti et al. Audio inpainting of music by means of neural networks
JP6821970B2 (en) Speech synthesizer and speech synthesizer
CN111739544A (en) Voice processing method and device, electronic equipment and storage medium
CN113241082A (en) Sound changing method, device, equipment and medium
CN113421584B (en) Audio noise reduction method, device, computer equipment and storage medium
WO2023224550A1 (en) Method and system for real-time and low latency synthesis of audio using neural networks and differentiable digital signal processors
JP7103390B2 (en) Acoustic signal generation method, acoustic signal generator and program
JP2019074580A (en) Speech recognition method, apparatus and program
JP4645869B2 (en) DIGITAL SIGNAL PROCESSING METHOD, LEARNING METHOD, DEVICE THEREOF, AND PROGRAM STORAGE MEDIUM
CN113113033A (en) Audio processing method and device and readable storage medium
Hanna et al. Time scale modification of noises using a spectral and statistical model
Samui et al. FPGA implementation of a phase-aware single-channel speech enhancement system
Zivanovic Harmonic bandwidth companding for separation of overlapping harmonics in pitched signals
US20240161762A1 (en) Full-band audio signal reconstruction enabled by output from a machine learning model
EP4018440B1 (en) Multi-lag format for audio coding
JP4419486B2 (en) Speech analysis generation apparatus and program
RU2825309C2 (en) Multiple-delay audio encoding format

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination