CN112908351A - Audio tone changing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN112908351A
CN112908351A (application number CN202110083776.4A)
Authority
CN
China
Prior art keywords
frequency
audio
signal
processing
frequency conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110083776.4A
Other languages
Chinese (zh)
Inventor
张超鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority claimed from application CN202110083776.4A
Publication of CN112908351A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L21/007 — Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 — Adapting to target pitch

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The application discloses an audio tone changing method, an audio tone changing apparatus, an audio tone changing device, and a computer-readable storage medium, wherein the method includes: carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal; extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal, and generating spectral coefficients using the two envelopes; and carrying out weighted tone-changing output processing on the spectral distribution of the frequency conversion signal using the spectral coefficients to obtain the output audio. Because the spectral coefficients weight the output, the frequency conversion signal is corrected on the basis of the input audio signal, so the obtained output audio keeps a timbre consistent with that of the input audio signal, avoiding the poor timbre retention and degraded audio quality that otherwise follow tone changing.

Description

Audio tone changing method, device, equipment and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio tone modification method, an audio tone modification apparatus, an audio tone modification device, and a computer-readable storage medium.
Background
Audio generally refers to sound waves audible to the human ear, with frequencies between 20 Hz and 20 kHz. For some audio (music, songs, etc.), the pitch may be turned up or down as needed to better suit listening preferences. In the related art, after such tone changing, the obtained audio has a poorer timbre than the input audio; that is, the timbre retention is poor.
Disclosure of Invention
In view of the above, an object of the present application is to provide an audio tone modifying method, an audio tone modifying apparatus, an audio tone modifying device, and a computer-readable storage medium, which correct the frequency conversion signal on the basis of the input audio signal, so that the obtained output audio can maintain a timbre consistent with that of the input audio signal, preventing the timbre of the output audio from being degraded.
In order to solve the above technical problem, in a first aspect, the present application provides an audio tonal modification method, including:
carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal;
extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating a spectral coefficient by using the first formant envelope and the second formant envelope;
and carrying out weighted tonal modification output processing on the frequency conversion frequency spectrum distribution of the frequency conversion signal by using the frequency spectrum coefficient to obtain an output audio.
Optionally, the generating spectral coefficients using the first formant envelope and the second formant envelope includes:
and obtaining the spectral coefficient by using the difference between the first formant envelope and the second formant envelope.
Optionally, the generating spectral coefficients using the first formant envelope and the second formant envelope includes:
generating initial spectral coefficients using the formant envelopes;
and performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
Optionally, the performing, by using the spectral coefficient, weighted tonal modification output processing on the frequency conversion spectral distribution of the frequency conversion signal to obtain an output audio includes:
multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum;
performing time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio;
and carrying out overlap-add processing on the time domain output audio to obtain the output audio.
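The optional steps above (weighting the frequency conversion spectrum, window-based time-domain conversion, overlap-add) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the hop size, the periodic Hann synthesis window, and the function names are all assumptions.

```python
import numpy as np

def weighted_transform_overlap_add(specs, coeffs, hop):
    """Weight each frame spectrum by its spectral coefficients, convert
    it back to the time domain through a synthesis window, and
    overlap-add the windowed frames into the output signal."""
    n_fft = 2 * (specs.shape[1] - 1)
    # periodic Hann window: with hop = n_fft // 2 it satisfies the
    # constant-overlap-add condition (adjacent windows sum to 1)
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    out = np.zeros(hop * (len(specs) - 1) + n_fft)
    for i, spec in enumerate(specs):
        frame = np.fft.irfft(spec * coeffs[i]) * win  # weighted spectrum -> windowed frame
        out[i * hop:i * hop + n_fft] += frame         # overlap-add
    return out
```

With all coefficients equal to 1, the interior of the output reproduces the framed input exactly, which is a convenient sanity check on the window and hop choice.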
Optionally, the extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal includes:
respectively carrying out frame processing and frequency domain conversion processing based on fundamental frequency on the input audio signal and the frequency conversion signal to obtain frequency conversion spectrum distribution and input spectrum distribution corresponding to each frame;
respectively carrying out power spectrum calculation and smoothing processing on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum;
respectively carrying out cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum;
and respectively carrying out cepstrum windowing and frequency spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by utilizing a cepstrum lifting window to obtain the first formant envelope and the second formant envelope.
Optionally, the frequency conversion processing the input audio signal to obtain a frequency-converted signal includes:
performing frame division processing on the input audio signal, and acquiring a tonal modification coefficient corresponding to each input frame;
determining an execution sequence by using the pitch-changing coefficient, and sequentially carrying out sampling processing and speed changing processing on the input frame according to the execution sequence to obtain a frequency-changing frame;
and splicing the frequency conversion frames to obtain the frequency conversion signal.
Optionally, the determining an execution order by using the pitch change coefficients includes:
acquiring a plurality of tonal modification coefficients in a current processing period, and determining the median of the tonal modification coefficients;
and determining the execution sequence corresponding to each input frame in the processing period according to the magnitude relationship between the median and a preset threshold.
Optionally, the splicing the frequency conversion frames to obtain the frequency conversion signal includes:
splicing the frequency conversion frames to obtain an initial frequency conversion signal;
and smoothing the initial frequency conversion signal by using a fade-in/fade-out weighting window to obtain the frequency conversion signal.
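The splicing-and-smoothing step above can be sketched as follows. This is a minimal illustration only: the linear fade shape and the function name are assumptions, since the description specifies only a fade-in/fade-out weighting window.

```python
import numpy as np

def splice_with_fade(frames, overlap):
    """Join consecutive frequency-converted frames, smoothing each seam
    with a fade-out/fade-in weighting window over `overlap` samples.
    The linear fade shape is an illustrative assumption."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.asarray(frames[0], dtype=float)
    for nxt in frames[1:]:
        nxt = np.asarray(nxt, dtype=float)
        # weighted cross-fade across the seam between the two frames
        seam = out[-overlap:] * (1.0 - fade_in) + nxt[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], seam, nxt[overlap:]])
    return out
```

Because the fade weights sum to one at every sample, a constant signal passes through the seams unchanged, which is the point of the smoothing.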
In a second aspect, the present application provides an audio tonal modification apparatus comprising:
the frequency conversion module is used for carrying out frequency conversion processing on the input audio signal to obtain a frequency conversion signal;
the frequency spectrum coefficient generating module is used for respectively extracting formant envelopes corresponding to the input audio signal and the variable frequency signal and generating frequency spectrum coefficients by utilizing the formant envelopes;
and the output audio generation module is used for performing weighted tonal modification output processing on the frequency conversion frequency spectrum distribution of the frequency conversion signal by using the frequency spectrum coefficient to obtain output audio.
In a third aspect, the present application provides an audio tonal modification apparatus comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the audio tonal modification method.
In a fourth aspect, the present application provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the audio transposition method described above.
According to the audio tone changing method, frequency conversion processing is carried out on an input audio signal to obtain a frequency conversion signal; a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal are extracted, and spectral coefficients are generated using the two envelopes; and weighted tone-changing output processing is carried out on the spectral distribution of the frequency conversion signal using the spectral coefficients to obtain the output audio.
Therefore, after the frequency conversion signal is obtained by carrying out frequency conversion processing on the input audio signal, the method extracts the formant envelopes of the input audio signal and the frequency conversion signal. Formants are regions of the sound spectrum where energy is relatively concentrated; they are a determining factor of timbre, and the formant envelope is information that characterizes the formants. The formant envelopes are used to generate the spectral coefficients, and the spectral coefficients are used for weighted tone-changing output; that is, in a weighted manner, the frequency conversion signal is corrected on the basis of the input audio signal, so that the obtained output audio keeps a timbre consistent with that of the input audio signal, avoiding the situation in which the timbre cannot be preserved and the audio quality is poor after tone changing, and thus solving the problems of poor audio quality and poor timbre retention after tone changing.
In addition, the application also provides an audio frequency tone changing device, an audio frequency tone changing device and a computer readable storage medium, and the beneficial effects are also achieved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic diagram of a hardware component framework for an audio pitch modification method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a hardware component framework for another audio pitch modification method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an audio pitch modification method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a variable frequency signal obtaining process according to an embodiment of the present disclosure;
fig. 5 is a logic diagram of a variable frequency signal acquisition process according to an embodiment of the present application;
fig. 6 is a schematic flow chart of a formant envelope extraction process provided in the embodiments of the present application;
fig. 7 is a graph illustrating a signal curve and a corresponding formant envelope curve according to an embodiment of the present disclosure;
FIG. 8 is a schematic illustration of a cepstral function curve and a cepstral curve provided in an embodiment of the present application;
fig. 9 is a schematic flowchart of a spectral coefficient obtaining process according to an embodiment of the present application;
fig. 10 is a flowchart illustrating an output audio obtaining process according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio pitch-changing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Currently, when a user changes the pitch of audio (e.g., voice, pure music, songs), tools such as the WORLD vocoder or the TSM Toolbox are commonly used. WORLD performs tone changing by analyzing and re-synthesizing the audio. However, the quality of the signal obtained after tone changing is strongly affected by the fundamental-frequency parameters, the spectral envelope, and the aperiodicity parameters, and the timbre retention is poor. The TSM Toolbox uses HPS (Harmonic-Percussive Separation) and applies time-scale modification separately to the percussive and harmonic components (WSOLA, an overlap-add technique based on waveform similarity, for the former; PV, the phase vocoder, for the latter), then uses a resampling module to realize frequency conversion; it extracts the formant envelope of the frequency conversion signal by multiple cepstral iterations and uses that envelope to obtain the output audio. The whole process is computationally expensive, the timbre of the output audio is poor, and the timbre retention is weak. In the audio tone changing method provided by the embodiments of the present application, after the input audio signal is subjected to frequency conversion processing to obtain the frequency conversion signal, the formant envelopes of both the input audio signal and the frequency conversion signal are extracted. Formants are regions of the sound spectrum where energy is relatively concentrated; they are a determining factor of timbre, and the formant envelope is information that characterizes the formants.
The formant envelopes are used to generate spectral coefficients, and the spectral coefficients are used for weighted tone-changing output; that is, in a weighted manner, the frequency conversion signal is corrected on the basis of the input audio signal, so that the obtained output audio keeps a timbre consistent with that of the input audio signal, avoiding the situation in which the timbre cannot be preserved and the audio quality is poor after tone changing.
For ease of understanding, the hardware composition framework used by the scheme corresponding to the audio tonal modification method provided in the embodiments of the present application is described first. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition framework for an audio pitch modification method according to an embodiment of the present disclosure. The audio tonal modification device 100 may comprise a processor 101 and a memory 102, and may further comprise one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
Wherein, the processor 101 is configured to control the overall operation of the audio tonal modification apparatus 100 to complete all or part of the steps in the audio tonal modification method; the memory 102 is used to store various types of data to support the operation at the audio tonal modification device 100, which data may include, for example, instructions for any application or method operating on the audio tonal modification device 100, as well as application related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk. In the present embodiment, the memory 102 stores therein at least programs and/or data for realizing the following functions:
carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal;
extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating a spectral coefficient by using the first formant envelope and the second formant envelope;
and carrying out weighting modulation output processing on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectrum coefficient to obtain an output audio.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, mouse, or buttons; these buttons may be virtual or physical. The communication component 105 is used for wired or wireless communication between the audio tonal modification device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
The audio pitch changing apparatus 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the audio pitch changing method.
Of course, the structure of the audio tonal modification device 100 shown in fig. 1 does not constitute a limitation of the audio tonal modification device in the embodiments of the present application; in practical applications, the audio tonal modification device 100 may comprise more or fewer components than those shown in fig. 1, or combine certain components.
The audio tonal modification apparatus 100 in fig. 1 may be a terminal (e.g., a mobile terminal such as a mobile phone or tablet computer, or a fixed terminal such as a PC) or a server. In a specific embodiment, the audio transposition device 100 can utilize the communication component 105 to receive audio transmitted by other devices or terminals through a network; in another specific embodiment, the audio transposition device 100 can acquire input audio using the multimedia component 103; in yet another embodiment, the audio transposition device 100 can retrieve the input audio from the memory 102.
It is to be understood that, in the embodiment of the present application, the number of the audio tonal modification apparatuses is not limited, and it may be that a plurality of audio tonal modification apparatuses cooperate together to complete the audio tonal modification method. In a possible implementation manner, please refer to fig. 2, and fig. 2 is a schematic diagram of a hardware composition framework to which another audio pitch modification method provided in the embodiment of the present application is applied. As can be seen from fig. 2, the hardware composition framework may include: the first audio transposition device 11 and the second audio transposition device 12 are connected by a network 13.
In the embodiment of the present application, the hardware structures of the first audio transposition device 11 and the second audio transposition device 12 may refer to the audio transposition device 100 in fig. 1. That is, it can be understood that there are two audio tonal modification devices 100 in the present embodiment, and the two devices perform data interaction, so as to achieve the effect of tonal modification of audio. Further, in this embodiment of the application, the form of the network 13 is not limited, that is, the network 13 may be a wireless network (e.g., WIFI, bluetooth, etc.), or may be a wired network.
The first audio tonal modification device 11 and the second audio tonal modification device 12 may be the same kind of electronic device; for example, both may be servers. They may also be different kinds of electronic devices; for example, the first audio tonal modification device 11 may be a terminal or an intelligent electronic device, and the second audio tonal modification device 12 may be a server. In a possible embodiment, a server with high computing power can be used as the second audio tonal modification device 12 to improve data processing efficiency and reliability, and thus the processing efficiency of audio tonal modification, while a low-cost, widely applicable terminal or intelligent electronic device is used as the first audio tonal modification device 11 to realize the interaction between the second audio tonal modification device 12 and the user. It can be understood that tonal modification of an input audio signal necessarily begins with obtaining that signal, so the interaction process may be as follows: after acquiring the input audio signal, the terminal sends it to the server, and the server, upon receiving it, processes it to obtain the output audio. In another embodiment, the user may set the relevant parameters of the frequency conversion processing at the terminal, in which case the interaction process may be: after the terminal frequency-converts the input audio signal using those parameters, it sends the frequency conversion signal and the input audio signal to the server so that the server can execute the remaining steps.
Based on the above description, please refer to fig. 3, and fig. 3 is a flowchart illustrating an audio pitch modification method according to an embodiment of the present application. The method in this embodiment comprises:
s101: and carrying out frequency conversion processing on the input audio signal to obtain a frequency conversion signal.
The input audio signal may be audio in any format, such as WAV, MP3, or MPEG-4, which is not limited in this embodiment. A single input audio signal may be processed at a time, i.e., only one input audio signal undergoes frequency conversion processing at any moment, and a new input audio signal is acquired only after the frequency conversion signal of the current one has been obtained. In another embodiment, a plurality of input audio signals may be acquired and frequency-converted simultaneously. For example, a plurality of parallel channels may be provided, each performing frequency conversion processing on one input audio signal; when the input audio signal in a channel has been processed, a new input audio signal may be acquired and frequency-converted in that channel until no input audio signal remains. The channels are independent and do not influence each other. After the input audio signal is obtained, the frequency conversion signal may be obtained by directly performing frequency conversion processing on it; alternatively, any necessary pre-processing may be performed first, with the frequency conversion processing then applied to the processed input audio signal to obtain the corresponding frequency conversion signal.
Since the input audio signal is to be frequency-converted, it must first be acquired. This embodiment does not limit the specific acquisition manner; for example, the input audio signal may be captured by an audio acquisition device, which may be a microphone on an intelligent terminal. In another specific embodiment, a specified piece of audio may be read from a storage medium and used as the input audio signal; the storage medium may be local, or a portable medium such as a USB flash drive or a removable hard disk. In yet another embodiment, the input audio signal may be transmitted, in a wired or wireless manner, by another device or terminal, such as a smartphone or a server.
This embodiment does not limit the specific manner of the frequency conversion processing. It can be understood that frequency conversion compresses or expands the signal in the time domain, so in one specific implementation the frequency conversion processing may perform variable-speed processing (i.e., time-domain compression or expansion) first and resampling afterwards; in another, resampling may be performed first, followed by variable-speed processing. Further, this embodiment does not limit the specific sampling manner of the resampling: it may be up-sampling or down-sampling, selected according to the requirements of the frequency conversion processing. Nor does this embodiment limit the number of frequency conversion passes or their processing range. The number of passes counts the total processing applied to the entire input audio signal, and the processing range specifies which part of the input audio signal is to be frequency-converted; the two are correlated, since the larger the range of each pass, the fewer the passes, and the smaller the range, the more passes. In this embodiment, the range of each pass may be referred to as a frame: when the input audio signal is processed, it may first be divided into frames, with frequency conversion processing applied to each frame.
Further, in a possible implementation, the processing manner may differ from frame to frame, i.e., different frequency conversion processing may be applied to different frames, so that each part of the input audio signal is frequency-converted as actually needed; finally, the frames are spliced to obtain the corresponding frequency conversion signal.
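As a concrete illustration of the two-step frequency conversion described above (variable-speed processing plus resampling), the following sketch uses a deliberately naive overlap-add time stretch and a linear-interpolation resampler. A real implementation would use WSOLA-style similarity search and a band-limited resampler; the frame length, hop sizes, and function names here are all assumptions.

```python
import numpy as np

def time_stretch_ola(x, rate, frame_len=64, hop_out=32):
    """Naive overlap-add time stretch: read frames at hop_out * rate,
    write them at hop_out (rate > 1 shortens, rate < 1 lengthens)."""
    hop_in = int(round(hop_out * rate))
    win = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // hop_in + 1
    out = np.zeros(hop_out * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop_out:i * hop_out + frame_len] += x[i * hop_in:i * hop_in + frame_len] * win
        norm[i * hop_out:i * hop_out + frame_len] += win
    return out / np.maximum(norm, 1e-8)  # normalize by the summed window

def resample_linear(x, ratio):
    """Linear-interpolation resampler: ratio > 1 shortens the signal."""
    idx = np.arange(0.0, len(x) - 1.0, ratio)
    lo = idx.astype(int)
    frac = idx - lo
    return x[lo] * (1.0 - frac) + x[lo + 1] * frac

def pitch_shift(x, factor):
    """Raise (factor > 1) or lower (factor < 1) the pitch: slow the
    signal down by `factor` first, then resample back toward the
    original duration."""
    return resample_linear(time_stretch_ola(x, 1.0 / factor), factor)
```

Swapping the order of the two steps, as the text notes, is equally valid: resampling first changes both pitch and duration, and the variable-speed step then restores the duration.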
S102: and extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating spectral coefficients by using the first formant envelope and the second formant envelope.
Formants are regions of the sound spectrum where energy is relatively concentrated; they are a determining factor of timbre, and the formant envelope is information that characterizes the formants. It should be noted that the formant envelope in this embodiment is not the envelope curve itself but the curve information corresponding to it; specifically, the formant envelope may take the form of a log power spectrum. After the input signal and the frequency conversion signal are obtained, their formant envelopes are extracted respectively, i.e., each of the two signals in turn serves as the signal from which a formant envelope is extracted, yielding the first formant envelope corresponding to the input audio signal and the second formant envelope corresponding to the frequency conversion signal. This embodiment does not limit the specific extraction manner; for example, the formant envelope may be extracted by multiple cepstral iterations, or in other ways. After the first formant envelope and the second formant envelope are obtained, they are used to generate the spectral coefficients.
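A single-pass version of the cepstral envelope extraction mentioned above can be sketched as follows. The patent describes multi-iteration cepstral extraction as one option; this one-shot lifter, and the lifting-window length, are simplifying assumptions for illustration.

```python
import numpy as np

def formant_envelope(frame, lifter_cut=20):
    """Cepstral-liftering estimate of the formant envelope, returned as
    a log-power spectrum. `lifter_cut` is an assumed cepstral lifting
    window length; larger values follow the spectrum more closely."""
    n = len(frame)
    spec = np.fft.rfft(frame * np.hanning(n))
    log_power = np.log(np.abs(spec) ** 2 + 1e-12)  # log power spectrum
    cep = np.fft.irfft(log_power)                  # real cepstrum
    cep[lifter_cut:n - lifter_cut] = 0.0           # keep low quefrencies only (the envelope)
    return np.fft.rfft(cep).real                   # spectrum recovery: smoothed log envelope
```

Applying this once to each frame of the input audio signal and once to each frame of the frequency conversion signal yields the first and second formant envelopes discussed here.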
The spectral coefficients reflect the difference between the input audio signal and the frequency conversion signal. Because they are obtained from the formant envelopes, they reflect the envelope difference between the two signals, and the envelope difference is the timbre difference. By using the spectral coefficients in generating the output audio, the frequency conversion signal can be corrected based on the input audio signal, so that the output audio keeps a timbre consistent with the input audio signal, avoiding the situation in which the timbre cannot be kept and the audio quality is poor after pitch shifting. This embodiment does not limit the number of formant envelopes and spectral coefficients corresponding to the input audio signal and the frequency conversion signal. In a specific implementation, the input audio signal and the frequency conversion signal may each correspond to one formant envelope, in which case there is also one spectral coefficient. For example, when the input audio signal is short, the corresponding first formant envelope and the second formant envelope of the frequency conversion signal can be extracted directly, and the two envelopes used to obtain the corresponding spectral coefficient. In another specific embodiment, in order to shift the pitch of different portions of the input audio signal by different amounts, the formant envelope of each frame of the input audio signal may be extracted separately, so that there are a plurality of first formant envelopes, each corresponding to a different frame of the input audio signal.
Accordingly, when the second formant envelope corresponding to the frequency-converted signal is extracted, the second formant envelope may be extracted for each frame of the frequency-converted signal. It can be understood that the frequency conversion signal and the input audio signal should adopt the same framing manner, so that the first formant envelope of each frame of the input audio signal and the second formant envelope of each frame of the frequency conversion signal can be in one-to-one correspondence, and the corresponding first formant envelope and the second formant envelope can be used to obtain the correctly corresponding spectral coefficients.
In one embodiment, coefficients obtained by directly using the first formant envelope and the second formant envelope may be determined as spectral coefficients. In another embodiment, the directly obtained coefficients may be optimized, and the corresponding spectral coefficients are obtained after the optimization. The specific way of the optimization process is not limited, and for example, the optimization process may be a convolution smoothing process, or may be a linear suppression process, or may be an adjacent coefficient smoothing process.
S103: performing weighted tonal modification output processing on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectral coefficients to obtain output audio.
After the spectral coefficients are obtained, they may be used to perform weighted tonal modification output processing on the frequency conversion spectrum distribution of the frequency conversion signal. The frequency conversion spectrum distribution is the distribution of the frequency conversion signal in the frequency domain. Through the weighted tonal modification output, the frequency conversion spectrum distribution is weighted and adjusted in the frequency domain, i.e., the frequency conversion signal is corrected based on the input audio signal by means of weighting; after the correction is finished, the result is converted from the frequency domain to the time domain to finish the tonal modification output. It should be noted that, corresponding to step S102, if there are a plurality of spectral coefficients, the frequency spectrum corresponding to each frame of the frequency conversion signal is subjected to weighted tonal modification output processing using the respective spectral coefficient, and signal reconstruction is performed after the processing, so that the corresponding output audio is obtained. This embodiment does not limit the specific signal reconstruction method; for example, overlap-add may be used to obtain the output audio.
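As a minimal NumPy sketch of this weighted output step (the Hanning synthesis window, half-window hop, and the interpretation of the spectral coefficients as a log-power gain are assumptions, not details fixed by the text):

```python
import numpy as np

def weighted_pitch_shift_output(frame_specs, masks, frame_len, hop):
    """Weight each frame's frequency-converted spectrum by its spectral
    coefficients, return to the time domain, and rebuild the output
    audio by overlap-add."""
    out = np.zeros(hop * (len(frame_specs) - 1) + frame_len)
    window = np.hanning(frame_len)  # synthesis window (assumed)
    for i, (spec, mask) in enumerate(zip(frame_specs, masks)):
        # Mask(w) is a log-power difference, so the amplitude gain is exp(mask/2)
        weighted = spec * np.exp(mask / 2.0)
        frame = np.fft.irfft(weighted, n=frame_len)
        out[i * hop:i * hop + frame_len] += frame * window
    return out
```

With all-zero masks the gain is 1 everywhere, so the routine reduces to plain windowed overlap-add of the frequency conversion frames.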
By applying the audio pitch-shifting method provided by the embodiment of the application, after the input audio signal is frequency-converted to obtain the frequency conversion signal, the formant envelopes of the input audio signal and of the frequency conversion signal are extracted. Formants are regions where energy is relatively concentrated in the frequency spectrum of a sound and are a determining factor of timbre; the formant envelope is information representing the formants. The formant envelopes are used to generate the spectral coefficients, and the spectral coefficients are used for the weighted tonal modification output, i.e., the frequency conversion signal is corrected based on the input audio signal by means of weighting. The output audio therefore keeps a timbre consistent with the input audio signal, avoiding the situation in which the timbre cannot be kept and the audio quality is poor after pitch shifting, and solving the problems of poor audio quality and poor timbre preservation after pitch shifting.
In a specific implementation manner, an embodiment of the present application provides a specific frequency conversion processing procedure. Referring to fig. 4, fig. 4 is a schematic flow chart of a variable frequency signal acquiring process according to an embodiment of the present application, where the schematic flow chart includes:
S1011: the input audio signal is subjected to framing processing, and a pitch change coefficient corresponding to each input frame is acquired.
In this embodiment, the input audio signal may be subjected to framing processing to obtain a plurality of input frames, and each input frame may be subjected to frequency conversion processing of a different degree. The specific manner of the framing processing is not limited: for example, the length of each frame may be preset and the input audio signal framed according to that length, or the number of frames corresponding to the input audio signal may be preset and the signal framed according to that number. The degree of frequency conversion processing to be performed on each input frame can be represented by a pitch change coefficient, which may be denoted by γ. The pitch change coefficient may be input manually by a user, transmitted by another device or terminal, or pre-stored locally. The pitch change coefficient is compared with a preset threshold value to determine how to shift the pitch, i.e., whether to perform pitch-raising processing or pitch-lowering processing; this embodiment does not limit the specific size of the preset threshold, which may be 1, for example. When the pitch change coefficient is greater than 1, it is determined that pitch-raising processing needs to be performed on the input frame; when the pitch change coefficient is less than 1, it is determined that pitch-lowering processing needs to be performed. In another embodiment, the magnitude of the pitch change coefficient may also be related to the degree of pitch raising or lowering; for example, the range of the pitch change coefficient may be limited to [0, 2], and within this range, the further the coefficient is above 1, the greater the degree of pitch raising; correspondingly, the further it is below 1, the greater the degree of pitch lowering.
S1012: and determining an execution sequence by using the pitch-changing coefficient, and sequentially carrying out sampling processing and speed-changing processing on the input frame according to the execution sequence to obtain a variable frequency frame.
In the present embodiment, the frequency conversion processing of an input frame is realized by sampling processing and variable speed processing: the sampling processing resamples the input frame, and the variable speed processing compresses or expands it in the time domain. In order to ensure that the sound quality of the frequency conversion signal is maintained, different frequency conversion processing orders may be provided for different pitch-shift directions. Specifically, after the pitch change coefficient is obtained, an execution sequence can be determined from it, and the sampling processing and the variable speed processing are performed on the input frame in that sequence to obtain the corresponding frequency conversion frame. For example, when the pitch change coefficient is smaller than 1, pitch-lowering processing is needed: the input frame may first be up-sampled and then time-domain compressed (i.e., variable speed processing) to obtain the corresponding frequency conversion frame. When the pitch change coefficient is larger than 1, pitch-raising processing is needed: the input frame is first subjected to time domain expansion processing and then down-sampled after the expansion to obtain the corresponding frequency conversion frame.
Referring to fig. 5, fig. 5 is a logic diagram of a frequency conversion signal obtaining process according to an embodiment of the present disclosure. Case (a) applies when the pitch change coefficient is less than 1: first, the signal x_in(n) corresponding to the input frame (or the entire input audio signal) is sampled (i.e., resampled) to obtain an intermediate signal x_r(k, l); the intermediate signal is then subjected to variable speed processing (i.e., PV/PSOLA, Phase Vocoder / Pitch Synchronous Overlap-Add processing) to obtain the frequency conversion frame (or the whole frequency conversion signal) x_fs(n). Case (b) applies when the pitch change coefficient is greater than 1: first, the signal corresponding to the input audio signal is subjected to variable speed processing to obtain an intermediate signal x_ts(k, l), and the intermediate signal is then sampled to obtain the frequency conversion frame x_fs(n). Here n denotes the frame number.
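The two orderings can be sketched as follows. The linear-interpolation resampler and the trivial time-scaling stand-in are placeholders (a real implementation would use a phase vocoder or PSOLA for the variable speed step), so only the ordering logic reflects the text:

```python
import numpy as np

def resample(x, ratio):
    # Linear-interpolation resampling: ratio > 1 produces more samples.
    n_out = int(round(len(x) * ratio))
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def time_scale(x, ratio):
    # Placeholder for PV/PSOLA: a real variable speed step would preserve
    # pitch while changing duration; here we only change the length.
    return resample(x, ratio)

def pitch_shift_frame(frame, gamma):
    """gamma < 1 (pitch down): up-sample first, then time-domain compression.
       gamma > 1 (pitch up): time-domain expansion first, then down-sampling."""
    if gamma < 1:
        mid = resample(frame, 1.0 / gamma)   # sampling processing first
        return time_scale(mid, gamma)        # then compression to original length
    mid = time_scale(frame, 1.0 / gamma)     # time domain expansion first
    return resample(mid, gamma)              # then down-sampling
```

Either ordering returns a frame of (approximately) the original length, which is what allows the frequency conversion frames to be spliced back together in S1013.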
S1013: and splicing the frequency conversion frames to obtain frequency conversion signals.
After all the frequency conversion frames corresponding to the input audio signal are obtained, all the frequency conversion frames are spliced to obtain the frequency conversion signal. The embodiment does not limit the specific splicing manner, for example, each frequency conversion frame may be directly subjected to time domain splicing, or may be subjected to certain processing after being directly spliced, so as to obtain a frequency conversion signal.
By applying the audio pitch-shifting method provided by the embodiment of the application, different parts of the input audio signal can be pitch-shifted to different degrees by framing the input audio signal and setting a pitch change coefficient for each input frame. Meanwhile, a suitable frequency conversion order can be selected for each part according to its pitch change coefficient, so as to maintain the sound quality to the maximum extent and thus prepare for preserving the timbre after the subsequent pitch shift.
Based on the above embodiment, in a possible implementation, the pitch change coefficients of adjacent input frames may lie in different intervals, i.e., switch back and forth between states greater than 1 and less than 1, which may cause an audible artifact, specifically a "click" at the switching point of two frequency conversion frames. In order to solve this problem, step S1013 may include:
step 11: and splicing the frequency conversion frames to obtain an initial frequency conversion signal.
The specific splicing manner is not limited in this embodiment, and for example, the splicing may be performed directly, that is, a plurality of frequency conversion frames are connected according to the sequence of the corresponding input frames, so that the initial frequency conversion signal may be obtained.
Step 12: smoothing the initial frequency conversion signal by using a fade-in/fade-out weighting window to obtain the frequency conversion signal.
The fade-in/fade-out weighting window performs fade-in and fade-out processing on the signal, avoiding the "click" caused by signal discontinuity at the frame switching position. This embodiment does not limit the specific manner of the smoothing; for example, only part of the initial frequency conversion signal may be smoothed. Since the discontinuity problem only occurs at the switching position of two frames, the splice position of the signal, i.e., the switching position of two frames, may be suppressed with the fade-in/fade-out weighting window so as to smooth the initial frequency conversion signal.
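One common realization of such smoothing is a short crossfade at each splice point, sketched below; the overlap-crossfade form and ramp length are assumptions, since the text only requires fade-in/fade-out weighting at the switching positions:

```python
import numpy as np

def crossfade_splice(frames, fade_len):
    """Splice frequency conversion frames, crossfading `fade_len` samples at
    each junction with complementary fade-out/fade-in ramps to suppress the
    'click' caused by a discontinuity at the switching point."""
    out = frames[0].astype(float).copy()
    fade_out = np.linspace(1.0, 0.0, fade_len)
    fade_in = 1.0 - fade_out
    for nxt in frames[1:]:
        nxt = nxt.astype(float)
        # blend the tail of the signal so far with the head of the next frame
        out[-fade_len:] = out[-fade_len:] * fade_out + nxt[:fade_len] * fade_in
        out = np.concatenate([out, nxt[fade_len:]])
    return out
```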
By applying the audio pitch-shifting method provided by the embodiment of the application, the initial frequency conversion signal obtained by direct splicing is smoothed with the fade-in/fade-out weighting window, so that the audible discontinuity caused by signal discontinuity can be avoided.
In another embodiment, discontinuity of signals can be avoided by determining an execution sequence corresponding to each input frame in the processing cycle, so as to avoid a discontinuity problem in hearing, and the step S1013 may include:
step 21: and acquiring and processing a plurality of pitch coefficients in the current period, and determining the median of each pitch coefficient.
In this embodiment, processing cycles are provided; the processing cycles are consecutive, and each processing cycle includes a plurality of consecutive frequency conversion frames (with their corresponding input frames). When determining the execution sequence corresponding to a certain input frame, the pitch change coefficients corresponding to all input frames in the processing cycle in which it is located (i.e., the current processing cycle) may be obtained, and their median determined. The median represents the main range of the pitch change coefficients in the current processing cycle. This embodiment does not limit the specific arrangement of the processing cycles; for example, a preset number of input frames may be set for each processing cycle, or a preset number of processing cycles may be set for each input audio signal. In another embodiment, it may also be detected whether there are multiple consecutive input frames whose pitch change coefficients switch back and forth between states greater than 1 and less than 1, and if so, those input frames are assigned to the same processing cycle.
Step 22: and determining the execution sequence corresponding to each input frame in the processing period according to the size relation between the median and a preset threshold.
The median is compared with the preset threshold to determine their size relationship. This relationship represents the execution sequence appropriate for most input frames in the current processing cycle, and it is determined as the execution sequence for every input frame in the current processing cycle. Since each input frame in the cycle is frequency-converted with the same execution sequence, the frequency conversion frames obtained after processing are continuous as a signal, and the signal is therefore continuous to the ear.
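The median-based decision might look like the following sketch (the returned labels are illustrative names only, mapping to the two orderings of S1012):

```python
import numpy as np

def cycle_execution_order(gammas, threshold=1.0):
    """Choose one execution sequence for all input frames of a processing
    cycle by comparing the median of the cycle's pitch change coefficients
    with the preset threshold."""
    med = float(np.median(gammas))
    # median below threshold -> treat the cycle as pitch-down (resample first);
    # otherwise treat it as pitch-up (time-scale first)
    return 'resample_first' if med < threshold else 'timescale_first'
```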
By applying the audio frequency tone changing method provided by the embodiment of the application, the discontinuity on the signal and the discontinuity on the auditory sense can be avoided without performing other processing on the signal directly obtained after frequency conversion processing.
In another possible embodiment, the above two modes may be combined, that is, after the input frames in the current processing cycle are processed in the same execution order, the switching positions of the processing cycles may be processed by using the fade-in and fade-out weighting window.
Based on the foregoing embodiments, in a specific implementation manner, the present application provides a specific formant envelope extraction process. Referring to fig. 6, fig. 6 is a schematic flow chart of a formant envelope extraction process according to an embodiment of the present application, which includes:
S1021: respectively performing framing processing and fundamental-frequency-based frequency domain conversion processing on the input audio signal and the frequency conversion signal to obtain the frequency conversion spectrum distribution and the input spectrum distribution corresponding to each frame.
Specifically, the fundamental-frequency-based framing determines a framing window using the fundamental frequency and frames the input audio signal and the frequency conversion signal according to that window; after framing, frequency domain conversion processing converts the signals into frequency domain signals, yielding the frequency conversion spectrum distribution and the input spectrum distribution corresponding to each frame. The fundamental frequency is that of the input signal, and its specific obtaining manner is not limited: a fundamental frequency extraction tool may be used, or fundamental frequency information specifying the value of the fundamental frequency may be obtained. The specific form of the extraction tool is not limited either; it may be, for example, Harvest, pYIN, or CREPE.
Specifically, let the length of the fundamental period be T0. When framing, a window of length 3T0 (1.5T0 on each side of the frame center) may be used to frame the input audio signal and the frequency conversion signal. A Hanning window is taken as the window function to obtain the framed frame signal sequence, and frequency domain conversion is performed by STFT (short-time Fourier transform) to obtain the frequency conversion spectrum distribution and the input spectrum distribution of the signals. This embodiment does not limit the window function: a Hanning window is used here as an example, and other window functions may be used in other embodiments. Specifically, when the input audio signal is x_in(t) and the frequency conversion signal is x_fs(t), the corresponding input spectrum distribution and frequency conversion spectrum distribution are:

X_in(ω, n) = F{w_hann(t)·x_in(t)}

X_fs(ω, n) = F{w_hann(t)·x_fs(t)}

where n represents the frame number after framing, F is the short-time Fourier transform, w_hann is the Hanning window, ω represents the digital angular frequency variable of the frame, X_in(ω, n) is the input spectrum distribution, X_fs(ω, n) is the frequency conversion spectrum distribution, and t is time.
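A sketch of the f0-based framing and STFT (the half-window hop is an assumption; the text fixes only the 3·T0 window length and the Hanning window):

```python
import numpy as np

def f0_framed_spectra(x, f0, fs):
    """Frame the signal with a Hanning window of length 3*T0, where
    T0 = fs/f0 samples, hop half a window, and FFT each frame."""
    t0 = int(round(fs / f0))
    win_len = 3 * t0
    hop = win_len // 2            # hop size is an assumption
    window = np.hanning(win_len)
    spectra = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        spectra.append(np.fft.rfft(frame))
    return np.array(spectra)
```

For the 586 Hz voiced example used later in the text (fs = 48000 Hz), T0 ≈ 82 samples, so the window is 246 samples long.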
S1022: and respectively carrying out power spectrum calculation and smoothing treatment on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum.
Power spectrum calculation is performed on the input spectrum distribution and the frequency conversion spectrum distribution respectively using the power spectrum formula, and smoothing is performed over a preset length after the calculation. This embodiment does not limit the specific size of the preset length; for example, 2/3 of the digital angular frequency corresponding to the fundamental frequency may be used as the preset length to linearly smooth the power spectrum. Specifically, the power spectrum formula is:

P(ω) = |F_T(ω)|²

where P(ω) is the power spectrum obtained directly by power spectrum calculation on the signal F_T(ω), which may be called the initial power spectrum; the power spectrum is obtained by smoothing the initial power spectrum. F_T(ω) may specifically be X_in(ω, n) or X_fs(ω, n), i.e., the input spectrum distribution corresponding to any frame obtained after the input audio signal is framed, or the frequency conversion spectrum distribution corresponding to any frame obtained after the frequency conversion signal is framed.
Specifically, the smoothing may be performed using the following formula:

P_s(ω) = conv(P(ω), W_rect(ω))

where P_s(ω) is the power spectrum; specifically, P_s_in(ω) denotes the input power spectrum and P_s_fs(ω) the frequency conversion power spectrum. With ω_0 the length of the fundamental frequency in the frequency domain, W_rect is a normalized rectangular window of length 2ω_0/3 that linearly smooths the power spectrum. In other embodiments, rectangular windows of other lengths may also be used for the smoothing.
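The power spectrum plus rectangular-window smoothing might be sketched as follows; the smoothing window length is passed in directly as a bin count, which would correspond to 2ω_0/3 expressed in bins:

```python
import numpy as np

def smoothed_power_spectrum(spec, smooth_bins):
    """Power spectrum |F(w)|^2 followed by linear smoothing with a
    normalized rectangular window of `smooth_bins` bins."""
    p = np.abs(spec) ** 2
    kernel = np.ones(smooth_bins) / smooth_bins  # rectangular window
    return np.convolve(p, kernel, mode='same')
```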
S1023: and respectively carrying out cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum.
The cepstrum processing first takes the logarithm and then transforms the logarithm back. Specifically, the logarithmic step may be performed as:

P̂_s(ω) = log(P_s(ω))

where log is the logarithmic function, P_s(ω) may specifically be P_s_in(ω) or P_s_fs(ω), and P̂_s(ω) is the intermediate logarithmic data. Then:

ĉ(τ) = F⁻¹{P̂_s(ω)}

performs the cepstrum processing, where ĉ(τ) is the cepstrum. Depending on the input signal, this embodiment uses ĉ_in(τ) to represent the input cepstrum and ĉ_fs(τ) to represent the frequency conversion cepstrum. F⁻¹ is the inverse Fourier transform, and τ is the quefrency argument.
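The log-then-inverse-transform step can be sketched as follows (the small eps guard against log(0) is an implementation assumption):

```python
import numpy as np

def power_cepstrum(power_spec, eps=1e-12):
    """Cepstrum of a (smoothed) one-sided power spectrum: take the log,
    then an inverse real FFT back to the quefrency domain."""
    log_p = np.log(power_spec + eps)   # eps guards against log(0)
    return np.fft.irfft(log_p)
```

A flat power spectrum has a constant log spectrum, so its cepstrum is an impulse at quefrency zero.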
S1024: respectively performing cepstrum windowing and spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by using the cepstrum lifting window to obtain the first formant envelope and the second formant envelope.
After the input cepstrum and the frequency conversion cepstrum are obtained, a cepstrum lifting window can be constructed; cepstrum windowing processing is performed on the input cepstrum and the frequency conversion cepstrum using the cepstrum lifting window, and spectrum recovery processing then restores the data to the frequency domain for the subsequent steps. Specifically, the cepstrum lifting window may be constructed as:

l_s(τ) = sinc(f_0·τ)

l_q(τ) = (1 − 2q) + 2q·cos(2π·f_0·τ)

w(τ) = l_s(τ)·l_q(τ)

where sinc represents the sampling function and q is a constant; as a rule of thumb, q = −0.09. l_s is the sampling function, l_q is a raised cosine function, w(τ) is the cepstrum lifting window function, and cos is the cosine function.

After obtaining the cepstrum lifting window,

ĉ_w(τ) = w(τ)·ĉ(τ)

performs the cepstrum windowing, and

L̂En(ω) = F{ĉ_w(τ)}

performs the spectrum recovery processing to obtain the signal logarithmic formant envelope L̂En(ω). In this embodiment, L̂En_in(ω) represents the first-signal log formant envelope corresponding to the input audio signal, and L̂En_fs(ω) represents the second-signal log formant envelope corresponding to the frequency conversion signal. After the signal log formant envelopes are obtained, it is also necessary to take:

LEn_in(ω) = Re{L̂En_in(ω)}

LEn_fs(ω) = Re{L̂En_fs(ω)}

to obtain the corresponding first formant envelope and second formant envelope, where LEn_in(ω) is the first formant envelope and LEn_fs(ω) is the second formant envelope. It should be noted that, since the input audio signal and the frequency conversion signal are framed, the obtained first formant envelope corresponds to one frame of the input audio signal, and the obtained second formant envelope corresponds to the matching frame of the frequency conversion signal. After every frame is processed, the first formant envelope corresponding to the whole input audio signal and the second formant envelope corresponding to the whole frequency conversion signal are obtained.
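A sketch of the liftering and recovery; the exact raised-cosine form of l_q and taking the real part after the forward transform are assumptions consistent with the description above:

```python
import numpy as np

def lifter_envelope(cepstrum, f0_norm, q=-0.09):
    """Recover a log formant envelope: weight the cepstrum with the lifter
    w(tau) = l_s(tau) * l_q(tau) (sinc times raised cosine, zero at the
    fundamental quefrency and its multiples), then transform back to the
    frequency domain. f0_norm is f0/fs; q = -0.09 as in the text."""
    tau = np.arange(len(cepstrum))
    ls = np.sinc(f0_norm * tau)                                   # sampling function l_s
    lq = (1 - 2 * q) + 2 * q * np.cos(2 * np.pi * f0_norm * tau)  # raised cosine l_q
    w = ls * lq                                                   # cepstrum lifting window
    return np.fft.rfft(cepstrum * w).real                         # back to log-frequency domain
```

Because sinc vanishes at integer arguments, w(τ) is zero at the fundamental quefrency and its multiples, which is what removes the excitation pulse sequence from the envelope.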
Referring to fig. 7, fig. 7 is a graph illustrating a signal curve and the corresponding formant envelope curves according to an embodiment of the present disclosure. It compares the power spectrum curves before and after smoothing: the abscissa is the frequency bin distribution (bins) and the ordinate is the spectral power in decibels. Taking a voiced signal with a sampling rate f_s of 48000 Hz and a fundamental frequency f_0 of 586 Hz as an example, the corresponding fundamental bin is bin = f_s/f_0 ≈ 82. The three curves in fig. 7 represent: the log power spectrum curve of the original signal (power spectrum), the smoothed log spectrum curve obtained by linear smoothing (smoothed spectrum), and the signal log spectrum envelope curve obtained by spectrum lifting, i.e., cepstrum windowing and spectrum recovery (spectral envelope).
Referring to fig. 8, fig. 8 is a schematic diagram of cepstrum lifting function curves and cepstrum curves according to an embodiment of the present disclosure, comparing the curves before and after cepstrum lifting. In the upper waveform diagram of fig. 8, the abscissa is the frequency bin and the ordinate is the function amplitude; in the lower waveform diagram, the abscissa is the quefrency point (quefrency) and the ordinate is the log-cepstrum value. The curves in the upper diagram are the sampling function curve l_s(τ), the raised cosine curve l_q(τ), and the cepstrum lifting window curve l_s(τ)·l_q(τ), i.e., w(τ). Again taking the voiced signal with sampling rate f_s = 48000 Hz and fundamental frequency f_0 = 586 Hz as an example, the corresponding fundamental bin is bin = f_s/f_0 ≈ 82. The w(τ) curve ideally takes the value zero at the fundamental quefrency (bin 82) and its multiples, so the influence of the excitation pulse sequence on the signal envelope extraction can be removed. The lower waveform diagram shows the cepstrum curves of the 586 Hz voiced signal, including the raw cepstrum curve and the weighted cepstrum curve after cepstrum lifting.
By applying the audio pitch-shifting method provided by the embodiment of the application, the formant envelopes corresponding to the input audio signal and the frequency conversion signal can be obtained with only one round of calculation in the manner above, without repeated iterative extraction, which improves the calculation speed and efficiency and reduces the consumption of computing resources and time.
Based on the foregoing embodiments, in a specific implementation manner, in order to obtain accurate spectral coefficients, the step of generating the spectral coefficients using formant envelopes may include:
step 31: and obtaining a spectral coefficient by using the difference between the first formant envelope and the second formant envelope.
In this embodiment, the spectral coefficients may be represented by Mask(ω). In order to obtain accurate spectral coefficients, i.e., to accurately represent the timbre difference between the frequency conversion signal and the input signal, the spectral coefficients may be calculated by subtraction. Specifically:

Mask(ω) = LEn_in(ω) − LEn_fs(ω)

yields the spectral coefficients.
It should be noted that, step 31 is not the only way to calculate the spectral coefficients, and the audio pitch modification method provided by the present application may calculate the spectral coefficients by using other calculation ways, for example, may calculate the spectral coefficients by subtracting, dividing, and the like after weighting, as long as the difference between the timbres of the frequency-converted signal and the input signal can be represented.
In another embodiment, initial spectral coefficients are first calculated directly from the formant envelopes, and the corresponding spectral coefficients are obtained after the initial spectral coefficients are optimized. Specifically, please refer to fig. 9, where fig. 9 is a schematic flow chart of a spectral coefficient obtaining process according to an embodiment of the present disclosure. The step of generating the spectral coefficients using the formant envelopes may comprise:
S1025: generating initial spectral coefficients by using the first formant envelope and the second formant envelope.

In this embodiment, what the first formant envelope and the second formant envelope directly generate are not the spectral coefficients but initial spectral coefficients, which need to be optimized to obtain the spectral coefficients.
S1026: and performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
In this embodiment, the optimization processing may include one or more of convolution smoothing processing, fundamental-frequency-based linear suppression processing, and adjacent coefficient smoothing processing; any one or combination of them may be selected as the specific optimization applied to the initial spectral coefficients to obtain the spectral coefficients. The convolution smoothing processing prevents the spectral coefficients from jittering excessively, and may smooth the initial spectral coefficients by convolution with a triangular window. Specifically:

M(ω) = conv(Mask(ω), W_tri(ω))

performs the convolution smoothing, where conv is the convolution calculation and M(ω) is the convolution-smoothed initial spectral coefficient, which may be used directly as the spectral coefficient or as intermediate data for subsequent processing. W_tri(ω) represents a triangular window function constructed from three data points, which may be, for example, W_tri(ω) = [0.25, 0.50, 0.25].
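The three-point triangular-window convolution can be sketched as:

```python
import numpy as np

def smooth_mask(mask, w_tri=(0.25, 0.5, 0.25)):
    """Convolve the initial spectral coefficients Mask(w) with the
    three-point triangular window W_tri = [0.25, 0.50, 0.25] to tame
    bin-to-bin jitter."""
    return np.convolve(mask, np.asarray(w_tri), mode='same')
```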
The fundamental-frequency-based linear suppression is used to suppress (i.e., fade in) the frequency segments of the initial spectral coefficients (or of intermediate data produced by other optimization steps) whose frequencies are below the fundamental frequency, which prevents the weighting step (i.e., the weighting portion of the weighted tonal modification output processing) from destabilizing the low-frequency signal energy. For example, applying the fundamental-frequency-based linear suppression to the convolution-smoothed initial spectral coefficients may proceed as follows:
Ma(ω) = (ω/ω0)·M(ω), ω < ω0;  Ma(ω) = M(ω), ω ≥ ω0
performs the fundamental-frequency-based linear suppression, where ω0 is the fundamental frequency and the second case applies when ω ≥ ω0. The input in the formula above is the convolution-smoothed initial spectral coefficient M(ω); in other embodiments, the initial spectral coefficient itself, or intermediate data produced by other optimization steps, may be used as the input. The output Ma(ω) may be used directly as the spectral coefficient, or as intermediate data for further optimization processing.
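A minimal sketch of the fundamental-frequency-based linear suppression, assuming the linear ramp ω/ω0 reconstructed from the description above (the function name and the test frequencies are illustrative):

```python
import numpy as np

def suppress_below_f0(m, omega, omega0):
    """Fade in coefficients below the fundamental frequency omega0:
    Ma(w) = (w/omega0) * M(w) for w < omega0, Ma(w) = M(w) otherwise.
    The linear ramp w/omega0 is an assumed form of the suppression."""
    omega = np.asarray(omega, dtype=float)
    gain = np.where(omega < omega0, omega / omega0, 1.0)
    return gain * np.asarray(m, dtype=float)

# coefficients below the fundamental (100) are faded in linearly
ma = suppress_below_f0([2.0, 2.0, 2.0, 2.0],
                       omega=[0.0, 50.0, 100.0, 200.0], omega0=100.0)
```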
To ensure that the spectral coefficients transition smoothly in the time domain, the spectral coefficients of each frame may be further smoothed using the spectral coefficients of its adjacent frames. Specifically:
Mak(ω,n)=α·Ma(ω,n-1)+β·Ma(ω,n)+ξ·Ma(ω,n+1)
performs the adjacent-coefficient smoothing, where α, β and ξ are real numbers in [0,1] and α + β + ξ = 1. Their specific values are not limited; for example, all three may be 1/3, or they may be 0.25, 0.5 and 0.25 in order. The formula above smooths over one frame on each side; in other embodiments, two, three, or more frames on each side may be used, which this embodiment does not limit. In this embodiment, the adjacent-coefficient smoothing is the last optimization step; in other embodiments, the three (or two) optimization steps may be executed in a different order to optimize the initial spectral coefficients and obtain the spectral coefficients.
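The adjacent-coefficient smoothing can be sketched as follows; the boundary handling (edge frames reusing themselves as the missing neighbour) is an assumption not specified in the patent:

```python
import numpy as np

def smooth_over_frames(ma, alpha=0.25, beta=0.5, xi=0.25):
    """Adjacent-coefficient smoothing across frames:
    Ma_k(w, n) = alpha*Ma(w, n-1) + beta*Ma(w, n) + xi*Ma(w, n+1).
    ma has shape (frames, bins)."""
    ma = np.asarray(ma, dtype=float)
    prev = np.vstack([ma[:1], ma[:-1]])  # Ma(w, n-1), clamped at n = 0
    nxt = np.vstack([ma[1:], ma[-1:]])   # Ma(w, n+1), clamped at the end
    return alpha * prev + beta * ma + xi * nxt

# a single-frame spike (4.0) is spread over its neighbours: 1.0, 2.0, 1.0
mak = smooth_over_frames([[0.0], [4.0], [0.0]])
```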
By applying the audio tonal modification method provided by this embodiment of the present application, the spectral coefficients can be optimized, so that the timbre of the output audio subsequently obtained with these spectral coefficients is better preserved.
Based on the above embodiments, the present embodiment will explain a specific weighted tonal output processing procedure. Referring to fig. 10, fig. 10 is a schematic flowchart of an output audio obtaining process according to an embodiment of the present application, and the step S103 includes:
S1031: and multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum.
In this embodiment, Ma(ω) may be used as the spectral coefficient. Specifically, the spectral coefficient is applied to the spectrum of the frequency conversion signal, i.e., the frequency conversion spectrum distribution, so that the frequency conversion signal is corrected in weighted form based on the input audio signal. It can be understood that, since the spectral coefficient corresponds to a certain frequency conversion frame in the frequency conversion signal, the frequency conversion spectrum distribution it multiplies is likewise the spectrum corresponding to that frame. Specifically, the weighted spectrum may be obtained according to
Xps(ω,n)=Ma(ω)·Xfs(ω,n)
obtaining the weighted spectrum, where Xps(ω, n) is the weighted spectrum and Xfs(ω, n) is the frequency conversion spectrum distribution.
S1032: and carrying out time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio.
Because the weighted spectrum is still a frequency-domain signal, it must be converted back into a time-domain signal; a window function (for example, a Hanning window) is applied as a coefficient during the conversion so that the tonal modification can then be output, yielding the time-domain output audio. Specifically, the following may be used:
xps(t, n) = w(t)·IDFT{Xps(ω, n)}
to obtain the time-domain output audio, where t denotes the time-domain sequence, n denotes the frame sequence, i.e., the frame number, w(t) denotes the window function, and IDFT denotes the inverse discrete Fourier transform. The weighted spectrum corresponding to each input frame is converted according to this procedure, yielding the time-domain output audio xps(t, n) corresponding to each input frame.
S1033: and carrying out overlap-add processing on the time domain output audio to obtain the output audio.
After all the time-domain output audio frames are obtained, they need to be spliced together to obtain the output audio. Specifically, the following may be used:
xps(t)=OLA{xps(t,n)}
resulting in the output audio, where OLA denotes overlap-add processing.
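Steps S1031 to S1033 can be sketched together as follows; the frame length, hop size, and the use of NumPy's real FFT pair are illustrative assumptions, not the patent's prescribed implementation:

```python
import numpy as np

def weighted_tonal_output(Xfs_frames, Ma, frame_len, hop):
    """Weight each frequency conversion frame's spectrum by Ma(w)
    (S1031), convert it back to the time domain through a Hanning
    window (S1032), then overlap-add the frames (S1033)."""
    w = np.hanning(frame_len)
    out = np.zeros(hop * (len(Xfs_frames) - 1) + frame_len)
    for n, Xfs in enumerate(Xfs_frames):
        Xps = Ma * Xfs                              # S1031: weighted spectrum
        xps = w * np.fft.irfft(Xps, frame_len)      # S1032: windowed time-domain frame
        out[n * hop : n * hop + frame_len] += xps   # S1033: overlap-add
    return out

# two toy frames of a constant signal, unity spectral coefficients
frames = [np.fft.rfft(np.ones(8)), np.fft.rfft(np.ones(8))]
audio = weighted_tonal_output(frames, Ma=np.ones(5), frame_len=8, hop=4)
```

With a 50% hop the windowed frames overlap in the middle of the output buffer, which is what smooths the frame boundaries away.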
By applying the audio tonal modification method provided by this embodiment of the present application, when the input audio signal is divided into several input frames and the frequency conversion signal is correspondingly divided into several frequency conversion frames, a time-domain output audio is obtained for each frequency conversion frame, and these are spliced into the output audio by overlap-add processing. This allows the user to apply different degrees of tonal modification to different parts of the input audio signal, improving the flexibility of the tonal modification processing.
The following describes an audio tonal modification apparatus provided by an embodiment of the present application, and the audio tonal modification apparatus described below and the audio tonal modification method described above may be referred to correspondingly.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an audio frequency tone modifying apparatus according to an embodiment of the present application, including:
the frequency conversion module 110 is configured to perform frequency conversion processing on an input audio signal to obtain a frequency-converted signal;
a spectral coefficient generating module 120, configured to extract a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency conversion signal, and generate a spectral coefficient by using the first formant envelope and the second formant envelope;
and an output audio generating module 130, configured to perform weighted tonal modification output processing on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectrum coefficient, so as to obtain an output audio.
In one embodiment, the spectral coefficient generation module 120 includes:
and the difference processing unit is used for obtaining the frequency spectrum coefficient by utilizing the difference between the first formant envelope and the second formant envelope.
In one embodiment, the spectral coefficient generation module 120 includes:
an initial spectral coefficient generating unit for generating an initial spectral coefficient using the formant envelope;
and the optimization unit is used for performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
In one embodiment, the output audio generating module 130 includes:
the weighting unit is used for multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum;
the time domain conversion processing unit is used for carrying out time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio;
and the overlap-add processing unit is used for performing overlap-add processing on the time domain output audio to obtain the output audio.
In one embodiment, the spectral coefficient generation module 120 includes:
a frequency domain generating unit, configured to perform frame division processing and frequency domain conversion processing based on a fundamental frequency on the input audio signal and the frequency-converted signal, respectively, to obtain frequency-converted frequency spectrum distribution and input frequency spectrum distribution corresponding to each frame;
the frequency spectrum smoothing processing unit is used for respectively carrying out power spectrum calculation and smoothing processing on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum;
the cepstrum processing unit is used for respectively carrying out logarithmic cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum;
and the lifting processing unit is used for performing cepstrum windowing and frequency spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by using a cepstrum lifting window respectively to obtain a first formant envelope and a second formant envelope.
In one embodiment, the frequency conversion module 110 includes:
the framing unit is used for framing the input audio signal and acquiring a tonal modification coefficient corresponding to each input frame;
the variable frequency processing unit is used for determining an execution sequence by using the tonal modification coefficient, and sequentially carrying out sampling processing and variable speed processing on the input frame according to the execution sequence to obtain a variable frequency frame;
and the splicing processing unit is used for splicing the frequency conversion frames to obtain the frequency conversion signals.
In one embodiment, a variable frequency processing unit includes:
a median determining subunit, configured to acquire a plurality of tonal modification coefficients in a current processing period and determine the median of the tonal modification coefficients;
and the sequence determining subunit is configured to determine the execution sequence corresponding to each input frame in the processing cycle according to a size relationship between the median and a preset threshold.
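The median-based selection of the execution order can be sketched as follows; the concrete mapping from the comparison result to an order, and the threshold value of 1.0 (i.e., no pitch change), are hypothetical illustrations — the patent only states that the order follows the relationship between the median and a preset threshold:

```python
import statistics

def choose_execution_order(coeffs, threshold=1.0):
    """Pick the sampling/speed-change execution order for a processing
    period from the median tonal modification coefficient."""
    med = statistics.median(coeffs)
    if med >= threshold:
        return "resample_then_speed_change"
    return "speed_change_then_resample"
```

For example, a period whose coefficients are mostly above 1 would take one order, and a period mostly below 1 the other.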
In one embodiment, a stitching processing unit includes:
the initial frequency conversion signal acquisition subunit is used for splicing the frequency conversion frames to obtain an initial frequency conversion signal;
and the smoothing processing subunit is used for smoothing the initial frequency conversion signal by using a gradual-in and gradual-out weighting window to obtain the frequency conversion signal.
The following describes a computer-readable storage medium provided in an embodiment of the present application, and the computer-readable storage medium described below and the audio tonal modification method described above may be referred to correspondingly.
The present application further provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the audio tonal modification method described above.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprise", "include", or any variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. An audio tonal modification method, comprising:
carrying out frequency conversion processing on an input audio signal to obtain a frequency conversion signal;
extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the variable frequency signal, and generating a spectral coefficient by using the first formant envelope and the second formant envelope;
and carrying out weighted tonal modification output processing on the frequency conversion frequency spectrum distribution of the frequency conversion signal by using the frequency spectrum coefficient to obtain an output audio.
2. The audio pitch modification method of claim 1, wherein the generating spectral coefficients using the first formant envelope and the second formant envelope comprises:
and obtaining the spectral coefficient by using the difference between the first formant envelope and the second formant envelope.
3. The audio pitch modification method of claim 1, wherein the generating spectral coefficients using the first formant envelope and the second formant envelope comprises:
generating initial spectral coefficients using the first and second formant envelopes;
and performing convolution smoothing processing and/or linear suppression processing based on fundamental frequency and/or adjacent coefficient smoothing processing on the initial spectral coefficient to obtain the spectral coefficient.
4. The audio frequency tonal modification method of claim 1, wherein the performing a weighted tonal modification output process on the frequency conversion spectrum distribution of the frequency conversion signal by using the spectrum coefficient to obtain an output audio frequency comprises:
multiplying the frequency spectrum coefficient by the frequency conversion frequency spectrum distribution to obtain a weighted frequency spectrum;
performing time domain conversion processing based on a window function on the weighted frequency spectrum to obtain time domain output audio;
and carrying out overlap-add processing on the time domain output audio to obtain the output audio.
5. The audio pitch shifting method of claim 1, wherein said extracting a first formant envelope corresponding to the input audio signal and a second formant envelope corresponding to the frequency-converted signal comprises:
respectively carrying out frame processing and frequency domain conversion processing based on fundamental frequency on the input audio signal and the frequency conversion signal to obtain frequency conversion spectrum distribution and input spectrum distribution corresponding to each frame;
respectively carrying out power spectrum calculation and smoothing processing on the frequency conversion frequency spectrum distribution and the input frequency spectrum distribution to obtain a frequency conversion power spectrum and an input power spectrum;
respectively carrying out cepstrum processing on the variable frequency power spectrum and the input power spectrum to obtain a variable frequency cepstrum and an input cepstrum;
and respectively carrying out cepstrum windowing and frequency spectrum recovery processing on the frequency conversion cepstrum and the input cepstrum by utilizing a cepstrum lifting window to obtain the first formant envelope and the second formant envelope.
6. The audio tonal modification method according to any of claims 1 to 5, wherein the frequency conversion processing the input audio signal to obtain a frequency-converted signal comprises:
performing frame division processing on the input audio signal, and acquiring a tonal modification coefficient corresponding to each input frame;
determining an execution sequence by using the pitch-changing coefficient, and sequentially carrying out sampling processing and speed changing processing on the input frame according to the execution sequence to obtain a frequency-changing frame;
and splicing the frequency conversion frames to obtain the frequency conversion signal.
7. The audio transposition method of claim 6, wherein the determining the execution order using the transposition coefficients comprises:
acquiring a plurality of tonal modification coefficients in a current processing period, and determining a median of each tonal modification coefficient;
and determining the execution sequence corresponding to each input frame in the processing period according to the size relation between the median and a preset threshold.
8. The audio pitch shifting method of claim 6, wherein the splicing the frequency-converted frames to obtain the frequency-converted signal comprises:
splicing the frequency conversion frames to obtain an initial frequency conversion signal;
and smoothing the initial frequency conversion signal by using a gradually-in and gradually-out weighting window to obtain the frequency conversion signal.
9. An audio tonal modification device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor for executing the computer program to implement the audio tonal modification method as claimed in any of claims 1 to 8.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the audio tonal modification method as claimed in any of claims 1 to 8.
CN202110083776.4A 2021-01-21 2021-01-21 Audio tone changing method, device, equipment and storage medium Pending CN112908351A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110083776.4A CN112908351A (en) 2021-01-21 2021-01-21 Audio tone changing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112908351A true CN112908351A (en) 2021-06-04

Family

ID=76118172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110083776.4A Pending CN112908351A (en) 2021-01-21 2021-01-21 Audio tone changing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112908351A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782050A (en) * 2021-09-08 2021-12-10 浙江大华技术股份有限公司 Sound tone changing method, electronic device and storage medium
CN114067784A (en) * 2021-11-24 2022-02-18 云知声智能科技股份有限公司 Training method and device of fundamental frequency extraction model and fundamental frequency extraction method and device
CN114121029A (en) * 2021-12-23 2022-03-01 北京达佳互联信息技术有限公司 Training method and device of speech enhancement model and speech enhancement method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101354889A (en) * 2008-09-18 2009-01-28 北京中星微电子有限公司 Method and apparatus for tonal modification of voice
US20130044889A1 (en) * 2011-08-15 2013-02-21 Oticon A/S Control of output modulation in a hearing instrument
CN109410973A (en) * 2018-11-07 2019-03-01 北京达佳互联信息技术有限公司 Voice change process method, apparatus and computer readable storage medium
CN111383646A (en) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 Voice signal transformation method, device, equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
徐欣; 李枚亭: "Research on voice conversion based on the spectral envelope algorithm", Digital Technology and Application (数字技术与应用), no. 09 *
潘涛 et al.: "Research and implementation of formant extraction from speech signals based on different algorithms", Gansu Science and Technology (甘肃科技), vol. 35, no. 22, 30 November 2019 (2019-11-30) *
赵力: "Speech Signal Processing" (《语音信号处理》), 30 April 2003, China Machine Press (机械工业出版社) *


Similar Documents

Publication Publication Date Title
US9294060B2 (en) Bandwidth extender
CN112908351A (en) Audio tone changing method, device, equipment and storage medium
Le Roux et al. Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction.
JP4945586B2 (en) Signal band expander
JP2018510374A (en) Apparatus and method for processing an audio signal to obtain a processed audio signal using a target time domain envelope
CN111508508A (en) Super-resolution audio generation method and equipment
US20140019125A1 (en) Low band bandwidth extended
US20230343348A1 (en) Machine-Learned Differentiable Digital Signal Processing
Marafioti et al. Audio inpainting of music by means of neural networks
JP6821970B2 (en) Speech synthesizer and speech synthesizer
CN111739544A (en) Voice processing method and device, electronic equipment and storage medium
CN113241082A (en) Sound changing method, device, equipment and medium
CN113421584B (en) Audio noise reduction method, device, computer equipment and storage medium
WO2023224550A1 (en) Method and system for real-time and low latency synthesis of audio using neural networks and differentiable digital signal processors
JP7103390B2 (en) Acoustic signal generation method, acoustic signal generator and program
JP2019074580A (en) Speech recognition method, apparatus and program
JP4645869B2 (en) DIGITAL SIGNAL PROCESSING METHOD, LEARNING METHOD, DEVICE THEREOF, AND PROGRAM STORAGE MEDIUM
CN113113033A (en) Audio processing method and device and readable storage medium
Hanna et al. Time scale modification of noises using a spectral and statistical model
Samui et al. FPGA implementation of a phase-aware single-channel speech enhancement system
Zivanovic Harmonic bandwidth companding for separation of overlapping harmonics in pitched signals
US20240161762A1 (en) Full-band audio signal reconstruction enabled by output from a machine learning model
EP4018440B1 (en) Multi-lag format for audio coding
JP4419486B2 (en) Speech analysis generation apparatus and program
RU2825309C2 (en) Multiple-delay audio encoding format

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination