CN117524240A - Voice sound changing method, device, equipment and storage medium - Google Patents

Voice sound changing method, device, equipment and storage medium

Info

Publication number
CN117524240A
CN117524240A (application number CN202311447295.2A)
Authority
CN
China
Prior art keywords
signal
formant
original
voice signal
original voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311447295.2A
Other languages
Chinese (zh)
Inventor
宋明辉
王红丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhongke Lanxun Technology Co ltd
Original Assignee
Shenzhen Zhongke Lanxun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhongke Lanxun Technology Co ltd filed Critical Shenzhen Zhongke Lanxun Technology Co ltd
Priority to CN202311447295.2A
Publication of CN117524240A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013: Adapting to target pitch
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/12: the extracted parameters being prediction coefficients
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/15: the extracted parameters being formant information
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/24: the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • G10L 21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013: Adapting to target pitch
    • G10L 2021/0135: Voice conversion or morphing

Abstract

The application provides a voice changing method, apparatus, device, and storage medium. The method includes: acquiring an original voice signal to be processed; and performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal. According to this technical scheme, the formants of the original voice signal are corrected from the two aspects of cepstrum information and linear prediction coefficients, so that the formant structure of the original voice signal can be adjusted more accurately, the timbre after voice changing is more natural and realistic, and the voice-changing effect is improved.

Description

Voice sound changing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a method, apparatus, device, and storage medium for voice conversion.
Background
Voice changers, also known as voice transformers, are widely used in many areas of society. For example, a voice changer can disguise a person's voice when reporting wrongdoing, to guard against retaliation; as another example, it can disguise the voices of women and children living alone, to deal with nuisance calls and visits from strangers; as another example, it may be used in games that call for voice effects, and so on.
At present, most voice changers achieve voice changing mainly by simple alteration of the human voice, and the resulting voice-changing effect is not good enough.
Disclosure of Invention
The application provides a voice changing method, apparatus, device, and storage medium, aiming to solve the technical problem that the voice-changing effect achieved by simply altering the human voice is not good enough.
In a first aspect, a method for voice conversion is provided, including:
acquiring an original voice signal to be processed;
performing formant cepstrum correction and formant linear prediction coefficient (LPC) correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal.
In this technical scheme, after the original voice signal to be processed is obtained, formant cepstrum correction and formant linear prediction coefficient correction are performed on it to obtain the corresponding voice-changed speech signal. Correcting the formants of the original voice signal changes its timbre, achieving the voice-changing effect; and because the formants are corrected from the two aspects of cepstrum information and linear prediction coefficients, the formant structure of the original voice signal can be adjusted more accurately, so that the timbre after voice changing is more natural and realistic and the voice-changing effect is improved.
With reference to the first aspect, in one possible implementation, performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain the voice-changed speech signal corresponding to the original voice signal includes: performing cepstrum information transformation on the original voice signal to obtain a first formant correction factor; performing linear prediction coefficient transformation on the original voice signal to obtain a second formant correction factor; and correcting the formants of the original voice signal according to the first formant correction factor and the second formant correction factor, to obtain the voice-changed speech signal corresponding to the original voice signal.
With reference to the first aspect, in one possible implementation, performing cepstrum information transformation on the original voice signal to obtain the first formant correction factor includes: calculating the logarithmic spectrum of the original voice signal to obtain a first logarithmic spectrum signal; performing scaling processing on the first logarithmic spectrum signal to obtain a second logarithmic spectrum signal; calculating the signal difference between the second logarithmic spectrum signal and the first logarithmic spectrum signal to obtain a difference logarithmic spectrum signal; performing an inverse Fourier transform on the difference logarithmic spectrum signal to obtain a differential cepstrum signal; and determining the first formant correction factor according to the differential cepstrum signal. By scaling the logarithmic spectrum of the original voice signal and taking the signal difference between the scaled logarithmic spectrum signal and the original logarithmic spectrum signal to obtain the first formant correction factor, the formant structure of the original voice signal can be corrected from the aspect of the cepstrum signal.
With reference to the first aspect, in one possible implementation, performing scaling processing on the first logarithmic spectrum signal to obtain the second logarithmic spectrum signal includes: performing an interpolation operation on the first logarithmic spectrum signal to obtain the second logarithmic spectrum signal.
With reference to the first aspect, in one possible implementation, performing linear prediction coefficient transformation on the original voice signal to obtain the second formant correction factor includes: calculating a linear prediction normalized envelope coefficient of the original voice signal; and performing scaling processing on the linear prediction normalized envelope coefficient to obtain the second formant correction factor. By scaling the linear prediction normalized envelope coefficient of the original voice signal to obtain the second formant correction factor, the formant structure of the original voice signal can be corrected from the LPC aspect.
With reference to the first aspect, in one possible implementation, performing scaling processing on the linear prediction normalized envelope coefficient to obtain the second formant correction factor includes: performing an interpolation operation on the linear prediction normalized envelope coefficient to obtain the second formant correction factor.
With reference to the first aspect, in one possible implementation, correcting the formants of the original voice signal according to the first formant correction factor and the second formant correction factor to obtain the voice-changed speech signal corresponding to the original voice signal includes: fusing the first formant correction factor and the second formant correction factor to obtain a formant fitting factor; and correcting the formants of the original voice signal with the formant fitting factor to obtain the voice-changed speech signal.
With reference to the first aspect, in one possible implementation, correcting the formants of the original voice signal with the formant fitting factor to obtain the voice-changed speech signal includes: multiplying the formant fitting factor with the frequency-domain signal corresponding to the original voice signal to obtain the voice-changed speech signal.
In a second aspect, there is provided a voice conversion apparatus comprising:
the voice signal acquisition module is used for acquiring an original voice signal to be processed;
and the correction module is configured to perform formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal.
In a third aspect, there is provided a computer device comprising a memory and one or more processors, the memory being connected to the one or more processors, the one or more processors being configured to execute one or more computer programs stored in the memory, the one or more processors, when executing the one or more computer programs, causing the computer device to implement the speech sound modification method of the first aspect.
In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the speech sound modification method of the first aspect.
The application can realize the following technical effects: correcting the formants of the original voice signal changes its timbre, achieving the voice-changing effect; and correcting the formants from the two aspects of cepstrum information and linear prediction coefficients allows the formant structure of the original voice signal to be adjusted more accurately, so that the timbre after voice changing is more natural and realistic and the voice-changing effect is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a voice changing method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a voice sound conversion device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that, if not conflicting, the various features in the embodiments of the present application may be combined with each other, which is within the protection scope of the present application. In addition, while functional block division is performed in a device diagram and logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. Moreover, the words "first," "second," "third," and the like as used herein do not limit the data and order of execution, but merely distinguish between identical or similar items that have substantially the same function and effect.
The technical scheme of the application can be applied to signal processing scenarios, and in particular to voice signal processing scenarios, where it performs a voice transformation on an original voice signal to obtain the corresponding voice-changed speech signal.
The technical scheme of the application can be applied to computer equipment with a signal processing function. The following specifically describes the technical scheme of the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart of a voice sound conversion method according to an embodiment of the present application, as shown in fig. 1, the method includes the following steps:
s101, acquiring an original voice signal to be processed.
Here, the original voice signal is the voice signal on which voice changing is to be performed, and it can be obtained either locally on the device or from a remote device. When the original voice signal is obtained locally, it can be picked up by a sound-detection component such as a microphone in the device; when it is obtained from a remote device, the original voice signal transmitted by the remote device is received.
Since the speech signal is a continuous signal, it is generally processed frame by frame, based on the short-time stationarity of speech. After the original voice signal to be processed is obtained, it can be divided into frames to obtain a plurality of speech signal frames, where each frame can be expressed as x_K(n), n = 1, 2, ..., M; M is the frame length, i.e. the number of sampling points in one speech signal frame, and K denotes the frame index. When the original voice signal is framed, two adjacent speech signal frames can overlap: the speech signal corresponding to the last m sampling points of frame x_K(n) is the speech signal corresponding to the first m sampling points of the next frame x_{K+1}(n), where x_K(n) and x_{K+1}(n) are two adjacent frames obtained by framing. The sampling-point overlap ratio m/M between two adjacent speech signal frames may be, for example, 50% or 75%. Taking M = 8 and an overlap ratio of 50% as an example, if x_K(n) = {y_{t-4}, y_{t-3}, y_{t-2}, y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}}, then x_{K+1}(n) = {y_t, y_{t+1}, y_{t+2}, y_{t+3}, y_{t+4}, y_{t+5}, y_{t+6}, y_{t+7}}, where y_t denotes the time-domain speech signal at any one sampling point.
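The framing-with-overlap scheme above can be sketched as follows; this is a minimal illustration, and the helper name `frame_signal` and its defaults are ours, not the patent's:

```python
import numpy as np

def frame_signal(x, frame_len, overlap_ratio=0.5):
    # Split a 1-D signal into overlapping frames x_K(n); with a 50% overlap
    # ratio, the last m = frame_len/2 samples of frame K equal the first m
    # samples of frame K+1, matching the M = 8 example above.
    hop = int(frame_len * (1 - overlap_ratio))   # samples advanced per frame
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[k * hop : k * hop + frame_len] for k in range(n_frames)])

x = np.arange(16)
frames = frame_signal(x, frame_len=8, overlap_ratio=0.5)
```

With a 16-sample input this yields three frames, each sharing four samples with its neighbor.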
S102. Performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal.
Here, performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal means comprehensively correcting its formant structure from the two aspects of cepstrum information and linear prediction coefficients, so as to obtain the corresponding voice-changed speech signal.
In some possible cases, the formant cepstrum correction and formant linear prediction coefficient correction can be performed through the following steps A1-A3 to obtain the voice-changed speech signal corresponding to the original voice signal:
A1. Performing cepstrum information transformation on the original voice signal to obtain a first formant correction factor.
Here, the cepstrum information transformation may be performed on the original voice signal through the following steps A11-A15 to obtain the first formant correction factor:
A11. Calculating the logarithmic spectrum of the original voice signal to obtain a first logarithmic spectrum signal.
A fast Fourier transform (FFT) may be performed on the original voice signal to convert it from the time domain to the frequency domain, giving the corresponding frequency-domain signal; the absolute value of the frequency-domain signal is then taken to obtain the magnitude spectrum of the original voice signal; finally, the logarithm of the magnitude spectrum gives the first logarithmic spectrum signal.
The first logarithmic spectrum signal corresponding to each speech signal frame may be expressed as LF1 = log|X_K(n)|, where X_K(n) = FFT[x_K(n)], x_K(n) denotes the K-th speech signal frame of the original voice signal, X_K(n) denotes the corresponding frequency-domain signal, and |X_K(n)| denotes the magnitude spectrum of the K-th speech signal frame. In X_K(n) and |X_K(n)|, n denotes the frequency-point index of the frequency-domain signal, n = 1, 2, ..., M; one frequency point of the frequency-domain signal corresponds to one sampling point of the time-domain signal.
A12. Performing scaling processing on the first logarithmic spectrum signal to obtain a second logarithmic spectrum signal.
In a possible implementation, an interpolation operation may be performed on the first logarithmic spectrum signal to obtain the second logarithmic spectrum signal, which may be expressed as LF2. The interpolation operation may be performed by single linear interpolation, bilinear interpolation, or Lagrange interpolation.
Taking single linear interpolation as an example, the first logarithmic spectrum signal can be interpolated by the following formula to obtain the second logarithmic spectrum signal:
Y = Y_0 + (X - X_0)(Y_1 - Y_0)/(X_1 - X_0)
where (X_0, Y_0) and (X_1, Y_1) are the coordinates of two adjacent frequency points in the first logarithmic spectrum signal, and (X, Y) are the coordinates of the frequency point obtained by interpolation; X refers to the frequency-point value and Y to the signal value at that frequency point.
Scaling the first logarithmic spectrum signal by interpolation keeps the overall shape of the logarithmic spectrum signal unchanged.
Alternatively, the second logarithmic spectrum signal may also be obtained by resampling the first logarithmic spectrum signal. The present application is not limited in this respect.
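Steps A11 and A12 can be sketched as follows; `np.interp` stands in for the single linear interpolation described above, and the stretch `factor` and the small floor added before the logarithm are our assumptions, not values from the text:

```python
import numpy as np

def log_spectrum(frame):
    # A11: LF1 = log|FFT(x_K(n))|; the 1e-12 floor guards against log(0)
    return np.log(np.abs(np.fft.fft(frame)) + 1e-12)

def scale_log_spectrum(lf1, factor):
    # A12: resample the log spectrum along the frequency axis by linear
    # interpolation; the overall shape of the curve is preserved
    n = len(lf1)
    return np.interp(np.arange(n) / factor, np.arange(n), lf1)

frame = np.sin(2 * np.pi * 5 * np.arange(64) / 64)
lf1 = log_spectrum(frame)
lf2 = scale_log_spectrum(lf1, factor=1.2)
```

The scaled signal LF2 has the same length as LF1, so the later point-by-point difference LF3 = LF2 - LF1 is well defined.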
A13, calculating a signal difference value between the second logarithmic spectrum signal and the first logarithmic spectrum signal to obtain a difference value logarithmic spectrum signal.
Here, calculating the signal difference between the second logarithmic spectrum signal and the first logarithmic spectrum signal refers to calculating the difference between the signal value of the second logarithmic spectrum signal and the signal value of the first logarithmic spectrum signal corresponding to the same frequency point, so as to obtain a difference logarithmic spectrum signal.
The difference logarithmic spectrum signal is calculated as LF3 = LF2 - LF1, where LF3 is the signal value of the difference logarithmic spectrum signal (its Y value), LF2 is the signal value of the second logarithmic spectrum signal (its Y value), and LF1 is the signal value of the first logarithmic spectrum signal (its Y value); LF3, LF2 and LF1 correspond to the same X value.
A14. Performing an inverse Fourier transform on the difference logarithmic spectrum signal to obtain a differential cepstrum signal.
The differential cepstrum signal can be expressed as CEP = IFFT(LF3).
A15. Determining the first formant correction factor according to the differential cepstrum signal.
An FFT operation may be performed on the differential cepstrum signal; the real part of the FFT result is taken, and an exponential with the natural constant e as the base and that real part as the exponent is then computed, giving the first formant correction factor.
The first formant correction factor is calculated as:
f1 = e^real(FFT(CEP))
where f1 is the first formant correction factor and real(·) denotes taking the real part.
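Steps A13-A15, combined with A11-A12 above, can be sketched end to end; the stretch `factor` is a hypothetical tuning parameter, and note that since FFT(IFFT(LF3)) = LF3, the factor reduces to e^LF3 when LF3 is real:

```python
import numpy as np

def first_formant_factor(frame, factor=1.1):
    lf1 = np.log(np.abs(np.fft.fft(frame)) + 1e-12)            # A11: first log spectrum
    n = len(lf1)
    lf2 = np.interp(np.arange(n) / factor, np.arange(n), lf1)  # A12: scaled log spectrum
    lf3 = lf2 - lf1                                            # A13: difference log spectrum
    cep = np.fft.ifft(lf3)                                     # A14: differential cepstrum CEP
    return np.exp(np.real(np.fft.fft(cep)))                    # A15: f1 = e^real(FFT(CEP))

frame = np.cos(2 * np.pi * 3 * np.arange(64) / 64)
f1 = first_formant_factor(frame)
```

Because it is an exponential of a real quantity, f1 is strictly positive, so it can safely multiply a magnitude spectrum.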
A2, performing linear prediction coefficient transformation on the original voice signal to obtain a second formant correction factor.
Here, the second formant correction factor may be obtained by performing linear prediction coefficient transformation on the original voice signal through the following steps A21-A22:
a21, calculating a linear prediction normalized envelope coefficient of the original voice signal.
Here, the p-order prediction coefficient of the original speech signal may be calculated, and then the linear prediction normalized envelope coefficient may be determined from the p-order prediction coefficient of the original speech signal.
In one possible implementation, the p-order prediction coefficients of the original speech signal may be calculated based on a Levinson-Durbin recursive algorithm. The specific calculation mode for calculating the p-order prediction coefficient of the original voice signal based on the Levinson-Durbin recursive algorithm is as follows:
(1) An FFT operation is performed on the original voice signal to convert it from the time domain to the frequency domain, giving the corresponding frequency-domain signal X_K(n); the meaning of X_K(n) is as described in step A11.
(2) The autocorrelation coefficient r(j) of the frequency-domain signal corresponding to the original voice signal is calculated; the lag-j autocorrelation takes the form r(j) = Σ_{n=1}^{M-j} X_K(n) · X_K(n+j), j = 0, 1, ..., p.
(3) The p-order prediction coefficients are determined from the autocorrelation coefficients r(j) by the Levinson-Durbin recursion, the final result being A_i = a_j^(p), 1 ≤ j ≤ p, where A_i denotes the p-order prediction coefficients.
Alternatively, the p-order prediction coefficients of the original voice signal can be calculated based on the Schur recursion algorithm, or based on the covariance method, the lattice method, and the like; the present application is not limited in this respect.
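A textbook Levinson-Durbin recursion is sketched below for reference; this is the standard algorithm, not necessarily the patent's exact implementation:

```python
import numpy as np

def levinson_durbin(r, p):
    # Solve the order-p linear prediction problem from autocorrelation
    # values r[0..p]; returns coefficients a[0..p] (with a[0] = 1) and
    # the final prediction error.
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # reflection coefficient from the current residual correlation
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# For r(j) = 0.5**j (an AR(1) process), the order-2 solution has a[1] = -0.5, a[2] = 0
r = np.array([1.0, 0.5, 0.25])
a, err = levinson_durbin(r, 2)
```

The recursion costs O(p^2), which is why it is preferred over directly inverting the Toeplitz autocorrelation matrix.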
After the p-order prediction coefficients of the original voice signal are obtained, the linear prediction normalized envelope coefficient of the original voice signal can be calculated according to the following formula:
F_K = |FFT[A_i * R_K]|
where F_K is the linear prediction normalized envelope coefficient of the original voice signal.
A22. Performing scaling processing on the linear prediction normalized envelope coefficient to obtain the second formant correction factor.
In a possible implementation, an interpolation operation may be performed on the linear prediction normalized envelope coefficient to obtain the second formant correction factor, which may be expressed as f2. As with the second logarithmic spectrum signal, the interpolation operation may be performed by single linear interpolation, bilinear interpolation, or Lagrange interpolation.
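Steps A21-A22 can be sketched as follows. Because the text does not fully define R_K in F_K = |FFT[A_i * R_K]|, the standard 1/|A(e^jw)| LPC spectral envelope is used here as a stand-in, and the stretch `factor` is again a hypothetical parameter:

```python
import numpy as np

def second_formant_factor(a, n, factor=1.1):
    # A21 (stand-in): LPC spectral envelope from prediction coefficients a,
    # normalized so its peak is 1
    env = 1.0 / (np.abs(np.fft.fft(a, n)) + 1e-12)
    env = env / np.max(env)
    # A22: scale the envelope by linear interpolation to obtain f2
    return np.interp(np.arange(n) / factor, np.arange(n), env)

f2 = second_formant_factor(np.array([1.0, -0.5]), n=64)
```

The factor stays positive and bounded by 1, since interpolated values are convex combinations of the normalized envelope samples.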
A3. Correcting the formants of the original voice signal according to the first formant correction factor and the second formant correction factor, to obtain the voice-changed speech signal corresponding to the original voice signal.
In one possible implementation, the first formant correction factor and the second formant correction factor may be fused to obtain a formant fitting factor; the formants of the original voice signal are then corrected with the formant fitting factor to obtain the voice-changed speech signal.
The first formant correction factor and the second formant correction factor can be fused frequency point by frequency point to obtain the formant fitting factor, where f1_K(n) is the first formant correction factor at the n-th frequency point of the K-th speech signal frame, f2_K(n) is the second formant correction factor at the n-th frequency point of the K-th speech signal frame, th is a fusion decision threshold, eps is a division-protection factor (a constant close to 0), and f3_K(n) is the formant fitting factor at the n-th frequency point of the K-th speech signal frame.
After the formant fitting factor is calculated, it can be multiplied with the frequency-domain signal corresponding to the original voice signal to obtain the voice-changed speech signal. The voice-changed speech signal is calculated as:
X'_K(n) = X_K(n) * f3_K(n)
y'_K(n) = IFFT[X'_K(n)]
where y'_K(n) is the voice-changed speech signal frame.
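The frequency-domain multiplication above can be sketched as follows; taking the real part after the IFFT is our assumption, since an arbitrary f3_K(n) need not preserve the conjugate symmetry of the spectrum:

```python
import numpy as np

def apply_formant_factor(frame, f3):
    # X'_K(n) = X_K(n) * f3_K(n);  y'_K(n) = IFFT[X'_K(n)]
    X = np.fft.fft(frame)
    return np.real(np.fft.ifft(X * f3))

frame = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0])
y = apply_formant_factor(frame, np.ones(8))   # a unit factor leaves the frame unchanged
```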
In another possible implementation, the formants of the original voice signal may be corrected with the first formant correction factor and with the second formant correction factor separately, to obtain a first corrected signal and a second corrected signal; the first corrected signal and the second corrected signal are then fused to obtain the voice-changed speech signal corresponding to the original voice signal.
After each speech signal frame of the original voice signal has been processed as above to obtain its voice-changed frame, the voice-changed frames are combined by overlap-add to obtain the voice-changed speech signal corresponding to the original voice signal.
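The overlap-add reassembly can be sketched as follows; this is a generic sketch, and since the patent does not specify a synthesis window, none is applied here:

```python
import numpy as np

def overlap_add(frames, hop):
    # Sum voice-changed frames y'_K(n) back into one signal, each frame
    # advanced by `hop` samples (hop = M - m for an overlap of m samples)
    frame_len = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for k, f in enumerate(frames):
        out[k * hop : k * hop + frame_len] += f
    return out

frames = np.ones((3, 8))
y = overlap_add(frames, hop=4)   # 50% overlap: interior samples are covered twice
```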
In the technical scheme corresponding to fig. 1, after the original voice signal to be processed is obtained, formant cepstrum correction and formant linear prediction coefficient correction are performed on it to obtain the corresponding voice-changed speech signal. Correcting the formants of the original voice signal changes its timbre, achieving the voice-changing effect; and because the formants are corrected from the two aspects of cepstrum information and linear prediction coefficients, the formant structure of the original voice signal can be adjusted more accurately, so that the timbre after voice changing is more natural and realistic and the voice-changing effect is improved.
The method of the present application is described above and the apparatus of the present application is described below.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a voice-changing apparatus provided in an embodiment of the present application. As shown in fig. 2, the voice-changing apparatus 20 includes:
a voice signal acquisition module 201, configured to acquire an original voice signal to be processed;
and the correction module 202 is configured to perform formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal, so as to obtain the voice-changed speech signal corresponding to the original voice signal.
In one possible design, the correction module 202 is specifically configured to: perform cepstrum information transformation on the original voice signal to obtain a first formant correction factor; perform linear prediction coefficient transformation on the original voice signal to obtain a second formant correction factor; and correct the formants of the original voice signal according to the first formant correction factor and the second formant correction factor to obtain the voice-changed speech signal corresponding to the original voice signal.
In one possible design, the correction module 202 is specifically configured to: calculate the logarithmic spectrum of the original voice signal to obtain a first logarithmic spectrum signal; perform scaling processing on the first logarithmic spectrum signal to obtain a second logarithmic spectrum signal; calculate the signal difference between the second logarithmic spectrum signal and the first logarithmic spectrum signal to obtain a difference logarithmic spectrum signal; perform an inverse Fourier transform on the difference logarithmic spectrum signal to obtain a differential cepstrum signal; and determine the first formant correction factor according to the differential cepstrum signal.
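The cepstrum branch described above can be sketched as follows. The warp ratio `alpha`, the low-quefrency cutoff, and the final exponentiation back to a spectral factor are assumptions; the application names the steps (log spectrum, scaling, difference, inverse Fourier transform) but not these details:

```python
import numpy as np

def first_formant_factor(frame, alpha=1.2, cut=30, eps=1e-12):
    """Sketch of the cepstrum branch: log spectrum, frequency-axis
    scaling by interpolation, difference, inverse FFT to a
    differential cepstrum, low-quefrency liftering, then back to a
    spectral-domain correction factor. `alpha` and `cut` are
    illustrative assumptions."""
    spec = np.abs(np.fft.rfft(frame)) + eps
    log1 = np.log(spec)                           # first logarithmic spectrum signal
    n = len(log1)
    # scale the frequency axis by interpolation -> second logarithmic spectrum signal
    log2 = np.interp(np.arange(n) / alpha, np.arange(n), log1)
    diff = log2 - log1                            # difference logarithmic spectrum signal
    cep = np.fft.irfft(diff)                      # differential cepstrum signal
    # keep the low-quefrency part, which carries the formant envelope
    lifter = np.zeros_like(cep)
    lifter[:cut] = 1.0
    lifter[-cut:] = 1.0
    env = np.fft.rfft(cep * lifter).real          # smoothed log-spectral difference
    return np.exp(env)                            # multiplicative spectral factor
```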
In one possible design, the correction module 202 is specifically configured to: perform an interpolation operation on the first logarithmic spectrum signal to obtain the second logarithmic spectrum signal.
In one possible design, the correction module 202 is specifically configured to: calculate the linear prediction normalized envelope coefficient of the original voice signal; and perform scaling processing on the linear prediction normalized envelope coefficient to obtain the second formant correction factor.
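A sketch of this linear prediction branch follows, using the standard autocorrelation method with the Levinson-Durbin recursion to estimate the coefficients; the prediction order, warp ratio, and bin count are illustrative assumptions not fixed by the application:

```python
import numpy as np

def lpc_envelope_factor(frame, order=12, alpha=1.2, n_bins=129):
    """Estimate linear prediction coefficients, take the normalized
    spectral envelope 1/|A(e^jw)|, and scale its frequency axis by
    interpolation to obtain the second formant correction factor.
    `order`, `alpha`, and `n_bins` are assumed parameters."""
    # autocorrelation of the frame (biased estimate)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][: order + 1]
    # Levinson-Durbin recursion for the prediction polynomial A(z)
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / e                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        e *= 1.0 - k * k
    # normalized spectral envelope on n_bins frequencies in [0, pi]
    w = np.linspace(0.0, np.pi, n_bins)
    A = np.exp(-1j * np.outer(w, np.arange(order + 1))) @ a
    env = 1.0 / np.abs(A)
    env /= env.max()                       # normalized envelope coefficient
    # scale the frequency axis by interpolation -> correction factor
    idx = np.arange(n_bins)
    return np.interp(idx / alpha, idx, env)
```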
In one possible design, the correction module 202 is specifically configured to: fuse the first formant correction factor and the second formant correction factor to obtain a formant fitting factor; and correct the formants of the original voice signal by using the formant fitting factor to obtain the voice-changed speech signal.
In one possible design, the correction module 202 is specifically configured to: multiply the formant fitting factor by the frequency domain signal corresponding to the original voice signal to obtain the voice-changed speech signal.
It should be noted that, for details not mentioned in the embodiment corresponding to fig. 2, reference may be made to the foregoing description of the method embodiment, and they are not repeated here.
After the device acquires the original voice signal to be processed, it performs formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain the voice-changed speech signal corresponding to the original voice signal. Correcting the formants of the original voice signal changes its timbre, thereby achieving the voice-changing effect. Because the formants are corrected from two aspects, cepstrum information and linear prediction coefficients, the formant structure of the original voice signal can be adjusted more accurately, so that the timbre of the changed voice is more natural and real and the voice-changing effect is improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device provided in an embodiment of the present application. The computer device 30 includes a processor 301 and a memory 302, the memory 302 being connected to the processor 301, for example via a bus.
The processor 301 is configured to support the computer device 30 to perform the corresponding functions in the method embodiments described above. The processor 301 may be a central processing unit (central processing unit, CPU), a network processor (network processor, NP), a hardware chip or any combination thereof. The hardware chip may be an application specific integrated circuit (application specific integrated circuit, ASIC), a programmable logic device (programmable logic device, PLD), or a combination thereof. The PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), general-purpose array logic (generic array logic, GAL), or any combination thereof.
The memory 302 is used for storing program code and the like. The memory 302 may include volatile memory (VM), such as random access memory (RAM); the memory 302 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD); the memory 302 may also include a combination of the above types of memory.
Optionally, the computer device may also include a microphone or the like.
The processor 301 may call the program code to perform the following operations:
acquiring an original voice signal to be processed;
and performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal.
The present application also provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of the previous embodiments.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments may be performed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing disclosure is only illustrative of the preferred embodiments of the present application and is not intended to limit the scope of the claims; equivalent variations made according to the claims of the present application still fall within the scope of the application.

Claims (11)

1. A method of voice conversion, comprising:
acquiring an original voice signal to be processed;
and performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal.
2. The method of claim 1, wherein the performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal comprises:
performing cepstrum information transformation on the original voice signal to obtain a first formant correction factor;
performing linear prediction coefficient transformation on the original voice signal to obtain a second formant correction factor;
and correcting the formants of the original voice signal according to the first formant correction factor and the second formant correction factor to obtain the voice-changed speech signal corresponding to the original voice signal.
3. The method of claim 2, wherein said performing cepstral information transformation on said original speech signal to obtain a first formant correction factor comprises:
calculating the logarithmic spectrum of the original voice signal to obtain a first logarithmic spectrum signal;
performing scaling processing on the first logarithmic spectrum signal to obtain a second logarithmic spectrum signal;
calculating a signal difference between the second logarithmic spectrum signal and the first logarithmic spectrum signal to obtain a difference logarithmic spectrum signal;
performing inverse Fourier transform on the difference value logarithmic spectrum signal to obtain a differential cepstrum signal;
and determining the first formant correction factor according to the differential cepstrum signal.
4. The method according to claim 3, wherein the performing scaling processing on the first logarithmic spectrum signal to obtain a second logarithmic spectrum signal comprises:
and carrying out interpolation operation on the first logarithmic spectrum signal to obtain the second logarithmic spectrum signal.
5. The method of claim 2, wherein said performing linear prediction coefficient transform on said original speech signal to obtain a second formant correction factor comprises:
calculating a linear prediction normalized envelope coefficient of the original voice signal;
and performing scaling processing on the linear prediction normalized envelope coefficient to obtain the second formant correction factor.
6. The method of claim 5, wherein scaling the linear prediction normalized envelope coefficient to obtain the second formant correction factor comprises:
and carrying out interpolation operation on the linear prediction normalized envelope coefficient to obtain the second formant correction factor.
7. The method according to any one of claims 2-6, wherein the correcting the formants of the original voice signal according to the first formant correction factor and the second formant correction factor to obtain a voice-changed speech signal corresponding to the original voice signal comprises:
fusing the first formant correction factor and the second formant correction factor to obtain a formant fitting factor;
and correcting the formants of the original voice signal by using the formant fitting factor to obtain the voice-changed speech signal.
8. The method of claim 7, wherein the correcting the formants of the original voice signal by using the formant fitting factor to obtain the voice-changed speech signal comprises:
multiplying the formant fitting factor by the frequency domain signal corresponding to the original voice signal to obtain the voice-changed speech signal.
9. A speech sound modification apparatus, comprising:
the voice signal acquisition module is used for acquiring an original voice signal to be processed;
and the correction module is used for performing formant cepstrum correction and formant linear prediction coefficient correction on the original voice signal to obtain a voice-changed speech signal corresponding to the original voice signal.
10. A computer device, comprising a memory and a processor, the processor being configured to execute one or more computer programs stored in the memory, wherein the processor, when executing the one or more computer programs, causes the computer device to implement the method of any one of claims 1-8.
11. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-8.
CN202311447295.2A 2023-11-02 2023-11-02 Voice sound changing method, device, equipment and storage medium Pending CN117524240A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311447295.2A CN117524240A (en) 2023-11-02 2023-11-02 Voice sound changing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117524240A true CN117524240A (en) 2024-02-06

Family

ID=89765446



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination