CN108269579B - Voice data processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN108269579B
CN108269579B (application CN201810049575.0A)
Authority
CN
China
Prior art keywords
target
voice data
frequency domain
domain parameters
midi audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810049575.0A
Other languages
Chinese (zh)
Other versions
CN108269579A (en)
Inventor
卓鹏鹏
张康
方博伟
尤嘉华
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd filed Critical Xiamen Meitu Technology Co Ltd
Priority to CN201810049575.0A priority Critical patent/CN108269579B/en
Publication of CN108269579A publication Critical patent/CN108269579A/en
Application granted
Publication of CN108269579B publication Critical patent/CN108269579B/en
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013 — Adapting to target pitch
    • G10L2021/0135 — Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention provides a voice data processing method and apparatus, an electronic device, and a readable storage medium, in the technical field of data processing. The method obtains initial frequency domain parameters of voice data, then obtains target frequency domain parameters corresponding to a preset target MIDI audio, and modifies the initial frequency domain parameters according to the target frequency domain parameters to obtain tone-modified voice data. The voice in the voice data thus takes on the frequency domain parameters of the target MIDI audio, so that the tone-modified voice data has the pitch characteristics of the target MIDI audio; the tone of the voice data is modified without changing the speed or duration of the voice. Because the phase of the tone-modified voice data is continuous, no noise is introduced and mechanical-sounding artifacts are avoided, giving a better tone-modification result. The method can be applied to pitch correction in songs, conversion of speech to singing voice, and the like, and has broad application prospects in the field of speech processing.

Description

Voice data processing method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a voice data processing method and device, electronic equipment and a readable storage medium.
Background
Voice tone modification changes the pitch of the voice in an audio file by some algorithm without changing its speed; it includes translating (shifting) the pitch and converting the voice to a specific pitch. Existing tone-modification processing suffers from phase discontinuity, which introduces noise.
Disclosure of Invention
In view of the above, the present invention provides a voice data processing method and apparatus, an electronic device, and a readable storage medium, which solve the above problems and keep the phase of the tone-modified voice continuous.
The technical scheme provided by the invention is as follows:
a method of speech data processing, comprising:
acquiring voice data and target MIDI audio, wherein the voice data comprises voice aligned with the target MIDI audio;
obtaining initial frequency domain parameters of the voice data;
obtaining target frequency domain parameters corresponding to preset target MIDI audio, wherein the initial frequency domain parameters comprise initial phases of the voice data, and the target frequency domain parameters comprise target phases corresponding to the target MIDI audio;
and modifying the initial frequency domain parameters according to the target frequency domain parameters, and transforming the pitch in the voice data to a target pitch in the target MIDI audio, to obtain the tone-modified voice data.
Further, the step of obtaining initial frequency domain parameters of the voice data comprises:
acquiring the voice data within the time corresponding to the target pitch;
performing zero-point drift removal and pre-emphasis processing on the voice data within the time corresponding to the target pitch;
and performing time-frequency conversion on the voice data subjected to zero point drift removal and pre-emphasis processing to obtain a frequency domain parameter of each frame of the voice data.
Further, the step of performing time-frequency conversion on the voice data subjected to zero-point drift removal and pre-emphasis processing comprises:
calculating the frame shift of each frame in the voice data;
framing and windowing the voice data according to the frame shift obtained by calculation and a preset window function;
and carrying out Fourier transform on each frame of voice data subjected to framing and windowing to obtain the frequency domain parameter of each frame in the voice data.
Further, the step of calculating the frame shift of each frame in the speech data comprises:
dividing a sampling rate by a target frequency to obtain a frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio, and the target frequency is calculated by adopting the following formula:
F = 110 × 2^((MIDINote − 45) / 12)
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value included in the target MIDI audio.
Further, the target MIDI audio records a target frequency of a sound, and the step of obtaining a target frequency domain parameter corresponding to a preset target MIDI audio includes:
generating a target waveform whose pitch matches the target frequency and whose duration equals that of the voice data corresponding to the target frequency;
extracting a phase value of the target waveform as the target frequency domain parameter;
correspondingly, the step of modifying the frequency domain parameters of the voice data according to the frequency domain parameters of the target MIDI audio comprises:
replacing the phase value of the voice data at the position corresponding to the target waveform in the voice data with the phase value of the target waveform to obtain the frequency domain parameter of the voice data after tone modification;
and performing inverse Fourier transform on the frequency domain parameters of the tone-modified voice data, and processing the result with the overlap-add (OLA) algorithm to obtain the tone-modified voice data.
The present invention also provides a voice data processing apparatus, comprising:
the data acquisition module is used for acquiring voice data and target MIDI audio, wherein the voice data comprises voice aligned with the target MIDI audio;
the voice data processing module is used for obtaining initial frequency domain parameters of the voice data;
a target MIDI audio processing module for obtaining target frequency domain parameters corresponding to preset target MIDI audio, wherein the initial frequency domain parameters comprise initial phases of the voice data, and the target frequency domain parameters comprise target phases corresponding to the target MIDI audio;
and the tone modification module is used for modifying the initial frequency domain parameters according to the target frequency domain parameters, and transforming the pitch in the voice data to a target pitch in the target MIDI audio to obtain the tone-modified voice data.
Further, the method for obtaining the initial frequency domain parameters of the voice data by the voice data processing module includes:
performing zero point drift removal and pre-emphasis processing on the voice data;
and performing time-frequency conversion on the voice data subjected to zero point drift removal and pre-emphasis processing to obtain a frequency domain parameter of each frame of the voice data.
Further, the step of performing time-frequency conversion on the voice data subjected to zero-point drift removal and pre-emphasis processing by the voice data processing module includes:
calculating the frame shift of each frame in the voice data;
framing and windowing the voice data according to the frame shift obtained by calculation and a preset window function;
and carrying out Fourier transform on each frame of voice data subjected to framing and windowing to obtain the frequency domain parameter of each frame in the voice data.
Further, the step of calculating the frame shift of each frame in the voice data by the voice data processing module includes:
dividing a sampling rate by a target frequency to obtain a frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio, and the target frequency is calculated by adopting the following formula:
F = 110 × 2^((MIDINote − 45) / 12)
where F is the target frequency of the target MIDI audio, MIDINote is the pitch value included in the target MIDI audio.
Further, the target MIDI audio records a target frequency of a sound, and the method by which the target MIDI audio processing module obtains the target frequency domain parameter corresponding to the preset target MIDI audio includes:
generating a target waveform whose pitch matches the target frequency and whose duration equals that of the voice data corresponding to the target frequency;
extracting a phase value of the target waveform as the target frequency domain parameter;
correspondingly, the method for modifying the frequency domain parameters of the voice data according to the frequency domain parameters of the target MIDI audio by the transposition module comprises the following steps:
replacing the phase value of the voice data at the position corresponding to the target waveform in the voice data with the phase value of the target waveform to obtain the frequency domain parameter of the voice data after tone modification;
and performing inverse Fourier transform on the frequency domain parameters of the tone-modified voice data, and processing the result with the overlap-add (OLA) algorithm to obtain the tone-modified voice data.
The present invention also provides an electronic device, including: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to:
acquiring voice data and target MIDI audio, wherein the voice data comprises voice aligned with the target MIDI audio;
obtaining initial frequency domain parameters of the voice data;
obtaining target frequency domain parameters corresponding to preset target MIDI audio, wherein the initial frequency domain parameters comprise initial phases of the voice data, and the target frequency domain parameters comprise target phases corresponding to the target MIDI audio;
and modifying the initial frequency domain parameters according to the target frequency domain parameters, and transforming the pitch in the voice data to a target pitch in the target MIDI audio, to obtain the tone-modified voice data.
The invention also provides a readable storage medium comprising a computer program which, when run, controls the electronic device on which the readable storage medium resides to execute the voice data processing method of any one of claims 1-5.
The embodiment of the application enables the voice in the voice data to take on the frequency domain parameters of the target MIDI audio, so that the tone-modified voice data has the pitch characteristics of the target MIDI audio; the tone-modification operation on the voice data is realized without changing the speed or duration of the voice. Because the phase of the tone-modified voice data is continuous, no noise is introduced and mechanical-sounding artifacts are avoided, giving a better tone-modification result. The method can be applied to pitch correction in songs, conversion of speech to singing voice, and the like, and has broad application prospects in the field of speech processing.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; those skilled in the art can derive other related drawings from these drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a voice data processing method according to an embodiment of the present invention.
Fig. 3 is a flowchart illustrating the sub-step of step S102 in a speech data processing method according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating the sub-step of step S103 in the speech data processing method according to the embodiment of the present invention.
Fig. 5 is a functional block diagram of a voice data processing apparatus according to an embodiment of the present invention.
Icon: 100-an electronic device; 111-a memory; 112-a memory controller; 113-a processor; 300-a voice data processing device; 310-a data acquisition module; 320-a voice data processing module; 330-target MIDI audio processing module; 340-tone changing module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Existing pitch-changing methods fall mainly into two categories. One is time domain interpolation and splicing, such as synchronized overlap-add, fixed synthesis (SOLA-FS); the other is frequency domain processing, often referred to as a phase vocoder. Time domain methods require little computation and produce natural-sounding results, but the splicing introduces phase discontinuities, which generate noise. Frequency domain methods require time-frequency conversion, phase estimation, and the like, which demand heavy computation, and the pitch-modified voice contains mechanical-sounding artifacts.
Fig. 1 is a block diagram of an electronic device 100 according to a preferred embodiment of the invention. The electronic device 100 may include a voice data processing apparatus 300, a memory 111, a storage controller 112, and a processor 113.
The memory 111, the memory controller 112 and the processor 113 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The voice data processing apparatus 300 may include at least one software functional module which may be stored in the memory 111 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 113 is used for executing executable modules stored in the memory 111, such as software functional modules and computer programs included in the voice data processing apparatus 300.
The Memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 is used for storing a program, and the processor 113 executes the program after receiving an execution instruction. Access to the memory 111 by the processor 113, and possibly by other components, may be under the control of the memory controller 112.
The processor 113 may be an integrated circuit chip having signal processing capabilities. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The embodiment of the present application provides a voice data processing method, which can implement tonal modification of voice data, and can be applied to the electronic device 100, as shown in fig. 2, where the method includes the following steps.
Step S101, acquiring voice data and a target MIDI audio, where the voice data includes a voice aligned with the target MIDI audio.
Step S102, obtaining initial frequency domain parameters of the voice data.
The voice data in the embodiment of the present application may be a piece of speech or a piece of song; the embodiment does not limit the duration or content of the voice data, which may be selected according to actual needs. In the embodiment of the present application, the tone of the voice in the voice data is modified by processing the voice data; the initial frequency domain parameters may be calculated for every frame of the voice data, or only for the frames that need tone modification.
Tone modification in the embodiment of the present application refers to changing the pitch of the sound in the voice data, i.e. shifting the pitch of a given frame of voice to a desired pitch.
As shown in fig. 3, the step of obtaining initial frequency domain parameters of the speech data may include the following sub-steps.
And a substep S1021, performing zero point drift elimination and pre-emphasis processing on the voice data.
And a substep S1022, performing time-frequency conversion on the voice data subjected to the zero point drift removal and pre-emphasis processing to obtain a frequency domain parameter of each frame of the voice data.
Voice data may contain zero-point (DC) drift, which is corrected by removing it. Voice data is also affected by lip radiation; pre-emphasis boosts the high-frequency part of the voice, removing the influence of lip radiation and increasing the high-frequency resolution of the voice. Zero-drift removal and pre-emphasis can be calculated using the following equations.
x(n)=x(n)-mean_x
Wherein x (n) is a sampling value corresponding to the nth point, and is an output value after zero point drift is removed, and mean _ x is a mean value of time domain amplitude of the speech segment obtained by calculation.
The pre-emphasis can be implemented by a first order FIR high pass filter. The specific calculation formula is as follows.
y(n)=x(n)-ax(n-1)
Wherein y (n) is the output after preprocessing, x (n) is the audio without preprocessing, a is the pre-emphasis coefficient, generally 0.9-1.0, and optionally, a is 0.98.
The time-frequency conversion of the voice data subjected to zero-point drift removal and pre-emphasis processing can be performed in the following three steps.
First, the frame shift of each frame in the voice data is calculated.
Then, the voice data is framed and windowed according to the calculated frame shift and a preset window function.
Finally, Fourier transform is performed on each framed and windowed frame of voice data to obtain the frequency domain parameter of each frame in the voice data.
The frame shift of each frame in the voice data may be obtained by dividing the sampling rate by the target frequency, where the target frequency is the frequency of the target MIDI audio, calculated with the following formula:
F = 110 × 2^((MIDINote − 45) / 12)
where F is the frequency corresponding to the pitch, and MIDINote is the pitch value included in the target MIDI audio file. The value 110 may be replaced by 220 to raise the result by one octave.
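As a sketch (assuming the standard MIDI tuning in which note 45, A2, corresponds to 110 Hz), the note-to-frequency formula and the frame-shift rule above can be written as:

```python
def midi_note_to_frequency(midi_note, base=110.0):
    """F = base * 2**((midi_note - 45) / 12); base=110 Hz places MIDI note 45
    (A2) at the reference; replacing base with 220 raises the result an octave."""
    return base * 2.0 ** ((midi_note - 45) / 12.0)

def frame_shift(sample_rate, midi_note):
    """Frame shift = sampling rate divided by the target frequency."""
    return sample_rate / midi_note_to_frequency(midi_note)
```

For example, at a 44.1 kHz sampling rate, MIDI note 45 gives a frame shift of 44100/110 samples.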
A speech signal changes over time, but the state of the articulatory organs changes much more slowly than the sound vibrates. A speech signal can therefore be considered stationary over a very short interval, i.e. short-time stationary, so the speech can be framed and then analyzed. The frame length is generally 10-30 milliseconds, with overlap between frames. Windowing serves two main purposes: first, it makes the signal more continuous globally, avoiding the Gibbs phenomenon; second, it gives the originally aperiodic speech signal some characteristics of a periodic function. Windowing is performed with a window function; several common window functions are listed below.
The rectangular window function is as follows:
w(n) = 1, 0 ≤ n ≤ N − 1; w(n) = 0 otherwise
the Hamming window function is as follows:
w(n) = 0.54 − 0.46 · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
the Hanning Window function is as follows:
w(n) = 0.5 − 0.5 · cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
where N is the window length. Windowing of the voice data is performed with these window functions.
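The framing, windowing, and Fourier transform steps can be sketched as follows (an illustrative NumPy version, not the patent's code; the Hanning window is written out per the formula above, and the function names are hypothetical):

```python
import numpy as np

def hanning_window(length):
    """w(n) = 0.5 - 0.5*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    n = np.arange(length)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (length - 1))

def stft_frames(x, frame_len, hop):
    """Frame the signal with the given frame shift (hop), window each frame,
    and FFT it; returns one complex half-spectrum per frame."""
    window = hanning_window(frame_len)
    spectra = []
    for start in range(0, len(x) - frame_len + 1, hop):
        spectra.append(np.fft.rfft(x[start:start + frame_len] * window))
    return np.array(spectra)
```

With a 64-sample signal, a 16-sample frame, and a hop of 8, this yields 7 frames of 9 complex bins each.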
The initial frequency domain parameters of the voice data can be obtained through the method.
Step S103, obtaining target frequency domain parameters corresponding to the preset target MIDI audio.
The target MIDI audio in the embodiment of the present application may include the pitch information to which the voice data is to be shifted, and may have a duration equal to that of the voice data; it serves as the reference for the tone modification of the voice data. It will be appreciated that the voice data requiring tone modification may be determined first and the target MIDI audio serving as the basis for the modification determined afterward; alternatively, the target MIDI audio may be determined first and voice data of equal duration selected according to its duration.
In this embodiment, the target MIDI audio may be a file in MIDI (Musical Instrument Digital Interface) format, which records, along a time axis, pitch information at different time points, the durations of different pitches, and the start and stop times of different pitches. By reading the pitch information of the target MIDI audio, the pitch to which the voice data needs to be shifted can be determined, and the frequencies corresponding to the different pitches follow from the pitch-to-frequency conversion relationship.
It will be appreciated that in obtaining the speech data and the target MIDI audio, the start position of the transposition required and the corresponding target pitch to which the transposition is required may be determined first.
In detail, as shown in fig. 4, target frequency domain parameters of target MIDI audio may be determined by the following sub-steps.
And a substep S1031 of generating a target waveform having the same pitch as the target frequency and having the same duration as the voice data corresponding to the target frequency.
The frequency of the target waveform is the same as the frequency of a preset target frequency in the target MIDI audio, and the duration of the target waveform is equal to the duration of voice data corresponding to the preset target frequency.
As mentioned above, the target MIDI audio includes different pitch information, and the frequencies corresponding to different pitches can be determined according to the conversion relationship between the pitches and the frequencies, and these frequencies are the preset target frequencies included in the target MIDI audio. The frequency of the generated target waveform is the same as the frequency of the preset target frequency in the target MIDI audio, a plurality of preset target frequencies may be included in one target MIDI audio, target waveforms corresponding to the plurality of preset target frequencies may be generated, respectively, and the durations of the target waveforms are equal to the durations of voices at corresponding positions in the voice data, respectively.
The target waveform may be chosen according to actual needs; for example, a sine wave or a deformation of a sine wave may be generated as the target waveform, because the vibration of the human vocal cords directly produces nearly sinusoidal sound, and the vocal-cord vibration during speech resembles a sine-type waveform. When performing the tone-modification operation on all of the voice data, a targeted waveform can be selected for the voice at different time points: a sine wave can be chosen as the target waveform at every time point, or different target waveforms can be generated for the voice data at different time points. Different target waveforms correspond to different timbres, and therefore to different listening experiences.
In detail, the target waveform may be generated by the following method.
First, the number of sampling points in one period of the target waveform at the target pitch is obtained, calculated by the following formula.
Len=Fs/F
Where Len is the number of sampling points corresponding to one period of the target waveform, Fs is the sampling rate, and F is the target frequency.
Then, the sampling interval is calculated.
delta1=(4*π)/Len
delta2=(2*π)/Len
Then the sample values of the different target waveforms are calculated. Reference timbre 1 can be expressed as:
y[n]=(sin(-3*π+n*delta1))/(-3*π+n*delta1)
the reference timbre 2 can be expressed as:
y[n]=(sin(n*delta2)+abs(sin(n*delta2))*alpha)/(1+alpha)
where y holds the sample values of one period of the waveform, n is the sample index (0 ≤ n < Len), abs() takes the absolute value, and 0 < alpha < 1. Repeating the one-period data yields waveform sample data of the same length as the target voice.
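The period generation and tiling described above can be sketched as follows (illustrative Python; the guard against division by zero in the timbre-1 sinc-like expression is an addition of this sketch, since the formula is undefined where −3π + n·delta1 = 0):

```python
import numpy as np

def waveform_period(fs, f, timbre=2, alpha=0.5):
    """One period (Len = Fs/F samples) of a reference-timbre waveform."""
    length = int(round(fs / f))                        # Len = Fs / F
    n = np.arange(length)
    if timbre == 1:
        t = -3 * np.pi + n * (4 * np.pi / length)      # delta1 = 4*pi/Len
        # sin(t)/t, with the removable singularity at t == 0 set to 1
        y = np.where(t == 0, 1.0, np.sin(t) / np.where(t == 0, 1.0, t))
    else:
        s = np.sin(n * (2 * np.pi / length))           # delta2 = 2*pi/Len
        y = (s + np.abs(s) * alpha) / (1 + alpha)      # 0 < alpha < 1
    return y

def target_waveform(fs, f, num_samples, **kwargs):
    """Repeat the one-period data until it matches the target voice length."""
    period = waveform_period(fs, f, **kwargs)
    reps = -(-num_samples // len(period))              # ceiling division
    return np.tile(period, reps)[:num_samples]
```

For instance, at fs = 8000 Hz and f = 100 Hz, one period is 80 samples, tiled to the voice segment's length.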
And a substep S1032 of extracting a phase value of the target waveform.
After the corresponding target waveform is generated, it may first be framed and windowed so that its frame length matches that of the voice data, and then short-time Fourier transformed; the phase value of each transformed frame of the target waveform is extracted as the target frequency domain parameter of the target MIDI audio.
And step S104, modifying the initial frequency domain parameters according to the target frequency domain parameters, and transforming the pitch in the voice data to a target pitch in the target MIDI audio to obtain the tone-modified voice data.
After the target frequency domain parameters are obtained through the above steps, they can replace the initial frequency domain parameters of the voice data, thereby modifying the initial parameters. Specifically, the initial phase of the voice data is replaced with the phase value of the corresponding target waveform. Since voice data contains both unvoiced and voiced sound, and unvoiced sound has no periodicity, replacing the initial phase of unvoiced frames as well would degrade the tone-modified result. In the embodiment of the present application, the phase value may therefore be replaced only for frames corresponding to voiced sound; the phase of unvoiced sound is not replaced, and the voice data corresponding to unvoiced sound keeps its original phase value.
In detail, the phase value of the voice data at the position corresponding to the target waveform may be replaced with the phase value of the target waveform, so as to obtain the frequency domain parameters of the tone-modified voice data.
An inverse Fourier transform is then performed on the frequency domain parameters of the tone-modified voice data, and the result is processed with an overlap-add (OLA) algorithm to obtain the tone-modified voice data, which can then be output, stored, and so on.
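The voiced-only phase replacement followed by inverse FFT and OLA resynthesis might look like the sketch below (the voiced/unvoiced mask is assumed to come from a separate detector not detailed here, and the Hann window with squared-window normalization is an illustrative choice, not specified in the patent):

```python
import numpy as np

def resynthesize(mag_frames, phase_frames, target_phases, voiced,
                 frame_len, hop):
    """Replace the phase of voiced frames with the target-waveform
    phase, keep the original phase for unvoiced frames, then
    inverse-FFT each frame and overlap-add (OLA) back to a signal."""
    n = len(mag_frames)
    win = np.hanning(frame_len)
    out = np.zeros((n - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i in range(n):
        # voiced frames take the target phase; unvoiced keep their own
        phase = target_phases[i] if voiced[i] else phase_frames[i]
        spec = mag_frames[i] * np.exp(1j * phase)
        frame = np.fft.irfft(spec, frame_len) * win
        out[i * hop : i * hop + frame_len] += frame
        norm[i * hop : i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)  # undo window overlap gain
```

With unchanged phases this pipeline reconstructs the analysed signal, which is a useful sanity check before enabling the phase substitution.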
An embodiment of the present application further provides a voice data processing apparatus 300, as shown in fig. 5, including:
a data obtaining module 310, configured to obtain voice data and target MIDI audio, where the voice data includes voice aligned with the target MIDI audio;
a voice data processing module 320, configured to obtain initial frequency domain parameters of the voice data;
a target MIDI audio processing module 330, configured to obtain target frequency domain parameters corresponding to a preset target MIDI audio, where the initial frequency domain parameters include an initial phase of the voice data, and the target frequency domain parameters include a target phase corresponding to the target MIDI audio;
and the transposition module 340 is configured to modify the initial frequency domain parameter according to the target frequency domain parameter, and transform a pitch in the voice data to a target pitch in the target MIDI audio to obtain transposed voice data.
It is understood that the method for the voice data processing module 320 to obtain the initial frequency domain parameters of the voice data includes:
performing zero point drift removal and pre-emphasis processing on the voice data;
and performing time-frequency conversion on the voice data subjected to zero point drift removal and pre-emphasis processing to obtain a frequency domain parameter of each frame of the voice data.
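The two preprocessing steps above can be sketched minimally as follows (the pre-emphasis coefficient 0.97 is a conventional choice assumed here, not given in the text; removing the mean is one simple way to remove zero-point drift):

```python
import numpy as np

def preprocess(x, coeff=0.97):
    """Remove zero-point (DC) drift by subtracting the mean, then
    apply first-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    x = x - np.mean(x)                 # de-drift
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - coeff * x[:-1]     # boost high frequencies
    return y
```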
In this embodiment, the step of performing time-frequency conversion, by the voice data processing module 320, on the voice data subjected to the zero point drift removal and pre-emphasis processing includes:
calculating the frame shift of each frame in the voice data;
framing and windowing the voice data according to the frame shift obtained by calculation and a preset window function;
and carrying out Fourier transform on each frame of voice data subjected to framing and windowing to obtain the frequency domain parameter of each frame in the voice data.
In this embodiment, the step of the voice data processing module 320 calculating the frame shift of each frame in the voice data includes:
dividing a sampling rate by a target frequency to obtain a frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio, and the target frequency is calculated by adopting the following formula:
F = 440 x 2^((MIDINote - 69) / 12)
where F is the target frequency of the target MIDI audio, MIDINote is the pitch value included in the target MIDI audio.
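The role described for the formula, with F the target frequency and MIDINote the pitch value, corresponds to the standard MIDI-note-to-frequency conversion (A4 = note 69 = 440 Hz). A sketch of it together with the frame-shift computation described above (function names are illustrative):

```python
import math

def midi_to_freq(midi_note):
    """Standard MIDI-note-to-frequency conversion, A4 (note 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def frame_shift(sample_rate, midi_note):
    """Frame shift = sampling rate divided by the target frequency,
    i.e. the number of samples in one pitch period."""
    return sample_rate / midi_to_freq(midi_note)
```

For instance, at a 44100 Hz sampling rate and MIDI note 69 (440 Hz), the frame shift is about 100 samples per pitch period.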
In this embodiment, a target frequency of the sound is recorded in the target MIDI audio, and the method by which the transposition module 340 modifies the frequency domain parameters of the voice data according to the frequency domain parameters of the target MIDI audio includes:
generating a target waveform having the same pitch as the target frequency and the same duration as the voice data corresponding to the target frequency;
extracting phase values of the target waveform;
replacing the phase value of the voice data at the position corresponding to the target waveform in the voice data with the phase value of the target waveform to obtain the frequency domain parameter of the voice data after tone modification;
and carrying out an inverse Fourier transform on the frequency domain parameters of the tone-modified voice data, and processing the result through an overlap-add (OLA) algorithm to obtain the tone-modified voice data.
In the embodiment of the application, a target waveform corresponding to the voice data is generated from the target MIDI audio based on the pitch information it contains, and the phase value of the target waveform is used to replace the phase value of the voice in the voice data. The frequency domain parameters of the voice data are thereby modified into the frequency domain parameters corresponding to the target MIDI audio, so that the voice data takes on the pitch characteristics of the target MIDI audio, realizing tone modification of the voice data. Because the phase value of the voice data is replaced rather than set to zero, phase discontinuity and mechanical sound can be avoided while the tone modification is achieved. Meanwhile, since the target waveform supplies the replacement phase values, the tone-modified voice data can take on the sound characteristics of the target waveform, giving the modified voice the timbre of the target waveform.
In summary, by modifying the frequency domain parameters of the voice data with the frequency domain parameters of the target MIDI audio, the voice in the voice data takes on the frequency domain parameters, and hence the pitch characteristics, of the target MIDI audio. Tone modification of the voice data is thus realized without changing the speed or duration of the voice. The phase of the tone-modified voice data is continuous, so no noise appears and mechanical sound is avoided, giving a better tone-modification effect. The method can be applied to correcting pitch in songs, converting speech into singing, and the like, and has good application prospects in the field of speech processing.
The method is obtained by improving the traditional zero-phase-based pitch-shifting algorithm: by adding the phase values of a waveform of the same frequency, the phase-discontinuity and mechanical-sound problems are alleviated. Meanwhile, the added waveform contributes some timbre information to the original voice, so adding different waveforms yields different tone-modification results, increasing the diversity of tone modification. In application, each user can obtain a personalized tone-modification result by selecting a waveform, which gives the method a good practical background. Compared with the traditional zero-phase-based method, the mechanical-sound problem is better mitigated; compared with the traditional time-domain methods, the phase continuity is obviously improved.
The method provided by the embodiment of the application can be combined with a voice speed-changing method and, together with mixing techniques, can automatically synthesize singing by combining the tone-modified dry voice with background music. Because the tone-modification algorithm supports personalization, personalized singing-voice synthesis can be realized. Different singing-voice synthesis outputs can be controlled through different added waveforms, and since the waveform is user-selectable, users can choose different effects according to their own preferences, improving the practicability of the method.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A method for processing voice data, comprising:
acquiring voice data and target MIDI audio, wherein the voice data comprises voice aligned with the target MIDI audio;
obtaining initial frequency domain parameters of the voice data;
obtaining target frequency domain parameters corresponding to preset target MIDI audio, wherein the initial frequency domain parameters comprise initial phases of the voice data, and the target frequency domain parameters comprise target phases corresponding to the target MIDI audio;
modifying the initial frequency domain parameters according to the target frequency domain parameters, and transforming the pitch in the voice data to a target pitch in the target MIDI audio to obtain the modulated voice data;
the step of modifying the initial frequency domain parameters according to the target frequency domain parameters comprises:
replacing an initial phase of voiced speech in the speech data with a phase value of a corresponding target waveform.
2. The speech data processing method of claim 1, wherein the step of obtaining initial frequency domain parameters of the speech data comprises:
acquiring voice data in the voice data at the time corresponding to the target pitch;
performing zero point drift removal and pre-emphasis processing on the voice data within the time corresponding to the target pitch;
and performing time-frequency conversion on the voice data subjected to zero point drift removal and pre-emphasis processing to obtain a frequency domain parameter of each frame of the voice data.
3. The speech data processing method according to claim 2, wherein the step of performing time-frequency conversion on the speech data subjected to the zero point drift elimination and pre-emphasis processing comprises:
calculating the frame shift of each frame in the voice data;
framing and windowing the voice data according to the frame shift obtained by calculation and a preset window function;
and carrying out Fourier transform on each frame of voice data subjected to framing and windowing to obtain the frequency domain parameter of each frame in the voice data.
4. The method of claim 3, wherein the step of calculating the frame shift for each frame in the speech data comprises:
dividing a sampling rate by a target frequency to obtain a frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio, and the target frequency is calculated by adopting the following formula:
F = 440 x 2^((MIDINote - 69) / 12)
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value included in the target MIDI audio.
5. The method of claim 1, wherein the target MIDI audio is recorded with a target frequency of a sound, and the step of obtaining target frequency domain parameters corresponding to a preset target MIDI audio comprises:
generating a target waveform having the same pitch as the target frequency and the same duration as the voice data corresponding to the target frequency;
extracting a phase value of the target waveform as the target frequency domain parameter;
correspondingly, the step of modifying the frequency domain parameters of the voice data according to the frequency domain parameters of the target MIDI audio comprises:
replacing the phase value of the voice data at the position corresponding to the target waveform in the voice data with the phase value of the target waveform to obtain the frequency domain parameter of the voice data after tone modification;
and carrying out an inverse Fourier transform on the frequency domain parameters of the tone-modified voice data, and processing the result through an overlap-add (OLA) algorithm to obtain the tone-modified voice data.
6. A speech data processing apparatus, comprising:
the data acquisition module is used for acquiring voice data and target MIDI audio, wherein the voice data comprises voice aligned with the target MIDI audio;
the voice data processing module is used for obtaining initial frequency domain parameters of the voice data;
a target MIDI audio processing module for obtaining target frequency domain parameters corresponding to preset target MIDI audio, wherein the initial frequency domain parameters comprise initial phases of the voice data, and the target frequency domain parameters comprise target phases corresponding to the target MIDI audio;
a tone modification module, configured to modify the initial frequency domain parameter according to the target frequency domain parameter, and transform a pitch in the voice data to a target pitch in the target MIDI audio, to obtain tone-modified voice data;
the transposition module is further used for replacing the initial phase of the voiced sound in the voice data with the phase value of the corresponding target waveform.
7. The apparatus according to claim 6, wherein the means for obtaining the initial frequency-domain parameters of the speech data by the speech data processing module comprises:
performing zero point drift removal and pre-emphasis processing on the voice data;
and performing time-frequency conversion on the voice data subjected to zero point drift removal and pre-emphasis processing to obtain a frequency domain parameter of each frame of the voice data.
8. The apparatus as claimed in claim 7, wherein the step of performing time-frequency conversion, by the voice data processing module, on the voice data subjected to the zero point drift removal and pre-emphasis processing comprises:
calculating the frame shift of each frame in the voice data;
framing and windowing the voice data according to the frame shift obtained by calculation and a preset window function;
and carrying out Fourier transform on each frame of voice data subjected to framing and windowing to obtain the frequency domain parameter of each frame in the voice data.
9. The speech data processing device of claim 7, wherein the step of the speech data processing module calculating a frame shift for each frame of the speech data comprises:
dividing a sampling rate by a target frequency to obtain a frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio, and the target frequency is calculated by adopting the following formula:
F = 440 x 2^((MIDINote - 69) / 12)
where F is the target frequency of the target MIDI audio, MIDINote is the pitch value included in the target MIDI audio.
10. The apparatus of claim 6, wherein the target MIDI audio is recorded with a target frequency of a sound, and the method for obtaining the target frequency domain parameters corresponding to the preset target MIDI audio by the target MIDI audio processing module comprises:
generating a target waveform having the same pitch as the target frequency and the same duration as the voice data corresponding to the target frequency;
extracting a phase value of the target waveform as the target frequency domain parameter;
correspondingly, the method for modifying the frequency domain parameters of the voice data according to the frequency domain parameters of the target MIDI audio by the transposition module comprises the following steps:
replacing the phase value of the voice data at the position corresponding to the target waveform in the voice data with the phase value of the target waveform to obtain the frequency domain parameter of the voice data after tone modification;
and carrying out an inverse Fourier transform on the frequency domain parameters of the tone-modified voice data, and processing the result through an overlap-add (OLA) algorithm to obtain the tone-modified voice data.
11. An electronic device, characterized in that the electronic device comprises: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to:
acquiring voice data and target MIDI audio, wherein the voice data comprises voice aligned with the target MIDI audio;
obtaining initial frequency domain parameters of the voice data;
obtaining target frequency domain parameters corresponding to preset target MIDI audio, wherein the initial frequency domain parameters comprise initial phases of the voice data, and the target frequency domain parameters comprise target phases corresponding to the target MIDI audio;
modifying the initial frequency domain parameters according to the target frequency domain parameters, and transforming the pitch in the voice data to a target pitch in the target MIDI audio to obtain the modulated voice data;
the step of modifying the initial frequency domain parameters according to the target frequency domain parameters comprises:
replacing an initial phase of voiced speech in the speech data with a phase value of a corresponding target waveform.
12. A readable storage medium comprising a computer program, wherein the computer program controls an electronic device where the readable storage medium is located to execute the voice data processing method according to any one of claims 1 to 5 when the computer program runs.
CN201810049575.0A 2018-01-18 2018-01-18 Voice data processing method and device, electronic equipment and readable storage medium Active CN108269579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810049575.0A CN108269579B (en) 2018-01-18 2018-01-18 Voice data processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810049575.0A CN108269579B (en) 2018-01-18 2018-01-18 Voice data processing method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN108269579A CN108269579A (en) 2018-07-10
CN108269579B true CN108269579B (en) 2020-11-10

Family

ID=62776086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810049575.0A Active CN108269579B (en) 2018-01-18 2018-01-18 Voice data processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN108269579B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697985B (en) * 2018-12-25 2021-06-29 广州市百果园信息技术有限公司 Voice signal processing method and device and terminal
CN111739544B (en) * 2019-03-25 2023-10-20 Oppo广东移动通信有限公司 Voice processing method, device, electronic equipment and storage medium
CN112309425A (en) * 2020-10-14 2021-02-02 浙江大华技术股份有限公司 Sound tone changing method, electronic equipment and computer readable storage medium
CN112420062A (en) * 2020-11-18 2021-02-26 腾讯音乐娱乐科技(深圳)有限公司 Audio signal processing method and device
CN114449339B (en) * 2022-02-16 2024-04-12 深圳万兴软件有限公司 Background sound effect conversion method and device, computer equipment and storage medium

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1164084A (en) * 1995-12-28 1997-11-05 日本胜利株式会社 Sound pitch converting apparatus
CN1283060A (en) * 1999-07-28 2001-02-07 雅马哈株式会社 Pronounciation control device and terminal device and system used on carried pronounciation control device
CN1473325A (en) * 2001-08-31 2004-02-04 Pitch waveform signal generation apparatus, pitch waveform signal generation method, and program
CN101015451A (en) * 2007-02-13 2007-08-15 电子科技大学 Music brain electricity analytical method
CN101267686A (en) * 2007-03-12 2008-09-17 雅马哈株式会社 Speaker array apparatus and signal processing method therefor
CN101354889A (en) * 2008-09-18 2009-01-28 北京中星微电子有限公司 Method and apparatus for tonal modification of voice
CN101652807A (en) * 2007-02-01 2010-02-17 缪斯亚米有限公司 Music transcription
CN1831940B (en) * 2006-04-07 2010-06-23 安凯(广州)微电子技术有限公司 Tune and rhythm quickly regulating method based on audio-frequency decoder
CN101894563A (en) * 2010-07-15 2010-11-24 瑞声声学科技(深圳)有限公司 Voice enhancing method
CN102870153A (en) * 2010-02-26 2013-01-09 弗兰霍菲尔运输应用研究公司 Apparatus and method for modifying an audio signal using harmonic locking
CN103514883A (en) * 2013-09-26 2014-01-15 华南理工大学 Method for achieving self-adaptive switching of male voice and female voice
CN104409073A (en) * 2014-11-04 2015-03-11 贵阳供电局 Substation equipment sound and voice identification method
CN104780091A (en) * 2014-01-13 2015-07-15 北京发现角科技有限公司 Instant messaging method and instant messaging system with speech and audio processing function
CN105654941A (en) * 2016-01-20 2016-06-08 华南理工大学 Voice change method and device based on specific target person voice change ratio parameter
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106228973A (en) * 2016-07-21 2016-12-14 福州大学 Stablize the music voice modified tone method of tone color
CN106297770A (en) * 2016-08-04 2017-01-04 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106328111A (en) * 2016-08-22 2017-01-11 广州酷狗计算机科技有限公司 Audio processing method and audio processing device
CN107170464A (en) * 2017-05-25 2017-09-15 厦门美图之家科技有限公司 A kind of changing speed of sound method and computing device based on music rhythm

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1199711A1 (en) * 2000-10-20 2002-04-24 Telefonaktiebolaget Lm Ericsson Encoding of audio signal using bandwidth expansion
US8600737B2 (en) * 2010-06-01 2013-12-03 Qualcomm Incorporated Systems, methods, apparatus, and computer program products for wideband speech coding
KR101466843B1 (en) * 2010-11-02 2014-11-28 에스케이텔레콤 주식회사 System and method for improving sound quality in data delivery communication by means of transform of audio signal, apparatus applied to the same
US8923829B2 (en) * 2012-12-28 2014-12-30 Verizon Patent And Licensing Inc. Filtering and enhancement of voice calls in a telecommunications network
CN104599677B (en) * 2014-12-29 2018-03-09 中国科学院上海高等研究院 Transient noise suppressing method based on speech reconstructing
EP3113175A1 (en) * 2015-07-02 2017-01-04 Thomson Licensing Method for converting text to individual speech, and apparatus for converting text to individual speech

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1164084A (en) * 1995-12-28 1997-11-05 日本胜利株式会社 Sound pitch converting apparatus
CN1135531C (en) * 1995-12-28 2004-01-21 日本胜利株式会社 Sound pitch converting apparatus
CN1283060A (en) * 1999-07-28 2001-02-07 雅马哈株式会社 Pronounciation control device and terminal device and system used on carried pronounciation control device
CN1473325A (en) * 2001-08-31 2004-02-04 Pitch waveform signal generation apparatus, pitch waveform signal generation method, and program
CN1831940B (en) * 2006-04-07 2010-06-23 安凯(广州)微电子技术有限公司 Tune and rhythm quickly regulating method based on audio-frequency decoder
CN101652807A (en) * 2007-02-01 2010-02-17 缪斯亚米有限公司 Music transcription
CN101015451A (en) * 2007-02-13 2007-08-15 电子科技大学 Music brain electricity analytical method
CN101267686A (en) * 2007-03-12 2008-09-17 雅马哈株式会社 Speaker array apparatus and signal processing method therefor
CN101354889A (en) * 2008-09-18 2009-01-28 北京中星微电子有限公司 Method and apparatus for tonal modification of voice
CN102870153A (en) * 2010-02-26 2013-01-09 弗兰霍菲尔运输应用研究公司 Apparatus and method for modifying an audio signal using harmonic locking
CN101894563A (en) * 2010-07-15 2010-11-24 瑞声声学科技(深圳)有限公司 Voice enhancing method
CN103514883A (en) * 2013-09-26 2014-01-15 华南理工大学 Method for achieving self-adaptive switching of male voice and female voice
CN103514883B (en) * 2013-09-26 2015-12-02 华南理工大学 A kind of self-adaptation realizes men and women's sound changing method
CN104780091A (en) * 2014-01-13 2015-07-15 北京发现角科技有限公司 Instant messaging method and instant messaging system with speech and audio processing function
CN104409073A (en) * 2014-11-04 2015-03-11 贵阳供电局 Substation equipment sound and voice identification method
CN105654941A (en) * 2016-01-20 2016-06-08 华南理工大学 Voice change method and device based on specific target person voice change ratio parameter
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106228973A (en) * 2016-07-21 2016-12-14 福州大学 Stablize the music voice modified tone method of tone color
CN106297770A (en) * 2016-08-04 2017-01-04 杭州电子科技大学 The natural environment sound identification method extracted based on time-frequency domain statistical nature
CN106328111A (en) * 2016-08-22 2017-01-11 广州酷狗计算机科技有限公司 Audio processing method and audio processing device
CN107170464A (en) * 2017-05-25 2017-09-15 厦门美图之家科技有限公司 A kind of changing speed of sound method and computing device based on music rhythm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research on an Audio Time-Scale Modification Algorithm with an Improved Phase Vocoder"; Wang Shinong et al.; Computer Engineering and Applications; Dec. 31, 2012; pp. 155-159 *

Also Published As

Publication number Publication date
CN108269579A (en) 2018-07-10

Similar Documents

Publication Publication Date Title
CN108269579B (en) Voice data processing method and device, electronic equipment and readable storage medium
JP5425952B2 (en) Apparatus and method for operating audio signal having instantaneous event
KR101492702B1 (en) Apparatus and method for modifying an audio signal using harmonic locking
US8706496B2 (en) Audio signal transforming by utilizing a computational cost function
Amatriain et al. Spectral processing
JP6791258B2 (en) Speech synthesis method, speech synthesizer and program
EP2401740A1 (en) Apparatus and method for determining a plurality of local center of gravity frequencies of a spectrum of an audio signal
Välimäki et al. Creating endless sounds
Ottosen et al. A phase vocoder based on nonstationary Gabor frames
JP2018077283A (en) Speech synthesis method
JP7359164B2 (en) Sound signal synthesis method and neural network training method
Verfaille et al. Adaptive digital audio effects
CN113257211B (en) Audio adjusting method, medium, device and computing equipment
US10319353B2 (en) Method for audio sample playback using mapped impulse responses
Rai et al. Analysis of three pitch-shifting algorithms for different musical instruments
Royer Pitch-shifting algorithm design and applications in music
JP6834370B2 (en) Speech synthesis method
Zivanovic Harmonic bandwidth companding for separation of overlapping harmonics in pitched signals
JP4468506B2 (en) Voice data creation device and voice quality conversion method
JP2000010597A (en) Speech transforming device and method therefor
JP4419486B2 (en) Speech analysis generation apparatus and program
JP2018077280A (en) Speech synthesis method
Cheng Design of a pitch quantization and pitch correction system for real-time music effects signal processing
CA2821035A1 (en) Device and method for manipulating an audio signal having a transient event
Esquef et al. Spectral-based analysis and synthesis of audio signals

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant