CN108269579A - Voice data processing method and device, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- CN108269579A CN108269579A CN201810049575.0A CN201810049575A CN108269579A CN 108269579 A CN108269579 A CN 108269579A CN 201810049575 A CN201810049575 A CN 201810049575A CN 108269579 A CN108269579 A CN 108269579A
- Authority
- CN
- China
- Prior art keywords
- voice data
- target
- frequency
- audios
- domain parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The present invention provides a voice data processing method and device, an electronic device, and a computer-readable storage medium, relating to the technical field of data processing. The method obtains the initial frequency-domain parameters of voice data, obtains target frequency-domain parameters corresponding to a preset target MIDI audio, and then modifies the initial frequency-domain parameters according to the target frequency-domain parameters to obtain pitch-shifted voice data. The speech in the voice data thereby takes on the frequency-domain parameters of the target MIDI audio, so that the pitch-shifted voice data has the pitch parameters of the target MIDI audio. This achieves a pitch-shifting operation on the voice data without changing its speech rate or duration. The phase of the pitch-shifted voice data is continuous, so no noise appears, mechanical-sounding artifacts are avoided, and the pitch-shifting effect is better. The method can be applied to pitch correction in songs, conversion of speech into song, and the like, and has a high application prospect in the field of acoustic processing.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a voice data processing method and device, an electronic device, and a computer-readable storage medium.
Background technology
Pitch shifting of speech changes a speaker's intonation through an algorithm without changing the speech rate of the audio file, and includes shifting intonation as well as shifting speech to a specific pitch. Existing pitch-shifting processing suffers from phase discontinuities and introduces noise.
Invention content
In view of this, the present invention provides a voice data processing method and device, an electronic device, and a computer-readable storage medium, which can solve the above problems and keep the phase of the pitch-shifted voice continuous.
The technical solution provided by the invention is as follows:
A voice data processing method, including:
obtaining voice data and a target MIDI audio, the voice data including speech aligned with the target MIDI audio;
obtaining initial frequency-domain parameters of the voice data;
obtaining target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters include an initial phase of the voice data, and the target frequency-domain parameters include a target phase corresponding to the target MIDI audio;
modifying the initial frequency-domain parameters according to the target frequency-domain parameters, so as to shift the pitch in the voice data to a target pitch in the target MIDI audio and obtain pitch-shifted voice data.
Further, the step of obtaining the initial frequency-domain parameters of the voice data includes:
obtaining time-domain voice data corresponding to the target pitch in the voice data;
performing zero-offset removal and pre-emphasis processing on the time-domain voice data corresponding to the target pitch;
performing a time-frequency transform on the voice data after zero-offset removal and pre-emphasis processing, to obtain the frequency-domain parameters of each frame of the voice data.
Further, the step of performing a time-frequency transform on the voice data after zero-offset removal and pre-emphasis processing includes:
calculating a frame shift for each frame of the voice data;
performing framing and windowing on the voice data according to the calculated frame shifts and a preset window function;
performing a Fourier transform on each frame of the framed and windowed voice data, to obtain the frequency-domain parameters of each frame of the voice data.
Further, the step of calculating the frame shift of each frame of the voice data includes:
obtaining the frame shift of each frame by dividing the sample rate by the target frequency, wherein the target frequency is the frequency of the target MIDI audio and is calculated by the following formula:
F = 110 × 2^((MIDINote − 45) / 12)
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value contained in the target MIDI audio.
Further, the target MIDI audio records the target frequency of a sound, and the step of obtaining the target frequency-domain parameters corresponding to the preset target MIDI audio includes:
generating a target waveform with the same pitch as the target frequency and the same duration as the voice data corresponding to the target frequency;
extracting the phase values of the target waveform as the target frequency-domain parameters.
Correspondingly, the step of modifying the frequency-domain parameters of the voice data according to the frequency-domain parameters of the target MIDI audio includes:
replacing the phase values of the voice data at the positions corresponding to the target waveform with the phase values of the target waveform, to obtain the frequency-domain parameters of the pitch-shifted voice data;
performing an inverse Fourier transform on the frequency-domain parameters of the pitch-shifted voice data, and obtaining the pitch-shifted voice data after overlap-add (OLA) processing.
The present invention also provides a voice data processing device, including:
a data acquisition module, configured to obtain voice data and a target MIDI audio, the voice data including speech aligned with the target MIDI audio;
a voice data processing module, configured to obtain the initial frequency-domain parameters of the voice data;
a target MIDI audio processing module, configured to obtain target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters include an initial phase of the voice data, and the target frequency-domain parameters include a target phase corresponding to the target MIDI audio;
a pitch-shifting module, configured to modify the initial frequency-domain parameters according to the target frequency-domain parameters, so as to shift the pitch in the voice data to a target pitch in the target MIDI audio and obtain pitch-shifted voice data.
Further, the method by which the voice data processing module obtains the initial frequency-domain parameters of the voice data includes:
performing zero-offset removal and pre-emphasis processing on the voice data;
performing a time-frequency transform on the voice data after zero-offset removal and pre-emphasis processing, to obtain the frequency-domain parameters of each frame of the voice data.
Further, the step by which the voice data processing module performs a time-frequency transform on the voice data after zero-offset removal and pre-emphasis processing includes:
calculating a frame shift for each frame of the voice data;
performing framing and windowing on the voice data according to the calculated frame shifts and a preset window function;
performing a Fourier transform on each frame of the framed and windowed voice data, to obtain the frequency-domain parameters of each frame of the voice data.
Further, the step by which the voice data processing module calculates the frame shift of each frame of the voice data includes:
obtaining the frame shift of each frame by dividing the sample rate by the target frequency, wherein the target frequency is the frequency of the target MIDI audio and is calculated by the following formula:
F = 110 × 2^((MIDINote − 45) / 12)
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value contained in the target MIDI audio.
Further, the target MIDI audio records the target frequency of a sound, and the method by which the target MIDI audio processing module obtains the target frequency-domain parameters corresponding to the preset target MIDI audio includes:
generating a target waveform with the same pitch as the target frequency and the same duration as the voice data corresponding to the target frequency;
extracting the phase values of the target waveform as the target frequency-domain parameters.
Correspondingly, the method by which the pitch-shifting module modifies the frequency-domain parameters of the voice data according to the frequency-domain parameters of the target MIDI audio includes:
replacing the phase values of the voice data at the positions corresponding to the target waveform with the phase values of the target waveform, to obtain the frequency-domain parameters of the pitch-shifted voice data;
performing an inverse Fourier transform on the frequency-domain parameters of the pitch-shifted voice data, and obtaining the pitch-shifted voice data after overlap-add (OLA) processing.
The present invention also provides an electronic device, including a processor and a memory coupled to the processor, the memory storing instructions which, when executed by the processor, cause the electronic device to perform the following operations:
obtaining voice data and a target MIDI audio, the voice data including speech aligned with the target MIDI audio;
obtaining initial frequency-domain parameters of the voice data;
obtaining target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters include an initial phase of the voice data, and the target frequency-domain parameters include a target phase corresponding to the target MIDI audio;
modifying the initial frequency-domain parameters according to the target frequency-domain parameters, so as to shift the pitch in the voice data to a target pitch in the target MIDI audio and obtain pitch-shifted voice data.
The present invention also provides a readable storage medium including a computer program which, when run, controls the electronic device on which the readable storage medium resides to perform the voice data processing method of any one of claims 1 to 5.
The embodiments of the present application give the speech in the voice data the frequency-domain parameters of the target MIDI audio, so that the pitch-shifted voice data has the pitch parameters of the target MIDI audio, thereby achieving a pitch-shifting operation on the voice data without changing its speech rate or duration. The phase of the pitch-shifted voice data is continuous, so no noise appears, mechanical-sounding artifacts are avoided, and the pitch-shifting effect is better. The method can be applied to pitch correction in songs, conversion of speech into song, and the like, and has a high application prospect in the field of acoustic processing.
To make the above objects, features, and advantages of the present invention clearer and more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and are therefore not to be construed as limiting its scope. For those of ordinary skill in the art, other relevant drawings can be obtained from these drawings without creative effort.
Fig. 1 is a block diagram of an electronic device provided by an embodiment of the present invention.
Fig. 2 is a flow diagram of a voice data processing method provided by an embodiment of the present invention.
Fig. 3 is a flow diagram of the sub-steps of step S102 in a voice data processing method provided by an embodiment of the present invention.
Fig. 4 is a flow diagram of the sub-steps of step S103 in a voice data processing method provided by an embodiment of the present invention.
Fig. 5 is a functional block diagram of a voice data processing device provided by an embodiment of the present invention.
Reference numerals: 100 - electronic device; 111 - memory; 112 - storage controller; 113 - processor; 300 - voice data processing device; 310 - data acquisition module; 320 - voice data processing module; 330 - target MIDI audio processing module; 340 - pitch-shifting module.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and illustrated in the drawings can generally be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings. In addition, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
Existing pitch-shifting methods fall mainly into two broad classes. One class consists of time-domain interpolation and splicing methods, such as synchronized overlap-add, fixed synthesis (SOLA-FS); the other is frequency-domain processing, commonly referred to as the phase vocoder. The advantage of time-domain processing is that the amount of computation is small and the naturalness of the pitch-shifted result is good, but splicing introduces phase discontinuities, which generate noise. Frequency-domain methods require time-frequency transforms, phase estimation, and the like, which demand a larger amount of computation, and the pitch-shifted voice may sound mechanical.
Referring to Fig. 1, a block diagram of an electronic device 100 provided by a preferred embodiment of the present invention is shown. The electronic device 100 may include a voice data processing device 300, a memory 111, a storage controller 112, and a processor 113.
The memory 111, the storage controller 112, and the processor 113 are electrically connected to one another, directly or indirectly, to realize data transmission or interaction. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The voice data processing device 300 may include at least one software function module that can be stored in the memory 111 in the form of software or firmware, or solidified in the operating system (OS) of the electronic device 100. The processor 113 is configured to execute executable modules stored in the memory 111, such as the software function modules and computer programs included in the voice data processing device 300.
The memory 111 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc. The memory 111 is used to store programs, and the processor 113 executes the programs after receiving an execution instruction. Access to the memory 111 by the processor 113 and other possible components may be performed under the control of the storage controller 112.
The processor 113 may be an integrated circuit chip with signal processing capability. The processor 113 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component, which can implement or perform the methods, steps, and logic diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
An embodiment of the present application provides a voice data processing method that can pitch-shift voice data and can be applied to the above electronic device 100. As shown in Fig. 2, the method includes the following steps.
Step S101: obtain voice data and a target MIDI audio, the voice data including speech aligned with the target MIDI audio.
Step S102: obtain the initial frequency-domain parameters of the voice data.
The voice data in the embodiments of the present application may be a segment of speech or a segment of a song; the embodiments of the present application do not limit the duration or content of the voice data, which can be determined according to actual needs. The embodiments of the present application pitch-shift the speech in the voice data by processing the voice data. The calculated initial frequency-domain parameters may be those of every frame of the voice data, or only the frequency-domain parameters of the frames to be pitch-shifted may be calculated. In the embodiments of the present application, the initial frequency-domain parameters of the voice data may include the phase and amplitude of the sound in the voice data.
Pitch shifting in the embodiments of the present application refers to changing the pitch of the sound in the voice data, i.e., changing the pitch of a given frame of speech to the desired pitch.
As shown in Fig. 3, the step of obtaining the initial frequency-domain parameters of the voice data may include the following sub-steps.
Sub-step S1021: perform zero-offset removal and pre-emphasis processing on the voice data.
Sub-step S1022: perform a time-frequency transform on the voice data after zero-offset removal and pre-emphasis processing, to obtain the frequency-domain parameters of each frame of the voice data.
Since voice data may have a zero offset, removing the zero offset improves the signal. At the same time, voice data is affected by lip radiation; pre-emphasis boosts the high-frequency portion of the speech, removes the influence of lip radiation, and increases the high-frequency resolution of the speech. Zero-offset removal and pre-emphasis may be calculated by the following formulas.
x'(n) = x(n) − mean_x
where x(n) is the sampled value at the n-th point, x'(n) is the output value after zero-offset removal, and mean_x is the calculated mean of the time-domain amplitude of this segment of speech.
Pre-emphasis can be implemented by a first-order FIR high-pass filter. The formula is as follows.
y(n) = x(n) − a·x(n−1)
where y(n) is the pre-processed output, x(n) is the unprocessed audio, and a is the pre-emphasis factor, generally 0.9 to 1.0; optionally, a is 0.98.
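The two pre-processing formulas above can be sketched in Python as follows (a minimal illustration, not the patent's code; the test segment, its 8 kHz rate, and the 0.5 DC offset are made up):

```python
import math

def remove_zero_offset(x):
    # x'(n) = x(n) - mean_x: subtract the mean time-domain amplitude.
    mean_x = sum(x) / len(x)
    return [s - mean_x for s in x]

def pre_emphasis(x, a=0.98):
    # y(n) = x(n) - a * x(n-1): first-order FIR high-pass filter.
    # The first sample has no predecessor, so it is passed through unchanged.
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

# A short 220 Hz tone at 8 kHz with an artificial DC offset of 0.5.
segment = [0.5 + 0.1 * math.sin(2 * math.pi * 220 * n / 8000) for n in range(64)]
centered = remove_zero_offset(segment)
emphasized = pre_emphasis(centered)
```

After `remove_zero_offset`, the mean of the segment is exactly zero regardless of the original offset, which is the point of the step.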
The method of performing a time-frequency transform on the voice data after zero-offset removal and pre-emphasis processing may include the following three steps.
First, calculate the frame shift of each frame of the voice data.
Next, perform framing and windowing on the voice data according to the calculated frame shifts and a preset window function.
Then, perform a Fourier transform on each frame of the framed and windowed voice data, to obtain the frequency-domain parameters of each frame of the voice data.
To calculate the frame shift of each frame of the voice data, the frame shift of each frame can be obtained by dividing the sample rate by the target frequency, where the target frequency is the frequency of the target MIDI audio, calculated by the following formula:
F = 110 × 2^((MIDINote − 45) / 12)
where F is the frequency corresponding to the pitch, and MIDINote is the pitch value contained in the target MIDI audio file. To raise the result by one octave, the value 110 can be replaced with 220.
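Under this reading of the formula (the equation image is not reproduced in this text; the base value 110 and the octave substitution are taken from the surrounding description), the note-to-frequency conversion and the frame-shift calculation can be sketched as:

```python
def midi_to_freq(midi_note, base=110.0):
    # F = base * 2**((MIDINote - 45) / 12); base 110 corresponds to A2,
    # and substituting 220 raises the result by one octave.
    return base * 2.0 ** ((midi_note - 45) / 12.0)

def frame_shift(sample_rate, midi_note):
    # Frame shift = sample rate / target frequency, i.e. one pitch period
    # of the target note, expressed in samples.
    return sample_rate / midi_to_freq(midi_note)

f = midi_to_freq(69)            # MIDI note 69 (A4) -> 440.0 Hz
shift = frame_shift(44100, 69)  # about 100.23 samples per period
```

A sanity check on A4 confirms the base: 110 × 2^((69−45)/12) = 110 × 4 = 440 Hz.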
A speech signal varies over time, but the state of the vocal organs changes much more slowly than the acoustic vibration. A speech signal can therefore be considered stationary over a short period of time, i.e., short-term stationary. This allows the speech to be framed and then analyzed. A typical frame length is 10 to 30 milliseconds, with overlap between adjacent frames. Windowing serves two main purposes: first, it makes the signal more continuous overall and avoids the Gibbs effect; second, it makes an otherwise aperiodic speech signal exhibit some of the features of a periodic function. A window function may be used for windowing; several window functions are given below.
The rectangular window function is as follows:
w(n) = 1, 0 ≤ n ≤ N − 1
The Hamming window function is as follows:
w(n) = 0.54 − 0.46·cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
The Hanning window function is as follows:
w(n) = 0.5·(1 − cos(2πn / (N − 1))), 0 ≤ n ≤ N − 1
where N is the window length. Windowing of the voice data is performed with one of the above window functions.
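The three window functions can be sketched as follows (standard textbook definitions, used here because the formula images are not reproduced in the text; the window length is made up):

```python
import math

def rectangular(N):
    # w(n) = 1 for 0 <= n <= N-1.
    return [1.0] * N

def hamming(N):
    # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def hanning(N):
    # w(n) = 0.5*(1 - cos(2*pi*n/(N-1))).
    return [0.5 * (1 - math.cos(2 * math.pi * n / (N - 1))) for n in range(N)]

def window_frame(frame, window):
    # Pointwise multiplication of one speech frame by the window.
    return [s * w for s, w in zip(frame, window)]

w = hamming(8)  # endpoints taper to 0.08 rather than 0, unlike the Hanning window
```

The Hamming window's non-zero endpoints are the practical difference from the Hanning window; the rectangular window does no tapering at all.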
The initial frequency-domain parameters of the voice data can be obtained by the above method.
Step S103: obtain target frequency-domain parameters corresponding to the preset target MIDI audio.
The target MIDI audio in the embodiments of the present application may contain the pitch information to which the voice data is to be shifted. The target MIDI audio may be a segment of data of the same duration as the voice data, and serves as the reference for pitch-shifting the voice data. It is understood that the voice data to be pitch-shifted may be determined first, and the target MIDI audio serving as the pitch-shifting reference determined afterwards; alternatively, the target MIDI audio may be determined first, and voice data of the same duration selected according to the duration of the target MIDI audio.
In the embodiments of the present application, the target MIDI audio may be a file in MIDI (Musical Instrument Digital Interface) format, which contains pitch information at different time points: the durations of the different pitches on a time basis, and the start and end time points of the different pitches. By determining the pitch information of the target MIDI audio, the pitches to which the voice data is to be shifted can be determined; and according to the conversion relationship between pitch and frequency, the frequencies corresponding to the different pitches can be determined.
It is understood that in the process of obtaining the voice data and the target MIDI audio, the start position to be pitch-shifted and the corresponding target pitch to be shifted to may be determined first.
In detail, as shown in Fig. 4, the target frequency-domain parameters of the target MIDI audio can be determined by the following sub-steps.
Sub-step S1031: generate a target waveform with the same pitch as the target frequency and the same duration as the voice data corresponding to the target frequency.
The frequency of the target waveform is identical to the preset target frequency in the target MIDI audio, and the duration of the target waveform is equal to the duration of the voice data corresponding to the preset target frequency.
As mentioned above, the target MIDI audio contains different pitch information. Through the conversion relationship between pitch and frequency, the frequency corresponding to each pitch can be determined; these frequencies are the preset target frequencies contained in the target MIDI audio. The frequency of the generated target waveform is identical to the preset target frequency in the target MIDI audio. A target MIDI audio may contain multiple preset target frequencies, and target waveforms corresponding to the multiple preset target frequencies can be generated respectively, the duration of each target waveform being equal to the duration of the speech at the corresponding position in the voice data.
The target waveform can be determined according to actual needs; for example, a sine wave or a deformation of a sine wave can be generated as the target waveform, since the human vocal cords directly generate a sine-wave-like sound, and the vibration of the vocal cords when speaking is similar to the waveform of a stringed instrument. When pitch-shifting all of the voice data, a targeted waveform can be selected for the speech at different time points: a sine wave can be selected as the target waveform for pitch shifting at all time points, or different target waveforms can be generated for the voice data at different time points. Different target waveforms correspond to different timbres, so the human auditory impression also differs.
In detail, the target waveform can be generated by the following method.
First, the number of sampling points corresponding to one period of the target waveform at the target pitch is obtained, calculated by the following equation:
Len = Fs / F
where Len is the number of sampling points in one period of the target waveform, Fs is the sample rate, and F is the target frequency.
Then, the sampling intervals are calculated:
delta1 = (4*π) / Len
delta2 = (2*π) / Len
Next, the sampled values of the different target waveforms are calculated. Reference timbre 1 can be expressed as:
y[n] = (sin(-3*π + n*delta1)) / (-3*π + n*delta1)
Reference timbre 2 can be expressed as:
y[n] = (sin(n*delta2) + abs(sin(n*delta2)) * alpha) / (1 + alpha)
where y holds all the sampled values of one period of the waveform, n is the sample index with 0 ≤ n < Len, abs() takes the absolute value, and alpha is greater than 0 and less than 1. By repeating the data of one period multiple times, waveform sample values of the same length as the target voice are obtained.
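The period-length and timbre formulas above can be sketched in Python with NumPy. The sample rate, frequency, and length below are illustrative, and `np.sinc` is used to handle the sample where the timbre-1 denominator is zero:

```python
import numpy as np

def make_target_waveform(fs, f, total_len, timbre=1, alpha=0.5):
    """Generate one period of the target waveform at frequency f and
    tile it to total_len samples, following the formulas above."""
    period = int(fs / f)                # Len = Fs / F, samples per period
    n = np.arange(period)
    if timbre == 1:                     # reference timbre 1 (sinc-like)
        delta1 = 4 * np.pi / period
        x = -3 * np.pi + n * delta1
        # sin(x)/x with the x = 0 singularity handled:
        # np.sinc(t) = sin(pi*t)/(pi*t), so np.sinc(x/pi) = sin(x)/x
        y = np.sinc(x / np.pi)
    else:                               # reference timbre 2, 0 < alpha < 1
        delta2 = 2 * np.pi / period
        s = np.sin(n * delta2)
        y = (s + np.abs(s) * alpha) / (1 + alpha)
    reps = int(np.ceil(total_len / period))
    return np.tile(y, reps)[:total_len]  # repeat the period, trim to length

wave = make_target_waveform(fs=16000, f=440.0, total_len=8000)
print(wave.shape)   # (8000,)
```

Tiling one period, as the text describes, keeps the waveform exactly periodic at the target frequency for the full duration of the corresponding voice segment.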
Sub-step S1032: extract the phase values of the target waveform.
After the corresponding target waveform is generated, framing and windowing can first be applied to the target waveform so that its frame length is consistent with the frame length of the voice data; a short-time Fourier transform is then performed, and the phase values corresponding to each transformed frame of the target waveform are extracted as the target frequency-domain parameters of the target MIDI audio.
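Sub-step S1032 can be sketched as follows. The Hann window and the frame length and shift values are illustrative assumptions, since the text only requires the frame layout to match that of the voice data:

```python
import numpy as np

def frame_phases(signal, frame_len, hop):
    """Frame, window, FFT each frame, and return the per-frame phase values."""
    win = np.hanning(frame_len)                 # preset window function (assumed Hann)
    n_frames = 1 + (len(signal) - frame_len) // hop
    phases = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * win  # framing + windowing
        spec = np.fft.rfft(frame)               # short-time Fourier transform
        phases[i] = np.angle(spec)              # phase of each frequency bin
    return phases

t = np.arange(16000) / 16000.0
target_wave = np.sin(2 * np.pi * 440.0 * t)    # stand-in target waveform
ph = frame_phases(target_wave, frame_len=512, hop=128)
print(ph.shape)
```

The resulting array of per-frame, per-bin phases is what the later phase-replacement step consumes.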
Step S104: modify the initial frequency-domain parameters according to the target frequency-domain parameters, transforming the pitch in the voice data to the target pitch in the target MIDI audio to obtain the pitch-shifted voice data.
After the target frequency-domain parameters have been obtained through the steps above, the initial frequency-domain parameters of the voice data can be replaced with the target frequency-domain parameters, realizing the modification of the initial frequency-domain parameters. Specifically, the initial phase of the voice data is replaced with the phase values of the corresponding target waveform. Voice data contains both unvoiced and voiced sounds, and unvoiced sounds are not periodic; if the initial phase of the unvoiced sounds were also replaced, the pitch-shifted result would be degraded. The phase replacement in this embodiment therefore leaves the phase values of unvoiced sounds unreplaced and is applied only to frames corresponding to voiced sounds; the voice data corresponding to unvoiced sounds keeps its original phase values.
In detail, the phase values of the voice data at the positions corresponding to the target waveform can first be replaced with the phase values of the target waveform, yielding the frequency-domain parameters of the pitch-shifted voice data.
An inverse Fourier transform is then applied to the frequency-domain parameters of the pitch-shifted voice data, and the pitch-shifted voice data is obtained after processing with the OLA (Overlap-and-Add) algorithm. The pitch-shifted voice data can then be output, saved, or otherwise processed.
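The phase replacement and overlap-add resynthesis can be sketched as below. This is a simplified illustration that assumes the voice-frame magnitude spectra and the target-waveform phase spectra are already available; the voiced/unvoiced distinction is omitted, and the window-squared normalization is a common OLA detail not spelled out in the text:

```python
import numpy as np

def resynthesize(magnitudes, target_phases, frame_len, hop):
    """Combine voice-frame magnitudes with target-waveform phases,
    inverse-FFT each frame, and overlap-add (OLA) the results."""
    n_frames = magnitudes.shape[0]
    win = np.hanning(frame_len)
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        spec = magnitudes[i] * np.exp(1j * target_phases[i])  # replace the phase
        frame = np.fft.irfft(spec, n=frame_len)               # inverse Fourier transform
        out[i*hop : i*hop + frame_len] += frame * win          # overlap-and-add
        norm[i*hop : i*hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)                        # window normalization

mags = np.abs(np.random.randn(10, 129))        # toy magnitude spectra
phs = np.random.uniform(-np.pi, np.pi, (10, 129))
y = resynthesize(mags, phs, frame_len=256, hop=64)
print(y.shape)
```

Keeping the voice magnitudes while substituting the target phases is what preserves the spectral envelope of the speech while imposing the target waveform's periodicity.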
The embodiment of the present application further provides a voice data processing apparatus 300, as shown in FIG. 5, including:
a data acquisition module 310, configured to obtain voice data and target MIDI audio, the voice data including voice aligned with the target MIDI audio;
a voice data processing module 320, configured to obtain the initial frequency-domain parameters of the voice data;
a target MIDI audio processing module 330, configured to obtain target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters include the initial phase of the voice data, and the target frequency-domain parameters include the target phase corresponding to the target MIDI audio;
a pitch-shifting module 340, configured to modify the initial frequency-domain parameters according to the target frequency-domain parameters, transforming the pitch in the voice data to the target pitch in the target MIDI audio to obtain the pitch-shifted voice data.
It can be understood that the method by which the voice data processing module 320 obtains the initial frequency-domain parameters of the voice data includes:
removing the DC offset from the voice data and applying pre-emphasis;
performing a time-frequency transform on the voice data after DC-offset removal and pre-emphasis to obtain the frequency-domain parameters of each frame of the voice data.
In this embodiment, the step in which the voice data processing module 320 performs the time-frequency transform on the voice data after DC-offset removal and pre-emphasis includes:
calculating the frame shift of each frame in the voice data;
framing and windowing the voice data according to the calculated frame shift and a preset window function;
applying a Fourier transform to each framed and windowed frame of voice data to obtain the frequency-domain parameters of each frame in the voice data.
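The pre-processing and time-frequency steps above can be sketched as follows. The pre-emphasis coefficient 0.97 and the Hann window are conventional choices assumed here, not values specified by the source:

```python
import numpy as np

def preprocess(x, coeff=0.97):
    """Remove the DC (null) offset, then apply the pre-emphasis
    filter y[n] = x[n] - coeff * x[n-1]."""
    x = x - np.mean(x)                               # DC-offset removal
    return np.append(x[0], x[1:] - coeff * x[:-1])   # pre-emphasis

def stft_frames(x, frame_len, hop):
    """Frame, window, and FFT: the time-frequency transform."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.fft.rfft(x[i*hop : i*hop + frame_len] * win)
                     for i in range(n_frames)])

x = np.sin(2*np.pi*220*np.arange(4000)/16000) + 0.1  # toy signal with a DC offset
spec = stft_frames(preprocess(x), frame_len=400, hop=100)
print(spec.shape)
```

Each row of `spec` holds the complex frequency-domain parameters of one frame, from which the initial magnitudes and phases can be read.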
In this embodiment, the step in which the voice data processing module 320 calculates the frame shift of each frame in the voice data includes:
dividing the sample rate by the target frequency to obtain the frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio and is calculated using the following formula:
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value contained in the target MIDI audio.
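The formula itself is not reproduced in this text, but the description matches the standard equal-temperament MIDI-note-to-frequency conversion, F = 440 · 2^((MIDINote − 69)/12); the sketch below assumes that formula:

```python
import math

def midi_to_freq(midi_note):
    """Standard equal-temperament conversion (an assumption here; the
    patent's own formula appears only as an image in the original)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def frame_shift(sample_rate, midi_note):
    """Frame shift = sample rate / target frequency, as described above."""
    return int(sample_rate / midi_to_freq(midi_note))

print(midi_to_freq(69))        # 440.0 (A4)
print(frame_shift(16000, 69))  # 36
```

Tying the frame shift to one period of the target frequency keeps each analysis hop aligned with the pitch period being synthesized.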
In this embodiment, the target MIDI audio records the target frequency of the sound, and the method by which the pitch-shifting module 340 modifies the frequency-domain parameters of the voice data according to the frequency-domain parameters of the target MIDI audio includes:
generating a target waveform whose pitch is identical to the target frequency and whose duration equals that of the voice data corresponding to the target frequency;
extracting the phase values of the target waveform;
replacing the phase values of the voice data at the positions corresponding to the target waveform with the phase values of the target waveform, obtaining the frequency-domain parameters of the pitch-shifted voice data;
applying an inverse Fourier transform to the frequency-domain parameters of the pitch-shifted voice data, and obtaining the pitch-shifted voice data after OLA (overlap-add) processing.
In the embodiment of the present application, a target waveform corresponding to the voice data is generated according to the target MIDI audio, the target waveform being generated from the pitch information contained in the target MIDI audio, and the phase values of the target waveform then replace the phase values of the voice in the voice data. The frequency-domain parameters of the voice data are thereby modified into frequency-domain parameters corresponding to the target MIDI audio, giving the voice data the pitch parameters of the target MIDI audio and realizing the pitch-shifting of the voice data. By replacing the phase values rather than setting the phase of the voice data to zero, the embodiment of the present application realizes pitch-shifting while avoiding phase discontinuities and a mechanical-sounding result. At the same time, because the target waveform is used to replace the phase values of the voice data, the pitch-shifted voice data carries the sound characteristics of the target waveform, giving the pitch-shifted voice the timbre properties of the target waveform.
In conclusion modify by using the frequency domain parameter of target MIDI audios to the frequency domain parameter of voice data,
It can make the voice in voice data that there is the frequency domain parameter of target MIDI audios, allow the voice data after modified tone that there is mesh
Mark the pitch parameters of MIDI audios, realize the modified tone operation to voice data, can realize do not change in voice data word speed and
In the case of voice duration, modify tone to voice data.The Phase Continuation of voice data after modified tone is not in noise,
Mechanical sound can be avoided the occurrence of simultaneously, and modified tone effect is more preferable.The amendment of pitch or voice be can be applied in song to song
Conversion etc., it is with high application prospect in acoustic processing field.
This method improves on the traditional zero-phase pitch-shifting algorithm: by adding the phase values of a waveform at the same frequency, it remedies the phase discontinuity and the mechanical sound. At the same time, the added waveform superimposes some timbre information on the original sound, so different pitch-shifted results can be obtained by adding different waveforms, increasing the diversity of the pitch-shifting. By letting the user freely choose the waveform, the application allows each user to obtain a personalized pitch-shifted result, which gives the method good practical prospects. Compared with the traditional zero-phase-based method, this method clearly improves the mechanical-sound problem; compared with traditional time-domain methods, it also offers a more noticeable improvement in phase continuity.
The method provided by the embodiments of the present application can also be combined with a speed-changing method, and can be combined with mixing technology to automatically synthesize a song from the pitch-shifted dry vocal and the backing music. Since the pitch-shifting algorithm in this method supports personalization, personalized song synthesis can be achieved. Different added waveforms can be used to control the synthesis of different song outputs, and since the waveform is user-selectable, users can choose different effects according to their own preferences, increasing the practicality of the method.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of the apparatuses, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated to form an independent part, the modules may exist separately, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can readily occur to those familiar with the technical field, within the technical scope disclosed by the present invention, shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (12)
1. A voice data processing method, characterized by comprising:
obtaining voice data and target MIDI audio, the voice data comprising voice aligned with the target MIDI audio;
obtaining initial frequency-domain parameters of the voice data;
obtaining target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters comprise an initial phase of the voice data, and the target frequency-domain parameters comprise a target phase corresponding to the target MIDI audio;
modifying the initial frequency-domain parameters according to the target frequency-domain parameters, transforming the pitch in the voice data to the target pitch in the target MIDI audio, and obtaining pitch-shifted voice data.
2. The voice data processing method according to claim 1, wherein the step of obtaining the initial frequency-domain parameters of the voice data comprises:
obtaining, in the voice data, time-domain voice data corresponding to the target pitch;
performing DC-offset removal and pre-emphasis on the time-domain voice data corresponding to the target pitch;
performing a time-frequency transform on the voice data after DC-offset removal and pre-emphasis to obtain frequency-domain parameters of each frame of the voice data.
3. The voice data processing method according to claim 2, wherein the step of performing the time-frequency transform on the voice data after DC-offset removal and pre-emphasis comprises:
calculating a frame shift of each frame in the voice data;
framing and windowing the voice data according to the calculated frame shift and a preset window function;
applying a Fourier transform to each framed and windowed frame of voice data to obtain the frequency-domain parameters of each frame in the voice data.
4. The voice data processing method according to claim 3, wherein the step of calculating the frame shift of each frame in the voice data comprises:
dividing the sample rate by a target frequency to obtain the frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio and is calculated using the following formula:
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value contained in the target MIDI audio.
5. The voice data processing method according to claim 1, wherein the target MIDI audio records a target frequency of the sound, and the step of obtaining the target frequency-domain parameters corresponding to the preset target MIDI audio comprises:
generating a target waveform whose pitch is identical to the target frequency and whose duration equals that of the voice data corresponding to the target frequency;
extracting phase values of the target waveform as the target frequency-domain parameters;
correspondingly, the step of modifying the frequency-domain parameters of the voice data according to the frequency-domain parameters of the target MIDI audio comprises:
replacing the phase values of the voice data at positions corresponding to the target waveform with the phase values of the target waveform, obtaining frequency-domain parameters of the pitch-shifted voice data;
applying an inverse Fourier transform to the frequency-domain parameters of the pitch-shifted voice data, and obtaining the pitch-shifted voice data after OLA (overlap-add) processing.
6. A voice data processing apparatus, characterized by comprising:
a data acquisition module, configured to obtain voice data and target MIDI audio, the voice data comprising voice aligned with the target MIDI audio;
a voice data processing module, configured to obtain initial frequency-domain parameters of the voice data;
a target MIDI audio processing module, configured to obtain target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters comprise an initial phase of the voice data, and the target frequency-domain parameters comprise a target phase corresponding to the target MIDI audio;
a pitch-shifting module, configured to modify the initial frequency-domain parameters according to the target frequency-domain parameters, transforming the pitch in the voice data to the target pitch in the target MIDI audio and obtaining pitch-shifted voice data.
7. The voice data processing apparatus according to claim 6, wherein the method by which the voice data processing module obtains the initial frequency-domain parameters of the voice data comprises:
performing DC-offset removal and pre-emphasis on the voice data;
performing a time-frequency transform on the voice data after DC-offset removal and pre-emphasis to obtain frequency-domain parameters of each frame of the voice data.
8. The voice data processing apparatus according to claim 7, wherein the step in which the voice data processing module performs the time-frequency transform on the voice data after DC-offset removal and pre-emphasis comprises:
calculating a frame shift of each frame in the voice data;
framing and windowing the voice data according to the calculated frame shift and a preset window function;
applying a Fourier transform to each framed and windowed frame of voice data to obtain the frequency-domain parameters of each frame in the voice data.
9. The voice data processing apparatus according to claim 7, wherein the step in which the voice data processing module calculates the frame shift of each frame in the voice data comprises:
dividing the sample rate by a target frequency to obtain the frame shift of each frame, wherein the target frequency is the frequency of the target MIDI audio and is calculated using the following formula:
where F is the target frequency of the target MIDI audio and MIDINote is the pitch value contained in the target MIDI audio.
10. The voice data processing apparatus according to claim 6, wherein the target MIDI audio records a target frequency of the sound, and the method by which the target MIDI audio processing module obtains the target frequency-domain parameters corresponding to the preset target MIDI audio comprises:
generating a target waveform whose pitch is identical to the target frequency and whose duration equals that of the voice data corresponding to the target frequency;
extracting phase values of the target waveform as the target frequency-domain parameters;
correspondingly, the method by which the pitch-shifting module modifies the frequency-domain parameters of the voice data according to the frequency-domain parameters of the target MIDI audio comprises:
replacing the phase values of the voice data at positions corresponding to the target waveform with the phase values of the target waveform, obtaining frequency-domain parameters of the pitch-shifted voice data;
applying an inverse Fourier transform to the frequency-domain parameters of the pitch-shifted voice data, and obtaining the pitch-shifted voice data after OLA (overlap-add) processing.
11. An electronic device, characterized in that the electronic device comprises a processor and a memory, the memory being coupled to the processor and storing instructions that, when executed by the processor, cause the electronic device to perform the following operations:
obtaining voice data and target MIDI audio, the voice data comprising voice aligned with the target MIDI audio;
obtaining initial frequency-domain parameters of the voice data;
obtaining target frequency-domain parameters corresponding to the preset target MIDI audio, wherein the initial frequency-domain parameters comprise an initial phase of the voice data, and the target frequency-domain parameters comprise a target phase corresponding to the target MIDI audio;
modifying the initial frequency-domain parameters according to the target frequency-domain parameters, transforming the pitch in the voice data to the target pitch in the target MIDI audio, and obtaining pitch-shifted voice data.
12. A readable storage medium comprising a computer program, characterized in that the computer program, when run, controls an electronic device on which the readable storage medium is located to perform the voice data processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810049575.0A CN108269579B (en) | 2018-01-18 | 2018-01-18 | Voice data processing method and device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108269579A true CN108269579A (en) | 2018-07-10 |
CN108269579B CN108269579B (en) | 2020-11-10 |
Family
ID=62776086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810049575.0A Active CN108269579B (en) | 2018-01-18 | 2018-01-18 | Voice data processing method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108269579B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697985A (en) * | 2018-12-25 | 2019-04-30 | 广州市百果园信息技术有限公司 | Audio signal processing method, device and terminal |
CN111739544A (en) * | 2019-03-25 | 2020-10-02 | Oppo广东移动通信有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN112309425A (en) * | 2020-10-14 | 2021-02-02 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic equipment and computer readable storage medium |
CN112420062A (en) * | 2020-11-18 | 2021-02-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio signal processing method and device |
CN114449339A (en) * | 2022-02-16 | 2022-05-06 | 深圳万兴软件有限公司 | Background sound effect conversion method and device, computer equipment and storage medium |
Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1164084A (en) * | 1995-12-28 | 1997-11-05 | 日本胜利株式会社 | Sound pitch converting apparatus |
CN1283060A (en) * | 1999-07-28 | 2001-02-07 | 雅马哈株式会社 | Pronounciation control device and terminal device and system used on carried pronounciation control device |
CN1470050A (en) * | 2000-10-20 | 2004-01-21 | 爱立信电话股份有限公司 | Perceptually improved enhancement of encoded acoustic signals |
CN1473325A (en) * | 2001-08-31 | 2004-02-04 | 株式会社建伍 | Pitch waveform signal generation apparatus, pitch waveform signal generation method, and program |
CN101015451A (en) * | 2007-02-13 | 2007-08-15 | 电子科技大学 | Music brain electricity analytical method |
CN101267686A (en) * | 2007-03-12 | 2008-09-17 | 雅马哈株式会社 | Speaker array apparatus and signal processing method therefor |
CN101354889A (en) * | 2008-09-18 | 2009-01-28 | 北京中星微电子有限公司 | Method and apparatus for tonal modification of voice |
CN101652807A (en) * | 2007-02-01 | 2010-02-17 | 缪斯亚米有限公司 | Music transcription |
CN1831940B (en) * | 2006-04-07 | 2010-06-23 | 安凯(广州)微电子技术有限公司 | Tune and rhythm quickly regulating method based on audio-frequency decoder |
CN101894563A (en) * | 2010-07-15 | 2010-11-24 | 瑞声声学科技(深圳)有限公司 | Voice enhancing method |
KR20120061008A (en) * | 2010-11-02 | 2012-06-12 | 에스케이 텔레콤주식회사 | System and method for improving sound quality in data delivery communication by means of transform of audio signal, apparatus applied to the same |
CN102870153A (en) * | 2010-02-26 | 2013-01-09 | 弗兰霍菲尔运输应用研究公司 | Apparatus and method for modifying an audio signal using harmonic locking |
CN102934163A (en) * | 2010-06-01 | 2013-02-13 | 高通股份有限公司 | Systems, methods, apparatus, and computer program products for wideband speech coding |
CN103514883A (en) * | 2013-09-26 | 2014-01-15 | 华南理工大学 | Method for achieving self-adaptive switching of male voice and female voice |
US20140187210A1 (en) * | 2012-12-28 | 2014-07-03 | Cellco Partnership D/B/A Verizon Wireless | Filtering and enhancement of voice calls in a telecommunications network |
CN104409073A (en) * | 2014-11-04 | 2015-03-11 | 贵阳供电局 | Substation equipment sound and voice identification method |
CN104599677A (en) * | 2014-12-29 | 2015-05-06 | 中国科学院上海高等研究院 | Speech reconstruction-based instantaneous noise suppressing method |
CN104780091A (en) * | 2014-01-13 | 2015-07-15 | 北京发现角科技有限公司 | Instant messaging method and instant messaging system with speech and audio processing function |
CN105654941A (en) * | 2016-01-20 | 2016-06-08 | 华南理工大学 | Voice change method and device based on specific target person voice change ratio parameter |
CN105788589A (en) * | 2016-05-04 | 2016-07-20 | 腾讯科技(深圳)有限公司 | Audio data processing method and device |
CN106228973A (en) * | 2016-07-21 | 2016-12-14 | 福州大学 | Stablize the music voice modified tone method of tone color |
EP3113175A1 (en) * | 2015-07-02 | 2017-01-04 | Thomson Licensing | Method for converting text to individual speech, and apparatus for converting text to individual speech |
CN106297770A (en) * | 2016-08-04 | 2017-01-04 | 杭州电子科技大学 | The natural environment sound identification method extracted based on time-frequency domain statistical nature |
CN106328111A (en) * | 2016-08-22 | 2017-01-11 | 广州酷狗计算机科技有限公司 | Audio processing method and audio processing device |
CN107170464A (en) * | 2017-05-25 | 2017-09-15 | 厦门美图之家科技有限公司 | A kind of changing speed of sound method and computing device based on music rhythm |
Patent Citations (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1135531C (en) * | 1995-12-28 | 2004-01-21 | 日本胜利株式会社 | Sound pitch converting apparatus |
CN1164084A (en) * | 1995-12-28 | 1997-11-05 | 日本胜利株式会社 | Sound pitch converting apparatus |
CN1283060A (en) * | 1999-07-28 | 2001-02-07 | 雅马哈株式会社 | Pronounciation control device and terminal device and system used on carried pronounciation control device |
CN1470050A (en) * | 2000-10-20 | 2004-01-21 | 爱立信电话股份有限公司 | Perceptually improved enhancement of encoded acoustic signals |
CN1473325A (en) * | 2001-08-31 | 2004-02-04 | 株式会社建伍 | Pitch waveform signal generation apparatus, pitch waveform signal generation method, and program |
CN1831940B (en) * | 2006-04-07 | 2010-06-23 | 安凯(广州)微电子技术有限公司 | Tune and rhythm quickly regulating method based on audio-frequency decoder |
CN101652807A (en) * | 2007-02-01 | 2010-02-17 | 缪斯亚米有限公司 | Music transcription |
CN101015451A (en) * | 2007-02-13 | 2007-08-15 | 电子科技大学 | Music brain electricity analytical method |
CN101267686A (en) * | 2007-03-12 | 2008-09-17 | 雅马哈株式会社 | Speaker array apparatus and signal processing method therefor |
CN101354889A (en) * | 2008-09-18 | 2009-01-28 | 北京中星微电子有限公司 | Method and apparatus for tonal modification of voice |
CN102870153A (en) * | 2010-02-26 | 2013-01-09 | 弗兰霍菲尔运输应用研究公司 | Apparatus and method for modifying an audio signal using harmonic locking |
CN102934163A (en) * | 2010-06-01 | 2013-02-13 | 高通股份有限公司 | Systems, methods, apparatus, and computer program products for wideband speech coding |
CN101894563B (en) * | 2010-07-15 | 2013-03-20 | 瑞声声学科技(深圳)有限公司 | Voice enhancing method |
CN101894563A (en) * | 2010-07-15 | 2010-11-24 | 瑞声声学科技(深圳)有限公司 | Voice enhancing method |
KR20120061008A (en) * | 2010-11-02 | 2012-06-12 | 에스케이 텔레콤주식회사 | System and method for improving sound quality in data delivery communication by means of transform of audio signal, apparatus applied to the same |
US20140187210A1 (en) * | 2012-12-28 | 2014-07-03 | Cellco Partnership D/B/A Verizon Wireless | Filtering and enhancement of voice calls in a telecommunications network |
CN103514883A (en) * | 2013-09-26 | 2014-01-15 | 华南理工大学 | Method for achieving self-adaptive switching of male voice and female voice |
CN103514883B (en) * | 2013-09-26 | 2015-12-02 | 华南理工大学 | A kind of self-adaptation realizes men and women's sound changing method |
CN104780091A (en) * | 2014-01-13 | 2015-07-15 | 北京发现角科技有限公司 | Instant messaging method and instant messaging system with speech and audio processing function |
CN104409073A (en) * | 2014-11-04 | 2015-03-11 | 贵阳供电局 | Substation equipment sound and voice identification method |
CN104599677A (en) * | 2014-12-29 | 2015-05-06 | 中国科学院上海高等研究院 | Speech reconstruction-based instantaneous noise suppressing method |
EP3113175A1 (en) * | 2015-07-02 | 2017-01-04 | Thomson Licensing | Method for converting text to individual speech, and apparatus for converting text to individual speech |
CN105654941A (en) * | 2016-01-20 | 2016-06-08 | 华南理工大学 | Voice change method and device based on specific target person voice change ratio parameter |
CN105788589A (en) * | 2016-05-04 | 2016-07-20 | 腾讯科技(深圳)有限公司 | Audio data processing method and device |
CN106228973A (en) * | 2016-07-21 | 2016-12-14 | 福州大学 | Timbre-preserving pitch-shifting method for music vocals |
CN106297770A (en) * | 2016-08-04 | 2017-01-04 | 杭州电子科技大学 | Natural environment sound recognition method based on time-frequency domain statistical feature extraction |
CN106328111A (en) * | 2016-08-22 | 2017-01-11 | 广州酷狗计算机科技有限公司 | Audio processing method and audio processing device |
CN107170464A (en) * | 2017-05-25 | 2017-09-15 | 厦门美图之家科技有限公司 | Audio speed-changing method and computing device based on music rhythm |
Non-Patent Citations (2)
Title |
---|
Mei Tiemin: "Research on an Effective Speech Pitch-Shifting Algorithm", Journal of Shenyang Ligong University *
Wang Shinong et al.: "Research on an Improved Phase-Vocoder Algorithm for Audio Time-Scale Modification", Computer Engineering and Applications *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697985A (en) * | 2018-12-25 | 2019-04-30 | 广州市百果园信息技术有限公司 | Audio signal processing method, device and terminal |
CN109697985B (en) * | 2018-12-25 | 2021-06-29 | 广州市百果园信息技术有限公司 | Audio signal processing method, device and terminal |
CN111739544A (en) * | 2019-03-25 | 2020-10-02 | Oppo广东移动通信有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN111739544B (en) * | 2019-03-25 | 2023-10-20 | Oppo广东移动通信有限公司 | Voice processing method, device, electronic equipment and storage medium |
CN112309425A (en) * | 2020-10-14 | 2021-02-02 | 浙江大华技术股份有限公司 | Audio pitch-shifting method, electronic device and computer-readable storage medium |
CN112420062A (en) * | 2020-11-18 | 2021-02-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio signal processing method and device |
CN114449339A (en) * | 2022-02-16 | 2022-05-06 | 深圳万兴软件有限公司 | Background sound effect conversion method and device, computer equipment and storage medium |
CN114449339B (en) * | 2022-02-16 | 2024-04-12 | 深圳万兴软件有限公司 | Background sound effect conversion method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108269579B (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108269579A (en) | Voice data processing method, device, electronic equipment and readable storage medium | |
Serra et al. | Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition | |
JP6791258B2 (en) | Speech synthesis method, speech synthesizer and program | |
Fitz et al. | On the use of time-frequency reassignment in additive sound modeling |
Quatieri et al. | Audio signal processing based on sinusoidal analysis/synthesis | |
Schwarz et al. | Spectral envelope estimation, representation, and morphing for sound analysis, transformation, and synthesis. | |
CN108766409A (en) | Opera synthesis method, device and computer-readable storage medium |
Serra | Introducing the phase vocoder | |
JP2018077283A (en) | Speech synthesis method | |
Caetano et al. | A source-filter model for musical instrument sound transformation | |
Cavaliere et al. | Granular synthesis of musical signals | |
WO2020162392A1 (en) | Sound signal synthesis method and training method for neural network | |
Every | Separation of musical sources and structure from single-channel polyphonic recordings | |
JP2017219595A (en) | Music producing method | |
Lee et al. | Excitation signal extraction for guitar tones | |
JP6834370B2 (en) | Speech synthesis method | |
Bonada et al. | Spectral approach to the modeling of the singing voice | |
US5911170A (en) | Synthesis of acoustic waveforms based on parametric modeling | |
JP6683103B2 (en) | Speech synthesis method | |
Liao | Analysis and trans-synthesis of acoustic bowed-string instrument recordings: a case study using Bach cello suites |
Mignot et al. | Extended subtractive synthesis of harmonic musical tones | |
Rajan et al. | A continuous time model for Karnatic flute music synthesis | |
JP6822075B2 (en) | Speech synthesis method | |
US20230154451A1 (en) | Differentiable wavetable synthesizer | |
JP2013041128A (en) | Discriminating device for plurality of sound sources and information processing device interlocking with plurality of sound sources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||