Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Method Embodiment
Referring to Fig. 1, a flow chart of the steps of a speech processing method embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: obtain a source speech data stream, where the source speech data stream is formed from speech data collected in real time;
Step 102: perform acoustic feature extraction on the source speech data stream to obtain source acoustic features corresponding to the source speech data stream;
Step 103: according to the source acoustic features, successively convert the collected source speech data stream in real time into a target speech data stream having target acoustic features; where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
The speech processing method of this embodiment can be used to change a user's voice in real time while speech data is being collected. For example, while the source speech data stream generated by user A speaking is being collected, a target speech data stream can be output that still contains what user A said, but in which the voice has become that of user B.
The speech processing method may run on an electronic device, including but not limited to: a server, a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and so on.
In this embodiment, speech data can be collected in real time through a microphone connected to the electronic device; the speech data may be sounds such as speaking or singing. The continuously collected speech data forms the source speech data stream. The sampling frequency for collecting the speech data can be customized as needed, for example 40 samples per second.
To add interest to users' lives and meet their diversified demands, this embodiment performs voice change processing on the collected speech data of the user. To further improve the efficiency of the voice change, this embodiment applies stream processing to the speech data while it is being collected into the source speech data stream: voice change processing is performed directly on the portion of the source speech data stream that is generated first, and then directly on each later portion as it is generated, without waiting for all the speech data to finish recording.
Thus, through this embodiment, while speech data is being collected, the source speech data stream can be acquired in real time while its source acoustic features are extracted, and, according to the extracted source acoustic features, the collected source speech data stream can be converted in real time into a target speech data stream, where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
In an optional embodiment of the present invention, the timbre characteristics may include at least any one of the following: fundamental frequency, spectrum, and speaking rate. The fundamental frequency is the frequency of the fundamental tone and determines the pitch of the sound as a whole. "Spectrum" is short for spectral density, the distribution curve of the signal over frequency. Speaking rate refers to the number of words a person utters per unit time when conveying or communicating information. In practice, everyone's voice is different because everyone has different timbre characteristics; that is, everyone speaks with a different fundamental frequency, spectrum, and speaking rate.
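The fundamental frequency described above can be illustrated with a short sketch. The following Python example is illustrative only and not part of the claimed embodiment; the zero-crossing method and all numeric values are assumptions chosen for demonstration, standing in for the pitch trackers a production voice changer would use:

```python
import math

def estimate_f0(samples, sample_rate):
    """Estimate the fundamental frequency of a roughly periodic signal by
    timing its rising zero crossings -- a crude stand-in for real pitch
    tracking."""
    idx = [i + 1 for i, (a, b) in enumerate(zip(samples, samples[1:]))
           if a < 0 <= b]
    if len(idx) < 2:
        return 0.0
    periods = len(idx) - 1                      # full cycles between crossings
    elapsed = (idx[-1] - idx[0]) / sample_rate  # seconds spanned by those cycles
    return periods / elapsed

sr = 16000
tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]  # 220 Hz, 1 s
print(round(estimate_f0(tone, sr)))  # -> 220
```

A real system would track how the fundamental frequency varies frame by frame rather than over a whole recording, but the principle is the same.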
For example, while user A is speaking, this embodiment can collect the source speech data stream generated by user A's speech while converting the timbre characteristics in the collected stream, yielding a target speech data stream in which the content of user A's speech is unchanged but the timbre characteristics have changed, so that the target speech data stream sounds like the voice of user B. A real-time voice change can thus be achieved, greatly improving voice-change efficiency.
In an optional embodiment of the present invention, performing acoustic feature extraction on the source speech data stream to obtain the source acoustic features corresponding to the source speech data stream may specifically include:
Step S11: performing framing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
Step S12: successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames; where the source acoustic features include: the source speech content corresponding to the speech frame, and the source timbre characteristics corresponding to the speech frame.
In this embodiment, framing can be performed on the source speech data stream according to a preset window length and frame shift, cutting the source speech data stream into multiple speech frames to obtain the speech frame sequence corresponding to the source speech data stream; each speech frame can be a short speech segment, so that the source speech data stream can be processed frame by frame. The window length indicates the duration of each speech frame, and the frame shift indicates the time difference between adjacent speech frames. For example, with a window length of 25 ms and a frame shift of 15 ms, the first speech frame covers 0–25 ms, the second covers 15–40 ms, and so on, thereby framing the source speech data stream. It will be appreciated that the specific window length and frame shift can be set as needed; this embodiment does not limit them.
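The window-length/frame-shift example above (25 ms window, 15 ms shift) can be sketched as follows. This Python snippet is an illustrative assumption, not the embodiment's implementation; it uses a toy 1 kHz sample rate so that one sample corresponds to one millisecond:

```python
def frame_stream(samples, sample_rate, window_ms=25, shift_ms=15):
    """Cut a stream of samples into overlapping speech frames.
    window_ms is the window length (frame duration) and shift_ms the
    frame shift (time between adjacent frame starts)."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames

sr = 1000                    # toy rate: 1 sample per millisecond
samples = list(range(100))   # 100 ms of dummy audio
frames = frame_stream(samples, sr)
# first frame spans 0-24 ms, second spans 15-39 ms, matching the text
print(len(frames), frames[0][0], frames[0][-1], frames[1][0], frames[1][-1])
# -> 6 0 24 15 39
```

Because the window is longer than the shift, adjacent frames overlap, which is what lets the later per-frame processing stay continuous across frame boundaries.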
After framing the source speech data stream, acoustic feature extraction can be successively performed on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to each speech frame. Specifically, any existing or future acoustic feature extraction method can be used. The acoustic features may include MFCC (Mel-Frequency Cepstral Coefficient) features and the like. In general, these features are multi-dimensional vectors whose values may be discrete or continuous.
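As a hedged illustration of the per-frame feature extraction described above: a real system would extract multi-dimensional MFCC vectors (which require an FFT, a mel filter bank, and a DCT, typically from a DSP library); the sketch below substitutes a one-dimensional log-energy feature simply to show the frame-by-frame shape of the computation. All names and values are illustrative assumptions:

```python
import math

def frame_log_energy(frame):
    """Log energy of one speech frame -- a one-dimensional stand-in for
    the multi-dimensional MFCC vectors mentioned in the text."""
    energy = sum(s * s for s in frame)
    return math.log(energy + 1e-10)   # floor avoids log(0) on silent frames

loud = [0.5 * math.sin(0.1 * n) for n in range(400)]
quiet = [0.0] * 400
features = [frame_log_energy(f) for f in (loud, quiet)]
print(features[0] > features[1])  # -> True
```

Whatever the feature actually is, the key property used later is that it is computed one frame at a time, so extraction can keep pace with collection.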
In this embodiment, the source acoustic features may specifically include: the source speech content corresponding to the speech frame, and the source timbre characteristics corresponding to the speech frame. The source speech content may include the spoken content, sung content, and the like in the source speech data stream. Therefore, this embodiment can keep the source speech content in the source speech data stream unchanged and convert only the corresponding source timbre characteristics into target timbre characteristics, thereby achieving a real-time voice change of the source speech data stream.
In an optional embodiment of the present invention, successively converting the collected source speech data stream in real time into a target speech data stream having target acoustic features according to the source acoustic features may specifically include:
Step S21: for the speech frames in the speech frame sequence, successively extracting the source speech content from the source acoustic features corresponding to each speech frame;
Step S22: generating the target acoustic features corresponding to the speech frame according to the target timbre characteristics and the extracted source speech content; where the target timbre characteristics are different from the source timbre characteristics;
Step S23: performing speech synthesis on the target acoustic features corresponding to the speech frames to obtain the target speech data stream.
In this embodiment, the voice change of the speech data occurs in the process of converting the source acoustic features into target acoustic features. Specifically, for each speech frame in the speech frame sequence, the source speech content is successively extracted from the frame's source acoustic features; then, according to the target timbre characteristics and the extracted source speech content, the target acoustic features corresponding to the frame are generated, the target timbre characteristics being different from the source timbre characteristics. The target acoustic features may thus correspond to the voice of user B saying the same content as user A. Finally, speech synthesis is performed on the target acoustic features to obtain the target speech data stream; that is, the input is user A's voice and speech content, and the output is still user A's speech content, but in the voice of user B.
It will be appreciated that this embodiment may use any existing or future speech synthesis method to perform speech synthesis on the target acoustic features corresponding to the speech frames. For example, the target acoustic features can be restored to corresponding waveforms, and the target speech data stream obtained by a waveform concatenation synthesis method.
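The waveform concatenation mentioned above might be sketched as follows. This is a minimal illustration under the assumption that per-frame waveform snippets overlap at their boundaries, where a linear crossfade avoids audible clicks; it is not the embodiment's actual synthesis method:

```python
def concatenate_waveforms(snippets, overlap):
    """Join per-frame waveform snippets into one stream, linearly
    crossfading 'overlap' samples at each boundary so the splice
    does not click -- a minimal form of waveform-concatenation
    synthesis."""
    out = list(snippets[0])
    for snip in snippets[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the new snippet
            out[-overlap + i] = out[-overlap + i] * (1 - w) + snip[i] * w
        out.extend(snip[overlap:])
    return out

a = [1.0] * 6   # toy snippet of constant amplitude 1
b = [0.0] * 6   # toy snippet of silence
stream = concatenate_waveforms([a, b], overlap=2)
print(len(stream), [round(x, 2) for x in stream[4:8]])
# -> 10 [0.67, 0.33, 0.0, 0.0]
```

The two overlapped samples ramp smoothly from the first snippet's level to the second's, instead of jumping discontinuously at the splice point.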
In an optional embodiment of the present invention, after successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames, the method may further include:
Step S31: sequentially inputting the source acoustic features corresponding to the speech frames in the speech frame sequence into an acoustic model, to output the acoustic state probabilities corresponding to the speech frames;
Step S32: recording the acoustic state probabilities corresponding to the speech frames.
Accordingly, converting the collected source speech data stream in real time into a target speech data stream having target acoustic features according to the source acoustic features may specifically include: inputting the source acoustic features and acoustic state probabilities corresponding to the speech frames in the speech frame sequence into a speech synthesis network, converting the source acoustic features of the speech frames into target acoustic features through the speech synthesis network, and synthesizing the target acoustic features to obtain the target speech data stream; where, in the process of converting the source acoustic features of a speech frame into target acoustic features, the speech synthesis network calculates the target acoustic features of the current speech frame according to the target acoustic features and acoustic state probability of the previous speech frame.
After successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames, the source acoustic features can be sequentially input into the acoustic model to output the acoustic state probabilities corresponding to the speech frames.
The acoustic model describes the relationship between speech frames and acoustic state probabilities. It is understood that speech recognition may also include other steps; for example, after the acoustic state probabilities are obtained, the speech frames can further be converted into text according to a language model, a pronunciation dictionary, and the like, completing full speech recognition.
In general, speech synthesis converts text information into speech information. Specifically, the text information to be converted into speech can be input into a speech synthesis network, which predicts the acoustic features corresponding to the text and the state duration information of the acoustic states the text contains; the speech corresponding to the text can then be synthesized from the predicted acoustic features and the state duration information of the contained acoustic states.
However, in order to change the voice of the speech data in real time, this embodiment performs speech recognition on the source speech data stream collected in real time. After the acoustic features and acoustic state probabilities corresponding to the first speech frame in the source speech data stream are obtained, the source acoustic features and acoustic state probabilities corresponding to that speech frame can be input into the speech synthesis network to output the target speech data corresponding to that frame. This avoids the subsequent operations of converting speech frames into text and then predicting acoustic features and state duration information from the text, and avoids waiting for all of the user's speech data to finish recording before converting. The speech synthesis network directly uses the source acoustic features extracted from the speech frames and the recorded acoustic states, so it can support stream processing of speech frames; that is, starting from the user's first speech frame, each frame can have its voice changed as soon as it is received and the changed speech played back, so that the changed voice of the user is heard while the user is still speaking. Embodiments of the present invention thus not only achieve a real-time voice change, adding interest to users' lives and meeting their diversified demands, but also reduce the number of operations in the voice change process, further improving voice-change efficiency.
Further, speech does not consist of a single frequency; it is the superposition of simple harmonic vibrations of many frequencies, and the superposition of these component frequencies forms multiple wave crests of different amplitudes. In other words, each speech frame is related to the frames before and after it; if a speech frame were processed in isolation, errors would arise during speech processing. Therefore, in this embodiment, in the process of converting the source acoustic features of a speech frame into target acoustic features, the speech synthesis network calculates the target acoustic features of the current speech frame according to the target acoustic features and acoustic state probability of the previous speech frame; that is, the state information h(t) of the current speech frame depends on the acoustic features x(t) of the current speech frame and the state information h(t-1) of the previous speech frame. Processing speech frames according to this contextual information improves the accuracy of the speech recognition and speech synthesis processes.
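The dependency h(t) = f(x(t), h(t-1)) described above can be sketched with a minimal recurrent cell. The weights and the one-dimensional features below are arbitrary toy assumptions, not the embodiment's actual network:

```python
import math

def step(x_t, h_prev, w_x=0.5, w_h=0.5):
    """One step of a minimal recurrent cell: the state h(t) for the
    current speech frame depends on that frame's feature x(t) and on
    the previous frame's state h(t-1). Weights are toy values."""
    return math.tanh(w_x * x_t + w_h * h_prev)

features = [0.2, 0.9, 0.1, 0.8]   # dummy one-dimensional frame features
h = 0.0
states = []
for x in features:
    h = step(x, h)                # context flows frame to frame via h
    states.append(h)
print(all(-1.0 < s < 1.0 for s in states))  # -> True
```

Because each state folds in the previous one, the output for a frame is not a function of that frame alone, which is exactly the context dependence the text motivates.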
In the process of changing the voice of the collected speech data in real time, this embodiment can input the source speech data stream, as it is collected, frame by frame into a speech recognition network; the speech recognition network performs speech recognition on the successively input speech frames to obtain the source acoustic features and state information of each speech frame. The source acoustic features and state information output by the speech recognition network are then input into the speech synthesis network, which performs voice change processing frame by frame and outputs speech frames whose source acoustic features have been converted into target acoustic features, thereby obtaining the target speech data stream.
The speech recognition network and/or the speech synthesis network may combine multiple neural networks, including but not limited to combinations, superpositions, or nestings of at least one of the following: a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, an RNN (Recurrent Neural Network), an attention neural network, and so on. It will be appreciated that embodiments of the present invention place no restriction on the types or training methods of the speech recognition network and the speech synthesis network.
In summary, according to embodiments of the present invention, while the source speech data stream formed by collecting speech data in real time is being obtained, acoustic feature extraction can be performed on the source speech data stream to obtain its corresponding source acoustic features, and then, according to the source acoustic features, the collected source speech data stream can be successively converted in real time into a target speech data stream having target acoustic features, where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics. Thus, the source speech data stream can be collected in real time while being converted into the target speech data stream. Since the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics, while user A is speaking, user A's voice can be converted into the voice of user B with the content of user A's speech unchanged. A real-time voice change is thereby achieved, greatly improving voice-change efficiency, adding interest to users' lives, and meeting users' diversified demands.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that embodiments of the present invention are not limited by the described order of actions, since according to embodiments of the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by every embodiment of the present invention.
Device Embodiment
Referring to Fig. 2, a structural block diagram of a speech processing device embodiment of the present invention is shown. The device may specifically include:
a data obtaining module 201, configured to obtain a source speech data stream, where the source speech data stream is formed from speech data collected in real time;
a feature extraction module 202, configured to perform acoustic feature extraction on the source speech data stream to obtain source acoustic features corresponding to the source speech data stream; and
a data conversion module 203, configured to, according to the source acoustic features, successively convert the collected source speech data stream in real time into a target speech data stream having target acoustic features; where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
Optionally, the feature extraction module 202 may specifically include:
a framing submodule, configured to perform framing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream; and
a feature extraction submodule, configured to successively perform acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames; where the source acoustic features include: the source speech content corresponding to the speech frame, and the source timbre characteristics corresponding to the speech frame.
Optionally, the data conversion module 203 may specifically include:
a content extraction submodule, configured to, for the speech frames in the speech frame sequence, successively extract the source speech content from the source acoustic features corresponding to each speech frame;
a feature conversion submodule, configured to generate the target acoustic features corresponding to the speech frame according to the target timbre characteristics and the extracted source speech content; where the target timbre characteristics are different from the source timbre characteristics; and
a speech synthesis submodule, configured to perform speech synthesis on the target acoustic features corresponding to the speech frames to obtain the target speech data stream.
Optionally, the device may further include:
a state determination module, configured to sequentially input the source acoustic features corresponding to the speech frames in the speech frame sequence into an acoustic model, to output the acoustic state probabilities corresponding to the speech frames; and
a state recording module, configured to record the acoustic state probabilities corresponding to the speech frames.
The data conversion module 203 may then specifically include:
a data conversion submodule, configured to input the source acoustic features and acoustic state probabilities corresponding to the speech frames in the speech frame sequence into a speech synthesis network, convert the source acoustic features of the speech frames into target acoustic features through the speech synthesis network, and synthesize the target acoustic features to obtain the target speech data stream; where, in the process of converting the source acoustic features of a speech frame into target acoustic features, the speech synthesis network calculates the target acoustic features of the current speech frame according to the target acoustic features and acoustic state probability of the previous speech frame.
Optionally, the timbre characteristics may include at least any one of the following: fundamental frequency, spectrum, and speaking rate.
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other.
Regarding the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the related method embodiments and will not be elaborated here.
An embodiment of the present invention provides a device for speech processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for: obtaining a source speech data stream, where the source speech data stream is formed from speech data collected in real time; performing acoustic feature extraction on the source speech data stream to obtain source acoustic features corresponding to the source speech data stream; and, according to the source acoustic features, successively converting the collected source speech data stream in real time into a target speech data stream having target acoustic features; where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
Fig. 3 is a block diagram of a device 800 for speech processing according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, medical equipment, fitness equipment, a personal digital assistant, and so on.
Referring to Fig. 3, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions, so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and the other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation on the device 800. Examples of such data include instructions for any application or method operated on the device 800, contact data, phone book data, messages, pictures, video, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 provides power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen providing an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front and rear cameras may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the device 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors to provide status assessments of various aspects of the device 800. For example, the sensor component 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor component 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 800 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, where the instructions may be executed by the processor 820 of the device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 4 is a structural schematic diagram of a server in some embodiments of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), a memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient storage or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
There is provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform the speech processing method shown in Fig. 1.
There is provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform a speech processing method, the method comprising: obtaining a source speech data stream, the source speech data stream being formed from speech data collected in real time; performing acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream; and according to the source acoustic feature, successively converting, in real time, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
Embodiments of the present invention disclose A1, a speech processing method, comprising:
obtaining a source speech data stream, the source speech data stream being formed from speech data collected in real time;
performing acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream;
according to the source acoustic feature, successively converting, in real time, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
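The three-step method above (obtain a real-time stream, extract features, convert frame by frame) can be sketched as a minimal streaming loop. The helper names (`extract_features`, `convert`) and the frame size are illustrative placeholders, not part of the disclosed embodiment; the stand-in conversion simply passes the samples through where a real system would replace the timbre while preserving the content.

```python
import numpy as np

FRAME = 160  # samples per frame (10 ms at 16 kHz) -- an assumed configuration


def extract_features(frame: np.ndarray) -> np.ndarray:
    """Stand-in acoustic feature extractor: the raw frame plus log energy."""
    energy = np.log(np.sum(frame ** 2) + 1e-8)
    return np.append(frame, energy)


def convert(features: np.ndarray) -> np.ndarray:
    """Stand-in conversion: returns the frame samples unchanged; a real
    system would swap the timbre characteristic here."""
    return features[:-1]


def voice_change_stream(source_stream):
    """Consume source speech frames as they arrive and yield target frames."""
    for frame in source_stream:          # obtain the stream in real time
        feats = extract_features(frame)  # acoustic feature extraction
        yield convert(feats)             # frame-wise conversion


# usage: feed a sequence of captured frames, collect converted frames
frames = [np.zeros(FRAME) for _ in range(3)]
out = list(voice_change_stream(frames))
```

Because `voice_change_stream` is a generator, each target frame is produced as soon as its source frame arrives, matching the real-time character of the method.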
A2. The method according to A1, wherein the performing acoustic feature extraction on the source speech data stream to obtain the source acoustic feature corresponding to the source speech data stream comprises:
performing framing processing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain a source acoustic feature corresponding to each speech frame; wherein the source acoustic feature comprises source speech content corresponding to the speech frame and a source timbre characteristic corresponding to the speech frame.
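The framing processing described above can be illustrated with a short sketch. The 25 ms window and 10 ms hop at 16 kHz are common speech-processing defaults assumed here for illustration, not values taken from the disclosure.

```python
import numpy as np


def frame_signal(signal: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Split a 1-D signal into overlapping frames (25 ms window, 10 ms hop
    at 16 kHz -- assumed defaults). Returns an array of shape
    (num_frames, frame_len); trailing samples that do not fill a full
    frame are dropped."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])


sig = np.arange(1600, dtype=float)  # 100 ms of placeholder audio at 16 kHz
frames = frame_signal(sig)
# each row of `frames` is one speech frame of the resulting frame sequence
```

Per-frame acoustic feature extraction would then run over the rows of `frames` in order, preserving the stream's temporal order.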
A3. The method according to A2, wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
for each speech frame in the speech frame sequence, successively extracting the source speech content from the source acoustic feature corresponding to the speech frame;
generating a target acoustic feature corresponding to the speech frame according to a target timbre characteristic and the extracted source speech content; wherein the target timbre characteristic is different from the source timbre characteristic;
performing speech synthesis on the target acoustic feature corresponding to the speech frame to obtain the target speech data stream.
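One way to read the per-frame conversion above is as a "content in, new timbre out" loop. The sketch below fakes both halves with dictionaries so that only the control flow is visible; the feature layout (a `content` field alongside an `f0` timbre field) is an assumption made for illustration, not the disclosed representation.

```python
# Hypothetical per-frame conversion loop: keep the content part of each
# source acoustic feature, replace the timbre part with the target's.
def convert_frame(src_feature: dict, target_timbre: dict) -> dict:
    content = src_feature["content"]  # extracted source speech content
    # target acoustic feature = same content + target timbre characteristic
    return {"content": content, **target_timbre}


def convert_stream(frame_features, target_timbre):
    return [convert_frame(f, target_timbre) for f in frame_features]


src = [{"content": "ni", "f0": 220.0}, {"content": "hao", "f0": 210.0}]
tgt = convert_stream(src, {"f0": 110.0})  # assumed target-speaker pitch
# tgt keeps the spoken content but carries the target timbre
```

The design point is that the content and timbre components are separable, so only the timbre component needs to change for the output to sound like a different speaker saying the same words.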
A4. The method according to A2, wherein after successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic feature corresponding to each speech frame, the method further comprises:
successively inputting the source acoustic feature corresponding to each speech frame in the speech frame sequence into an acoustic model, so as to output an acoustic state probability corresponding to the speech frame;
recording the acoustic state probability corresponding to the speech frame;
and wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
inputting the source acoustic feature and the acoustic state probability corresponding to each speech frame in the speech frame sequence into a speech synthesis network, converting the source acoustic feature of the speech frame into a target acoustic feature through the speech synthesis network, and synthesizing the target acoustic feature to obtain the target speech data stream;
wherein, in the process of converting the source acoustic feature of the speech frame into the target acoustic feature, the speech synthesis network calculates the target acoustic feature of the current speech frame according to the target acoustic feature and the acoustic state probability of the previous speech frame.
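The recurrence described above, where each target frame depends on the previous target frame and the recorded acoustic state probability, can be sketched as a simple autoregressive update. The blending rule below is invented purely to make the recurrence concrete; it is not the disclosed speech synthesis network.

```python
import numpy as np


def synthesize(src_features, state_probs, alpha=0.5):
    """Toy autoregressive conversion: each target feature blends the
    current source feature with the previous *target* feature, weighted
    by the recorded acoustic state probability. A confident frame
    (p near 1) follows the source; an uncertain frame leans on the
    previous target frame for continuity. Illustrative rule only."""
    targets = []
    prev = np.zeros_like(src_features[0])
    for feat, p in zip(src_features, state_probs):
        cur = p * feat + (1.0 - p) * (alpha * prev + (1.0 - alpha) * feat)
        targets.append(cur)
        prev = cur  # the recurrence: next frame sees this target frame
    return targets


feats = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
probs = [1.0, 0.5, 0.8]  # recorded acoustic state probabilities (assumed)
out = synthesize(feats, probs)
```

Conditioning on the previous target frame in this way is what keeps the converted stream smooth across frame boundaries, which is the stated purpose of feeding the recorded state probabilities back into the conversion.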
A5. The method according to any one of A1 to A4, wherein the timbre characteristic comprises at least one of the following features: fundamental frequency, spectrum, and speech rate.
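Of the three timbre characteristics just listed, fundamental frequency is the easiest to make concrete. A minimal autocorrelation-based F0 estimate for one voiced frame might look like the following; the sample rate and pitch range are assumed parameter choices, not values from the disclosure.

```python
import numpy as np


def estimate_f0(frame: np.ndarray, sr: int = 16000,
                fmin: float = 50.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of one voiced frame by picking
    the autocorrelation peak within a plausible pitch-lag range."""
    frame = frame - frame.mean()
    # autocorrelation at non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)  # lag range for [fmin, fmax]
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag


sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 200.0 * t)  # synthetic 200 Hz tone
f0 = estimate_f0(frame)                # close to 200 Hz
```

Lowering or raising the estimated F0 of every frame is one simple way a conversion system can shift the perceived timbre toward a different speaker while leaving the spoken content intact.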
Embodiments of the present invention disclose B6, a speech processing apparatus, comprising:
a data obtaining module, configured to obtain a source speech data stream, the source speech data stream being formed from speech data collected in real time;
a feature extraction module, configured to perform acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream;
a data conversion module, configured to successively convert, in real time according to the source acoustic feature, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
B7. The apparatus according to B6, wherein the feature extraction module comprises:
a framing submodule, configured to perform framing processing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
a feature extraction submodule, configured to successively perform acoustic feature extraction on the speech frames in the speech frame sequence to obtain a source acoustic feature corresponding to each speech frame; wherein the source acoustic feature comprises source speech content corresponding to the speech frame and a source timbre characteristic corresponding to the speech frame.
B8. The apparatus according to B7, wherein the data conversion module comprises:
a content extraction submodule, configured to, for each speech frame in the speech frame sequence, successively extract the source speech content from the source acoustic feature corresponding to the speech frame;
a feature conversion submodule, configured to generate a target acoustic feature corresponding to the speech frame according to a target timbre characteristic and the extracted source speech content; wherein the target timbre characteristic is different from the source timbre characteristic;
a speech synthesis submodule, configured to perform speech synthesis on the target acoustic feature corresponding to the speech frame to obtain the target speech data stream.
B9. The apparatus according to B7, wherein the apparatus further comprises:
a state determination module, configured to successively input the source acoustic feature corresponding to each speech frame in the speech frame sequence into an acoustic model, so as to output an acoustic state probability corresponding to the speech frame;
a state recording module, configured to record the acoustic state probability corresponding to the speech frame;
and wherein the data conversion module comprises:
a data conversion submodule, configured to input the source acoustic feature and the acoustic state probability corresponding to each speech frame in the speech frame sequence into a speech synthesis network, convert the source acoustic feature of the speech frame into a target acoustic feature through the speech synthesis network, and synthesize the target acoustic feature to obtain the target speech data stream; wherein, in the process of converting the source acoustic feature of the speech frame into the target acoustic feature, the speech synthesis network calculates the target acoustic feature of the current speech frame according to the target acoustic feature and the acoustic state probability of the previous speech frame.
B10. The apparatus according to any one of B6 to B9, wherein the timbre characteristic comprises at least one of the following features: fundamental frequency, spectrum, and speech rate.
Embodiments of the present invention disclose C11, an apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:
obtaining a source speech data stream, the source speech data stream being formed from speech data collected in real time;
performing acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream;
according to the source acoustic feature, successively converting, in real time, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
C12. The apparatus according to C11, wherein the performing acoustic feature extraction on the source speech data stream to obtain the source acoustic feature corresponding to the source speech data stream comprises:
performing framing processing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain a source acoustic feature corresponding to each speech frame; wherein the source acoustic feature comprises source speech content corresponding to the speech frame and a source timbre characteristic corresponding to the speech frame.
C13. The apparatus according to C12, wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
for each speech frame in the speech frame sequence, successively extracting the source speech content from the source acoustic feature corresponding to the speech frame;
generating a target acoustic feature corresponding to the speech frame according to a target timbre characteristic and the extracted source speech content; wherein the target timbre characteristic is different from the source timbre characteristic;
performing speech synthesis on the target acoustic feature corresponding to the speech frame to obtain the target speech data stream.
C14. The apparatus according to C12, wherein the one or more programs, to be executed by the one or more processors, further include instructions for performing the following operations:
successively inputting the source acoustic feature corresponding to each speech frame in the speech frame sequence into an acoustic model, so as to output an acoustic state probability corresponding to the speech frame;
recording the acoustic state probability corresponding to the speech frame;
and wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
inputting the source acoustic feature and the acoustic state probability corresponding to each speech frame in the speech frame sequence into a speech synthesis network, converting the source acoustic feature of the speech frame into a target acoustic feature through the speech synthesis network, and synthesizing the target acoustic feature to obtain the target speech data stream; wherein, in the process of converting the source acoustic feature of the speech frame into the target acoustic feature, the speech synthesis network calculates the target acoustic feature of the current speech frame according to the target acoustic feature and the acoustic state probability of the previous speech frame.
C15. The apparatus according to any one of C11 to C14, wherein the timbre characteristic comprises at least one of the following features: fundamental frequency, spectrum, and speech rate.
Embodiments of the present invention disclose D16, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the speech processing method according to any one of A1 to A5.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the present invention. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional technical means in the art not disclosed in this disclosure. The specification and examples are to be considered exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
The speech processing method, the speech processing apparatus, and the apparatus for speech processing provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principles and embodiments of the present invention, and the above description of the embodiments is intended only to help in understanding the method of the present invention and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.