Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Method Embodiment
Referring to Fig. 1, a flow chart of the steps of a speech processing method embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: obtain a source speech data stream, where the source speech data stream is formed from speech data collected in real time;
Step 102: perform acoustic feature extraction on the source speech data stream to obtain source acoustic features corresponding to the source speech data stream;
Step 103: according to the source acoustic features, successively convert the collected source speech data stream in real time into a target speech data stream having target acoustic features; where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
The speech processing method of this embodiment can be used to change a user's voice in real time while speech data is being collected. For example, while the source speech data stream generated by user A speaking is being collected, a target speech data stream can be output that still contains what user A said, but in which the voice has become that of user B.
The speech processing method may run on an electronic device, including but not limited to: a server, a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and so on.
In this embodiment, speech data can be collected in real time through a microphone connected to the electronic device; the speech data may be sounds such as speaking or singing. The continuously collected speech data forms the source speech data stream. The sampling frequency for collecting the speech data can be customized as needed, for example 40 samples per second.
To add interest to users' lives and meet their diversified demands, this embodiment performs voice change processing on the collected speech data of the user. To further improve the efficiency of the voice change, this embodiment applies stream processing to the speech data while it is being collected into the source speech data stream: voice change processing is performed directly on the portion of the source speech data stream that is generated first, and then directly on each later portion as it is generated, without waiting for all the speech data to finish recording.
Thus, through this embodiment, while speech data is being collected, the source speech data stream can be acquired in real time while its source acoustic features are extracted, and, according to the extracted source acoustic features, the collected source speech data stream can be converted in real time into a target speech data stream, where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
In an optional embodiment of the present invention, the timbre characteristics may include at least any one of the following: fundamental frequency, spectrum, and speaking rate. The fundamental frequency is the frequency of the fundamental tone and determines the pitch of the sound as a whole. "Spectrum" is short for spectral density, the distribution curve of the signal over frequency. Speaking rate refers to the number of words a person utters per unit time when conveying or communicating information. In practice, everyone's voice is different because everyone has different timbre characteristics; that is, everyone speaks with a different fundamental frequency, spectrum, and speaking rate.
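The fundamental frequency described above can be illustrated with a short sketch. The following Python example is illustrative only and not part of the claimed embodiment; the zero-crossing method and all numeric values are assumptions chosen for demonstration, standing in for the pitch trackers a production voice changer would use:

```python
import math

def estimate_f0(samples, sample_rate):
    """Estimate the fundamental frequency of a roughly periodic signal by
    timing its rising zero crossings -- a crude stand-in for real pitch
    tracking."""
    idx = [i + 1 for i, (a, b) in enumerate(zip(samples, samples[1:]))
           if a < 0 <= b]
    if len(idx) < 2:
        return 0.0
    periods = len(idx) - 1                      # full cycles between crossings
    elapsed = (idx[-1] - idx[0]) / sample_rate  # seconds spanned by those cycles
    return periods / elapsed

sr = 16000
tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]  # 220 Hz, 1 s
print(round(estimate_f0(tone, sr)))  # -> 220
```

A real system would track how the fundamental frequency varies frame by frame rather than over a whole recording, but the principle is the same.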
For example, while user A is speaking, this embodiment can collect the source speech data stream generated by user A's speech while converting the timbre characteristics in the collected stream, yielding a target speech data stream in which the content of user A's speech is unchanged but the timbre characteristics have changed, so that the target speech data stream sounds like the voice of user B. A real-time voice change can thus be achieved, greatly improving voice-change efficiency.
In an optional embodiment of the present invention, performing acoustic feature extraction on the source speech data stream to obtain the source acoustic features corresponding to the source speech data stream may specifically include:
Step S11: performing framing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
Step S12: successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames; where the source acoustic features include: the source speech content corresponding to the speech frame, and the source timbre characteristics corresponding to the speech frame.
In this embodiment, framing can be performed on the source speech data stream according to a preset window length and frame shift, cutting the source speech data stream into multiple speech frames to obtain the speech frame sequence corresponding to the source speech data stream; each speech frame can be a short speech segment, so that the source speech data stream can be processed frame by frame. The window length indicates the duration of each speech frame, and the frame shift indicates the time difference between adjacent speech frames. For example, with a window length of 25 ms and a frame shift of 15 ms, the first speech frame covers 0–25 ms, the second covers 15–40 ms, and so on, thereby framing the source speech data stream. It will be appreciated that the specific window length and frame shift can be set as needed; this embodiment does not limit them.
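The window-length/frame-shift example above (25 ms window, 15 ms shift) can be sketched as follows. This Python snippet is an illustrative assumption, not the embodiment's implementation; it uses a toy 1 kHz sample rate so that one sample corresponds to one millisecond:

```python
def frame_stream(samples, sample_rate, window_ms=25, shift_ms=15):
    """Cut a stream of samples into overlapping speech frames.
    window_ms is the window length (frame duration) and shift_ms the
    frame shift (time between adjacent frame starts)."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames

sr = 1000                    # toy rate: 1 sample per millisecond
samples = list(range(100))   # 100 ms of dummy audio
frames = frame_stream(samples, sr)
# first frame spans 0-24 ms, second spans 15-39 ms, matching the text
print(len(frames), frames[0][0], frames[0][-1], frames[1][0], frames[1][-1])
# -> 6 0 24 15 39
```

Because the window is longer than the shift, adjacent frames overlap, which is what lets the later per-frame processing stay continuous across frame boundaries.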
After framing the source speech data stream, acoustic feature extraction can be successively performed on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to each speech frame. Specifically, any existing or future acoustic feature extraction method can be used. The acoustic features may include MFCC (Mel-Frequency Cepstral Coefficient) features and the like. In general, these features are multi-dimensional vectors whose values may be discrete or continuous.
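As a hedged illustration of the per-frame feature extraction described above: a real system would extract multi-dimensional MFCC vectors (which require an FFT, a mel filter bank, and a DCT, typically from a DSP library); the sketch below substitutes a one-dimensional log-energy feature simply to show the frame-by-frame shape of the computation. All names and values are illustrative assumptions:

```python
import math

def frame_log_energy(frame):
    """Log energy of one speech frame -- a one-dimensional stand-in for
    the multi-dimensional MFCC vectors mentioned in the text."""
    energy = sum(s * s for s in frame)
    return math.log(energy + 1e-10)   # floor avoids log(0) on silent frames

loud = [0.5 * math.sin(0.1 * n) for n in range(400)]
quiet = [0.0] * 400
features = [frame_log_energy(f) for f in (loud, quiet)]
print(features[0] > features[1])  # -> True
```

Whatever the feature actually is, the key property used later is that it is computed one frame at a time, so extraction can keep pace with collection.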
In this embodiment, the source acoustic features may specifically include: the source speech content corresponding to the speech frame, and the source timbre characteristics corresponding to the speech frame. The source speech content may include the spoken content, sung content, and the like in the source speech data stream. Therefore, this embodiment can keep the source speech content in the source speech data stream unchanged and convert only the corresponding source timbre characteristics into target timbre characteristics, thereby achieving a real-time voice change of the source speech data stream.
In an optional embodiment of the present invention, successively converting the collected source speech data stream in real time into a target speech data stream having target acoustic features according to the source acoustic features may specifically include:
Step S21: for the speech frames in the speech frame sequence, successively extracting the source speech content from the source acoustic features corresponding to each speech frame;
Step S22: generating the target acoustic features corresponding to the speech frame according to the target timbre characteristics and the extracted source speech content; where the target timbre characteristics are different from the source timbre characteristics;
Step S23: performing speech synthesis on the target acoustic features corresponding to the speech frames to obtain the target speech data stream.
In this embodiment, the voice change of the speech data occurs in the process of converting the source acoustic features into target acoustic features. Specifically, for each speech frame in the speech frame sequence, the source speech content is successively extracted from the frame's source acoustic features; then, according to the target timbre characteristics and the extracted source speech content, the target acoustic features corresponding to the frame are generated, the target timbre characteristics being different from the source timbre characteristics. The target acoustic features may thus correspond to the voice of user B saying the same content as user A. Finally, speech synthesis is performed on the target acoustic features to obtain the target speech data stream; that is, the input is user A's voice and speech content, and the output is still user A's speech content, but in the voice of user B.
It will be appreciated that this embodiment may use any existing or future speech synthesis method to perform speech synthesis on the target acoustic features corresponding to the speech frames. For example, the target acoustic features can be restored to corresponding waveforms, and the target speech data stream obtained by a waveform concatenation synthesis method.
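The waveform concatenation mentioned above might be sketched as follows. This is a minimal illustration under the assumption that per-frame waveform snippets overlap at their boundaries, where a linear crossfade avoids audible clicks; it is not the embodiment's actual synthesis method:

```python
def concatenate_waveforms(snippets, overlap):
    """Join per-frame waveform snippets into one stream, linearly
    crossfading 'overlap' samples at each boundary so the splice
    does not click -- a minimal form of waveform-concatenation
    synthesis."""
    out = list(snippets[0])
    for snip in snippets[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the new snippet
            out[-overlap + i] = out[-overlap + i] * (1 - w) + snip[i] * w
        out.extend(snip[overlap:])
    return out

a = [1.0] * 6   # toy snippet of constant amplitude 1
b = [0.0] * 6   # toy snippet of silence
stream = concatenate_waveforms([a, b], overlap=2)
print(len(stream), [round(x, 2) for x in stream[4:8]])
# -> 10 [0.67, 0.33, 0.0, 0.0]
```

The two overlapped samples ramp smoothly from the first snippet's level to the second's, instead of jumping discontinuously at the splice point.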
In an optional embodiment of the present invention, after successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames, the method may further include:
Step S31: sequentially inputting the source acoustic features corresponding to the speech frames in the speech frame sequence into an acoustic model, to output the acoustic state probabilities corresponding to the speech frames;
Step S32: recording the acoustic state probabilities corresponding to the speech frames.
Accordingly, converting the collected source speech data stream in real time into a target speech data stream having target acoustic features according to the source acoustic features may specifically include: inputting the source acoustic features and acoustic state probabilities corresponding to the speech frames in the speech frame sequence into a speech synthesis network, converting the source acoustic features of the speech frames into target acoustic features through the speech synthesis network, and synthesizing the target acoustic features to obtain the target speech data stream; where, in the process of converting the source acoustic features of a speech frame into target acoustic features, the speech synthesis network calculates the target acoustic features of the current speech frame according to the target acoustic features and acoustic state probability of the previous speech frame.
After successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames, the source acoustic features can be sequentially input into the acoustic model to output the acoustic state probabilities corresponding to the speech frames.
The acoustic model describes the relationship between speech frames and acoustic state probabilities. It is understood that speech recognition may also include other steps; for example, after the acoustic state probabilities are obtained, the speech frames can further be converted into text according to a language model, a pronunciation dictionary, and the like, completing full speech recognition.
In general, speech synthesis converts text information into speech information. Specifically, the text information to be converted into speech can be input into a speech synthesis network, which predicts the acoustic features corresponding to the text and the state duration information of the acoustic states the text contains; the speech corresponding to the text can then be synthesized from the predicted acoustic features and the state duration information of the contained acoustic states.
However, in order to change the voice of the speech data in real time, this embodiment performs speech recognition on the source speech data stream collected in real time. After the acoustic features and acoustic state probabilities corresponding to the first speech frame in the source speech data stream are obtained, the source acoustic features and acoustic state probabilities corresponding to that speech frame can be input into the speech synthesis network to output the target speech data corresponding to that frame. This avoids the subsequent operations of converting speech frames into text and then predicting acoustic features and state duration information from the text, and avoids waiting for all of the user's speech data to finish recording before converting. The speech synthesis network directly uses the source acoustic features extracted from the speech frames and the recorded acoustic states, so it can support stream processing of speech frames; that is, starting from the user's first speech frame, each frame can have its voice changed as soon as it is received and the changed speech played back, so that the changed voice of the user is heard while the user is still speaking. Embodiments of the present invention thus not only achieve a real-time voice change, adding interest to users' lives and meeting their diversified demands, but also reduce the number of operations in the voice change process, further improving voice-change efficiency.
Further, speech does not consist of a single frequency; it is the superposition of simple harmonic vibrations of many frequencies, and the superposition of these component frequencies forms multiple wave crests of different amplitudes. In other words, each speech frame is related to the frames before and after it; if a speech frame were processed in isolation, errors would arise during speech processing. Therefore, in this embodiment, in the process of converting the source acoustic features of a speech frame into target acoustic features, the speech synthesis network calculates the target acoustic features of the current speech frame according to the target acoustic features and acoustic state probability of the previous speech frame; that is, the state information h(t) of the current speech frame depends on the acoustic features x(t) of the current speech frame and the state information h(t-1) of the previous speech frame. Processing speech frames according to this contextual information improves the accuracy of the speech recognition and speech synthesis processes.
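The dependency h(t) = f(x(t), h(t-1)) described above can be sketched with a minimal recurrent cell. The weights and the one-dimensional features below are arbitrary toy assumptions, not the embodiment's actual network:

```python
import math

def step(x_t, h_prev, w_x=0.5, w_h=0.5):
    """One step of a minimal recurrent cell: the state h(t) for the
    current speech frame depends on that frame's feature x(t) and on
    the previous frame's state h(t-1). Weights are toy values."""
    return math.tanh(w_x * x_t + w_h * h_prev)

features = [0.2, 0.9, 0.1, 0.8]   # dummy one-dimensional frame features
h = 0.0
states = []
for x in features:
    h = step(x, h)                # context flows frame to frame via h
    states.append(h)
print(all(-1.0 < s < 1.0 for s in states))  # -> True
```

Because each state folds in the previous one, the output for a frame is not a function of that frame alone, which is exactly the context dependence the text motivates.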
In the process of changing the voice of the collected speech data in real time, this embodiment can input the source speech data stream, as it is collected, frame by frame into a speech recognition network; the speech recognition network performs speech recognition on the successively input speech frames to obtain the source acoustic features and state information of each speech frame. The source acoustic features and state information output by the speech recognition network are then input into the speech synthesis network, which performs voice change processing frame by frame and outputs speech frames whose source acoustic features have been converted into target acoustic features, thereby obtaining the target speech data stream.
The speech recognition network and/or the speech synthesis network may combine multiple neural networks, including but not limited to combinations, superpositions, or nestings of at least one of the following: a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, an RNN (Recurrent Neural Network), an attention neural network, and so on. It will be appreciated that embodiments of the present invention place no restriction on the types or training methods of the speech recognition network and the speech synthesis network.
In summary, according to embodiments of the present invention, while the source speech data stream formed by collecting speech data in real time is being obtained, acoustic feature extraction can be performed on the source speech data stream to obtain its corresponding source acoustic features, and then, according to the source acoustic features, the collected source speech data stream can be successively converted in real time into a target speech data stream having target acoustic features, where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics. Thus, the source speech data stream can be collected in real time while being converted into the target speech data stream. Since the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics, while user A is speaking, user A's voice can be converted into the voice of user B with the content of user A's speech unchanged. A real-time voice change is thereby achieved, greatly improving voice-change efficiency, adding interest to users' lives, and meeting users' diversified demands.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should understand that embodiments of the present invention are not limited by the described order of actions, since according to embodiments of the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by every embodiment of the present invention.
Device Embodiment
Referring to Fig. 2, a structural block diagram of a speech processing device embodiment of the present invention is shown. The device may specifically include:
a data obtaining module 201, configured to obtain a source speech data stream, where the source speech data stream is formed from speech data collected in real time;
a feature extraction module 202, configured to perform acoustic feature extraction on the source speech data stream to obtain source acoustic features corresponding to the source speech data stream; and
a data conversion module 203, configured to, according to the source acoustic features, successively convert the collected source speech data stream in real time into a target speech data stream having target acoustic features; where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
Optionally, the feature extraction module 202 may specifically include:
a framing submodule, configured to perform framing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream; and
a feature extraction submodule, configured to successively perform acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic features corresponding to the speech frames; where the source acoustic features include: the source speech content corresponding to the speech frame, and the source timbre characteristics corresponding to the speech frame.
Optionally, the data conversion module 203 may specifically include:
a content extraction submodule, configured to, for the speech frames in the speech frame sequence, successively extract the source speech content from the source acoustic features corresponding to each speech frame;
a feature conversion submodule, configured to generate the target acoustic features corresponding to the speech frame according to the target timbre characteristics and the extracted source speech content; where the target timbre characteristics are different from the source timbre characteristics; and
a speech synthesis submodule, configured to perform speech synthesis on the target acoustic features corresponding to the speech frames to obtain the target speech data stream.
Optionally, the device may further include:
a state determination module, configured to sequentially input the source acoustic features corresponding to the speech frames in the speech frame sequence into an acoustic model, to output the acoustic state probabilities corresponding to the speech frames; and
a state recording module, configured to record the acoustic state probabilities corresponding to the speech frames.
The data conversion module 203 may then specifically include:
a data conversion submodule, configured to input the source acoustic features and acoustic state probabilities corresponding to the speech frames in the speech frame sequence into a speech synthesis network, convert the source acoustic features of the speech frames into target acoustic features through the speech synthesis network, and synthesize the target acoustic features to obtain the target speech data stream; where, in the process of converting the source acoustic features of a speech frame into target acoustic features, the speech synthesis network calculates the target acoustic features of the current speech frame according to the target acoustic features and acoustic state probability of the previous speech frame.
Optionally, the timbre characteristics may include at least any one of the following: fundamental frequency, spectrum, and speaking rate.
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other.
Regarding the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the related method embodiments and will not be elaborated here.
An embodiment of the present invention provides a device for speech processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for: obtaining a source speech data stream, where the source speech data stream is formed from speech data collected in real time; performing acoustic feature extraction on the source speech data stream to obtain source acoustic features corresponding to the source speech data stream; and, according to the source acoustic features, successively converting the collected source speech data stream in real time into a target speech data stream having target acoustic features; where the target acoustic features and the source acoustic features contain the same speech content but different timbre characteristics.
Fig. 3 is a block diagram of a device 800 for speech processing according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, medical equipment, fitness equipment, a personal digital assistant, and so on.
Referring to Fig. 3, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions, so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and the other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation on the device 800. Examples of such data include instructions for any application or method operated on the device 800, contact data, phone book data, messages, pictures, video, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 provides power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen providing an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front and rear cameras may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the device 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors to provide status assessments of various aspects of the device 800. For example, the sensor component 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor component 814 may also detect a change in position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 800 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, where the instructions may be executed by the processor 820 of the device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 4 is a structural schematic diagram of a server in some embodiments of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), a memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient storage or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
There is provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform the speech processing method shown in Fig. 1.
There is provided a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform a speech processing method, the method comprising: obtaining a source speech data stream, the source speech data stream being formed from speech data collected in real time; performing acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream; and according to the source acoustic feature, successively converting, in real time, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
Embodiments of the present invention disclose A1, a speech processing method, comprising:
obtaining a source speech data stream, the source speech data stream being formed from speech data collected in real time;
performing acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream;
according to the source acoustic feature, successively converting, in real time, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
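The three-step method above (obtain a real-time stream, extract features, convert frame by frame) can be sketched as a minimal streaming loop. The helper names (`extract_features`, `convert`) and the frame size are illustrative placeholders, not part of the disclosed embodiment; the stand-in conversion simply passes the samples through where a real system would replace the timbre while preserving the content.

```python
import numpy as np

FRAME = 160  # samples per frame (10 ms at 16 kHz) -- an assumed configuration


def extract_features(frame: np.ndarray) -> np.ndarray:
    """Stand-in acoustic feature extractor: the raw frame plus log energy."""
    energy = np.log(np.sum(frame ** 2) + 1e-8)
    return np.append(frame, energy)


def convert(features: np.ndarray) -> np.ndarray:
    """Stand-in conversion: returns the frame samples unchanged; a real
    system would swap the timbre characteristic here."""
    return features[:-1]


def voice_change_stream(source_stream):
    """Consume source speech frames as they arrive and yield target frames."""
    for frame in source_stream:          # obtain the stream in real time
        feats = extract_features(frame)  # acoustic feature extraction
        yield convert(feats)             # frame-wise conversion


# usage: feed a sequence of captured frames, collect converted frames
frames = [np.zeros(FRAME) for _ in range(3)]
out = list(voice_change_stream(frames))
```

Because `voice_change_stream` is a generator, each target frame is produced as soon as its source frame arrives, matching the real-time character of the method.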
A2. The method according to A1, wherein the performing acoustic feature extraction on the source speech data stream to obtain the source acoustic feature corresponding to the source speech data stream comprises:
performing framing processing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain a source acoustic feature corresponding to each speech frame; wherein the source acoustic feature comprises source speech content corresponding to the speech frame and a source timbre characteristic corresponding to the speech frame.
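The framing processing described above can be illustrated with a short sketch. The 25 ms window and 10 ms hop at 16 kHz are common speech-processing defaults assumed here for illustration, not values taken from the disclosure.

```python
import numpy as np


def frame_signal(signal: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Split a 1-D signal into overlapping frames (25 ms window, 10 ms hop
    at 16 kHz -- assumed defaults). Returns an array of shape
    (num_frames, frame_len); trailing samples that do not fill a full
    frame are dropped."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])


sig = np.arange(1600, dtype=float)  # 100 ms of placeholder audio at 16 kHz
frames = frame_signal(sig)
# each row of `frames` is one speech frame of the resulting frame sequence
```

Per-frame acoustic feature extraction would then run over the rows of `frames` in order, preserving the stream's temporal order.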
A3. The method according to A2, wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
for each speech frame in the speech frame sequence, successively extracting the source speech content from the source acoustic feature corresponding to the speech frame;
generating a target acoustic feature corresponding to the speech frame according to a target timbre characteristic and the extracted source speech content; wherein the target timbre characteristic is different from the source timbre characteristic;
performing speech synthesis on the target acoustic feature corresponding to the speech frame to obtain the target speech data stream.
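One way to read the per-frame conversion above is as a "content in, new timbre out" loop. The sketch below fakes both halves with dictionaries so that only the control flow is visible; the feature layout (a `content` field alongside an `f0` timbre field) is an assumption made for illustration, not the disclosed representation.

```python
# Hypothetical per-frame conversion loop: keep the content part of each
# source acoustic feature, replace the timbre part with the target's.
def convert_frame(src_feature: dict, target_timbre: dict) -> dict:
    content = src_feature["content"]  # extracted source speech content
    # target acoustic feature = same content + target timbre characteristic
    return {"content": content, **target_timbre}


def convert_stream(frame_features, target_timbre):
    return [convert_frame(f, target_timbre) for f in frame_features]


src = [{"content": "ni", "f0": 220.0}, {"content": "hao", "f0": 210.0}]
tgt = convert_stream(src, {"f0": 110.0})  # assumed target-speaker pitch
# tgt keeps the spoken content but carries the target timbre
```

The design point is that the content and timbre components are separable, so only the timbre component needs to change for the output to sound like a different speaker saying the same words.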
A4. The method according to A2, wherein after successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain the source acoustic feature corresponding to each speech frame, the method further comprises:
successively inputting the source acoustic feature corresponding to each speech frame in the speech frame sequence into an acoustic model, so as to output an acoustic state probability corresponding to the speech frame;
recording the acoustic state probability corresponding to the speech frame;
and wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
inputting the source acoustic feature and the acoustic state probability corresponding to each speech frame in the speech frame sequence into a speech synthesis network, converting the source acoustic feature of the speech frame into a target acoustic feature through the speech synthesis network, and synthesizing the target acoustic feature to obtain the target speech data stream;
wherein, in the process of converting the source acoustic feature of the speech frame into the target acoustic feature, the speech synthesis network calculates the target acoustic feature of the current speech frame according to the target acoustic feature and the acoustic state probability of the previous speech frame.
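The recurrence described above, where each target frame depends on the previous target frame and the recorded acoustic state probability, can be sketched as a simple autoregressive update. The blending rule below is invented purely to make the recurrence concrete; it is not the disclosed speech synthesis network.

```python
import numpy as np


def synthesize(src_features, state_probs, alpha=0.5):
    """Toy autoregressive conversion: each target feature blends the
    current source feature with the previous *target* feature, weighted
    by the recorded acoustic state probability. A confident frame
    (p near 1) follows the source; an uncertain frame leans on the
    previous target frame for continuity. Illustrative rule only."""
    targets = []
    prev = np.zeros_like(src_features[0])
    for feat, p in zip(src_features, state_probs):
        cur = p * feat + (1.0 - p) * (alpha * prev + (1.0 - alpha) * feat)
        targets.append(cur)
        prev = cur  # the recurrence: next frame sees this target frame
    return targets


feats = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
probs = [1.0, 0.5, 0.8]  # recorded acoustic state probabilities (assumed)
out = synthesize(feats, probs)
```

Conditioning on the previous target frame in this way is what keeps the converted stream smooth across frame boundaries, which is the stated purpose of feeding the recorded state probabilities back into the conversion.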
A5. The method according to any one of A1 to A4, wherein the timbre characteristic comprises at least one of the following features: fundamental frequency, spectrum, and speech rate.
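Of the three timbre characteristics just listed, fundamental frequency is the easiest to make concrete. A minimal autocorrelation-based F0 estimate for one voiced frame might look like the following; the sample rate and pitch range are assumed parameter choices, not values from the disclosure.

```python
import numpy as np


def estimate_f0(frame: np.ndarray, sr: int = 16000,
                fmin: float = 50.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of one voiced frame by picking
    the autocorrelation peak within a plausible pitch-lag range."""
    frame = frame - frame.mean()
    # autocorrelation at non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)  # lag range for [fmin, fmax]
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag


sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 200.0 * t)  # synthetic 200 Hz tone
f0 = estimate_f0(frame)                # close to 200 Hz
```

Lowering or raising the estimated F0 of every frame is one simple way a conversion system can shift the perceived timbre toward a different speaker while leaving the spoken content intact.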
Embodiments of the present invention disclose B6, a speech processing apparatus, comprising:
a data obtaining module, configured to obtain a source speech data stream, the source speech data stream being formed from speech data collected in real time;
a feature extraction module, configured to perform acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream;
a data conversion module, configured to successively convert, in real time according to the source acoustic feature, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
B7. The apparatus according to B6, wherein the feature extraction module comprises:
a framing submodule, configured to perform framing processing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
a feature extraction submodule, configured to successively perform acoustic feature extraction on the speech frames in the speech frame sequence to obtain a source acoustic feature corresponding to each speech frame; wherein the source acoustic feature comprises source speech content corresponding to the speech frame and a source timbre characteristic corresponding to the speech frame.
B8. The apparatus according to B7, wherein the data conversion module comprises:
a content extraction submodule, configured to, for each speech frame in the speech frame sequence, successively extract the source speech content from the source acoustic feature corresponding to the speech frame;
a feature conversion submodule, configured to generate a target acoustic feature corresponding to the speech frame according to a target timbre characteristic and the extracted source speech content; wherein the target timbre characteristic is different from the source timbre characteristic;
a speech synthesis submodule, configured to perform speech synthesis on the target acoustic feature corresponding to the speech frame to obtain the target speech data stream.
B9. The apparatus according to B7, wherein the apparatus further comprises:
a state determination module, configured to successively input the source acoustic feature corresponding to each speech frame in the speech frame sequence into an acoustic model, so as to output an acoustic state probability corresponding to the speech frame;
a state recording module, configured to record the acoustic state probability corresponding to the speech frame;
and wherein the data conversion module comprises:
a data conversion submodule, configured to input the source acoustic feature and the acoustic state probability corresponding to each speech frame in the speech frame sequence into a speech synthesis network, convert the source acoustic feature of the speech frame into a target acoustic feature through the speech synthesis network, and synthesize the target acoustic feature to obtain the target speech data stream; wherein, in the process of converting the source acoustic feature of the speech frame into the target acoustic feature, the speech synthesis network calculates the target acoustic feature of the current speech frame according to the target acoustic feature and the acoustic state probability of the previous speech frame.
B10. The apparatus according to any one of B6 to B9, wherein the timbre characteristic comprises at least one of the following features: fundamental frequency, spectrum, and speech rate.
Embodiments of the present invention disclose C11, an apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:
obtaining a source speech data stream, the source speech data stream being formed from speech data collected in real time;
performing acoustic feature extraction on the source speech data stream to obtain a source acoustic feature corresponding to the source speech data stream;
according to the source acoustic feature, successively converting, in real time, the obtained source speech data stream into a target speech data stream having a target acoustic feature; wherein the target acoustic feature and the source acoustic feature comprise the same speech content and different timbre characteristics.
C12. The apparatus according to C11, wherein the performing acoustic feature extraction on the source speech data stream to obtain the source acoustic feature corresponding to the source speech data stream comprises:
performing framing processing on the source speech data stream to obtain a speech frame sequence corresponding to the source speech data stream;
successively performing acoustic feature extraction on the speech frames in the speech frame sequence to obtain a source acoustic feature corresponding to each speech frame; wherein the source acoustic feature comprises source speech content corresponding to the speech frame and a source timbre characteristic corresponding to the speech frame.
C13. The apparatus according to C12, wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
for each speech frame in the speech frame sequence, successively extracting the source speech content from the source acoustic feature corresponding to the speech frame;
generating a target acoustic feature corresponding to the speech frame according to a target timbre characteristic and the extracted source speech content; wherein the target timbre characteristic is different from the source timbre characteristic;
performing speech synthesis on the target acoustic feature corresponding to the speech frame to obtain the target speech data stream.
C14. The apparatus according to C12, wherein the one or more programs, to be executed by the one or more processors, further include instructions for performing the following operations:
successively inputting the source acoustic feature corresponding to each speech frame in the speech frame sequence into an acoustic model, so as to output an acoustic state probability corresponding to the speech frame;
recording the acoustic state probability corresponding to the speech frame;
and wherein the converting, in real time according to the source acoustic feature, the obtained source speech data stream into the target speech data stream having the target acoustic feature comprises:
inputting the source acoustic feature and the acoustic state probability corresponding to each speech frame in the speech frame sequence into a speech synthesis network, converting the source acoustic feature of the speech frame into a target acoustic feature through the speech synthesis network, and synthesizing the target acoustic feature to obtain the target speech data stream; wherein, in the process of converting the source acoustic feature of the speech frame into the target acoustic feature, the speech synthesis network calculates the target acoustic feature of the current speech frame according to the target acoustic feature and the acoustic state probability of the previous speech frame.
C15. The apparatus according to any one of C11 to C14, wherein the timbre characteristic comprises at least one of the following features: fundamental frequency, spectrum, and speech rate.
Embodiments of the present invention disclose D16, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the speech processing method according to any one of A1 to A5.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the present invention. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional technical means in the art not disclosed in this disclosure. The specification and examples are to be considered exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
The speech processing method, the speech processing apparatus, and the apparatus for speech processing provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principles and embodiments of the present invention, and the above description of the embodiments is intended only to help in understanding the method of the present invention and its core ideas. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the application scope according to the ideas of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.