CN105931631A - Voice synthesis system and method - Google Patents
Voice synthesis system and method
- Publication number
- CN105931631A (application CN201610236400.1A)
- Authority
- CN
- China
- Prior art keywords
- information
- extract
- training
- synthesis
- image
- Prior art date
- 2016-04-15
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The invention provides a voice synthesis system and method. The method comprises: collecting several pieces of synthesis material information and preprocessing each piece to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information; predicting all of the synthesis feature information through a prediction model to generate acoustic parameter information; and generating voice synthesis result information according to the acoustic parameter information. By collecting text information together with at least one of voice information and image information, extracting the synthesis feature information, predicting with the prediction model, and finally generating voice, the system and method predict the emotion or context of a user from the features extracted from the voice and/or image information, and thereby synthesize personalized voice that expresses the user's emotion or context.
Description
Technical field
The present application relates to the field of speech synthesis technology, and in particular to a speech synthesis system and method.
Background technology
Existing text-to-speech (TTS) solutions fall mainly into two classes: concatenative (splicing) systems and parametric systems. What the two classes have in common is that both require text analysis. They differ in that a concatenative system draws on a large inventory of recorded speech fragments and, guided by the text analysis result, splices recorded fragments together to obtain synthesized speech, whereas a parametric system uses the text analysis result to generate speech parameters from a model, such as the fundamental frequency, and then converts those parameters into a waveform.
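By way of illustration only, the following minimal Python sketch contrasts the two classes; the unit names, the sine-wave "recordings" and the trivial renderer are assumptions for demonstration, not part of any patented method:

```python
import numpy as np

SR = 16000
# Stand-in "recordings": two 100 ms unit waveforms keyed by unit name
units = {name: np.sin(2 * np.pi * f * np.arange(SR // 10) / SR)
         for name, f in [("ni", 220.0), ("hao", 330.0)]}

def splice(sequence):
    """Class 1: concatenative synthesis joins recorded fragments."""
    return np.concatenate([units[u] for u in sequence])

def parametric(f0_track):
    """Class 2: a model-produced F0 track is rendered into a waveform."""
    phase = 2 * np.pi * np.cumsum(f0_track) / SR
    return np.sin(phase)

spliced = splice(["ni", "hao"])                              # spliced speech
generated = parametric(np.linspace(200.0, 180.0, SR // 2))   # falling F0 contour
```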
The models of existing systems are trained using only textual features, or offline acoustic features, and at prediction time only textual features are used. Such systems take no account of the user's expression, the surrounding environment, or the speech the user produces in different contexts and under different emotional states. Because of this inability to observe the surrounding environment and the user's state, existing systems generate speech that is not natural enough and lacks emotion: for the same text in different contexts, the same voice is generated every time.
Summary of the invention
In view of the above drawbacks and deficiencies of the prior art, it is desirable to provide a speech synthesis system and method that combine text with image and voice information to synthesize personalized speech expressing the user's emotion or context.
In a first aspect, the present invention provides a speech synthesis system, the system comprising:
a feature extraction unit, configured to collect several pieces of synthesis material information and to preprocess each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
a prediction unit, configured to predict each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
a synthesis unit, configured to generate voice synthesis result information according to the acoustic parameter information.
In a second aspect, the present invention provides a speech synthesis method, the method comprising:
collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
predicting each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
generating voice synthesis result information according to the acoustic parameter information.
The speech synthesis system and method provided by many embodiments of the present invention collect text information together with at least one of voice information and image information, extract each piece of synthesis feature information, predict through a prediction model, and finally generate voice. By predicting the user's emotion or context from the features extracted from voice and/or image information, they synthesize personalized speech expressing the user's emotion or context.
The speech synthesis system and method provided by some embodiments of the invention further collect text information and at least one of voice information and image information to extract each piece of training feature information and train the prediction model, broadening the types of input the prediction model can match and improving prediction accuracy.
Brief description of the drawings
Other features, objects and advantages will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 is a structural schematic diagram of a speech synthesis system in one embodiment of the invention.
Fig. 2 is a flow chart of a speech synthesis method in one embodiment of the invention.
Fig. 3 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
Fig. 4 is a flow chart of a speech synthesis method in a preferred embodiment of the invention.
Fig. 5 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
Detailed description of the invention
The application is described in further detail below in conjunction with the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments in the application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a structural schematic diagram of a speech synthesis system in one embodiment of the invention.
As shown in Fig. 1, in the present embodiment, the speech synthesis system provided by the invention includes a feature extraction unit 10, a prediction unit 30 and a synthesis unit 50.
The feature extraction unit 10 is configured to collect several pieces of synthesis material information and to preprocess each piece of synthesis material information to extract synthesis feature information.
The prediction unit 30 is configured to predict each piece of synthesis feature information through a prediction model to generate acoustic parameter information.
The synthesis unit 50 is configured to generate voice synthesis result information according to the acoustic parameter information.
In some embodiments, the synthesis material information includes text information and voice information; in other embodiments, it includes text information and image information; in still other embodiments, it includes text information, voice information and image information simultaneously.
In the present embodiment, the prediction unit 30 stores a trained prediction model; in some embodiments, it can further receive updated prediction models by means such as wireless communication.
In the present embodiment, the synthesis unit 50 is a vocoder, and the acoustic parameter information includes the fundamental frequency and formant frequencies. In further embodiments, the synthesis unit 50 may use a different sound synthesis device according to actual requirements, with correspondingly different acoustic parameters.
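As a purely illustrative sketch of how such a vocoder might turn these two parameters into a waveform, consider a one-formant source-filter model in Python; the resonator design, bandwidth and gain normalization are assumptions, not the patent's vocoder:

```python
import numpy as np
from scipy.signal import lfilter

def resonator(f_formant, bandwidth, sr):
    """Second-order IIR resonator centred on a formant frequency."""
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2 * np.pi * f_formant / sr
    b = [1 - r * r]                             # rough gain normalization
    a = [1.0, -2 * r * np.cos(theta), r * r]
    return b, a

def synthesize(f0, f_formant, duration=0.5, sr=16000, bandwidth=100.0):
    """Impulse train at F0 shaped by a single formant resonator."""
    n = int(duration * sr)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0                # glottal impulse train at F0
    b, a = resonator(f_formant, bandwidth, sr)
    return lfilter(b, a, source)

wave = synthesize(f0=180.0, f_formant=700.0)    # a raised F0 suggests an excited tone
```

Raising or lowering F0 and the formant shifts the perceived tone, which is exactly the degree of freedom the prediction model controls.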
In some embodiments, the prediction model is a logistic regression model; in other embodiments, the prediction model is a deep neural network model.
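A hedged sketch of both model families on synthetic data follows; the feature dimensions, class labels and parameter targets are illustrative assumptions, and scikit-learn is one possible implementation choice, not named by the patent:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))          # combined text/voice/image feature vectors
tones = rng.integers(0, 3, size=200)    # e.g. 0 = neutral, 1 = gentle, 2 = agitated
params = rng.normal(size=(200, 2))      # e.g. [F0, formant] targets

clf = LogisticRegression(max_iter=1000).fit(X, tones)                         # logistic regression
reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, params)  # small deep network

x_new = rng.normal(size=(1, 40))
tone_class = clf.predict(x_new)[0]      # discrete tone / emotion class
f0, formant = reg.predict(x_new)[0]     # continuous acoustic parameters
```

The logistic regression yields a discrete tone or emotion class, while the small network regresses acoustic parameters directly; either output can drive the synthesis unit.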
Fig. 2 is a flow chart of a speech synthesis method in one embodiment of the invention. The speech synthesis method shown in Fig. 2 can be applied in the speech synthesis system shown in Fig. 1.
As shown in Fig. 2, in the present embodiment, the speech synthesis method provided by the invention includes:
S30: collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information includes text information, and at least one of voice information and image information.
S50: predicting each piece of synthesis feature information through a prediction model to generate acoustic parameter information.
S70: generating voice synthesis result information according to the acoustic parameter information.
Specifically, the prediction model predicts from the synthesis feature information extracted from the voice information and/or image information, thereby judging the user's emotion and context and generating the corresponding acoustic parameter information. Take the text "please shut the door" as an example: an existing speech synthesis system can generally only generate a single flat voice, without emotion or tone. In the present embodiment, by contrast, synthesis feature information corresponding to a gentle tone can be extracted from the voice information, or synthesis feature information corresponding to an angry expression can be extracted from the image information; the prediction model then predicts from this synthesis feature information and generates acoustic parameter information for a correspondingly gentle or agitated tone, ultimately producing voice synthesis result information that matches the user's tone or expression and expresses the user's emotion or context.
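To make the walkthrough concrete, here is a self-contained sketch in which a predicted tone class (hard-coded here, standing in for the prediction model's output) selects acoustic parameters for a vocoder such as the one sketched above; all parameter values are illustrative assumptions:

```python
# Map a predicted tone class to acoustic parameter templates (assumed values)
TONE_PARAMS = {
    "gentle":   {"f0": 140.0, "f_formant": 500.0},   # soft, lowered pitch
    "agitated": {"f0": 220.0, "f_formant": 900.0},   # raised pitch and formant
    "neutral":  {"f0": 170.0, "f_formant": 700.0},
}

def params_for(tone: str) -> dict:
    """Acoustic parameters for 'please shut the door' under a predicted tone."""
    return TONE_PARAMS.get(tone, TONE_PARAMS["neutral"])

print(params_for("gentle"))   # -> {'f0': 140.0, 'f_formant': 500.0}
```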
The speech synthesis system and method provided by the above embodiment collect text information together with at least one of voice information and image information, extract each piece of synthesis feature information, predict through a prediction model, and finally generate voice. By predicting the user's emotion or context from the features extracted from voice and/or image information, they synthesize personalized speech expressing the user's emotion or context.
Fig. 3 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
As shown in Fig. 3, in a preferred embodiment, the system further includes a model training unit 20. The feature extraction unit 10 is further configured to collect several pieces of training material information and to preprocess each piece of training material information to extract training feature information. The model training unit 20 is configured to train the prediction model according to each piece of training feature information.
As with the synthesis material information, in some embodiments the training material information includes text information and voice information; in other embodiments, text information and image information; and in still other embodiments, text information, voice information and image information simultaneously.
Specifically, in the present embodiment, the model training unit 20 can train a new prediction model, and can also further train a prediction model pre-stored in the prediction unit 30 to improve prediction accuracy.
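Continuing the earlier illustrative setup, a sketch of the model training unit's two roles; the data here is synthetic and scikit-learn is an assumed implementation choice:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X, y = rng.normal(size=(200, 40)), rng.normal(size=(200, 2))

# Role 1: train a new prediction model from scratch
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300).fit(X, y)

# Role 2: further train a pre-stored model on fresh training material;
# partial_fit keeps the existing weights and refines them
X_new, y_new = rng.normal(size=(50, 40)), rng.normal(size=(50, 2))
for _ in range(10):
    model.partial_fit(X_new, y_new)
```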
Fig. 4 is a flow chart of a speech synthesis method in a preferred embodiment of the invention. The speech synthesis method shown in Fig. 4 can be applied in the speech synthesis system shown in Fig. 3.
As shown in Fig. 4, in a preferred embodiment, the method further includes, before step S30:
S10: collecting several pieces of training material information, and preprocessing each piece of training material information to extract training feature information, wherein the training material information includes text information, and at least one of voice information and image information.
S20: training the prediction model according to each piece of training feature information.
The speech synthesis system and method provided by the above embodiment further collect text information and at least one of voice information and image information to extract each piece of training feature information and train the prediction model, broadening the types of input the prediction model can match and improving prediction accuracy.
Fig. 5 is a structural schematic diagram of a speech synthesis system in a preferred embodiment of the invention.
As shown in Fig. 5, in a preferred embodiment, the feature extraction unit 10 includes:
a text feature extraction subunit 101, configured to collect first text information and to preprocess the first text information to extract first text feature information for prediction;
and preferably also at least one of the following:
a speech feature extraction subunit 103, configured to collect first voice information and to preprocess the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
an image feature extraction subunit 105, configured to collect first image information and to preprocess the first image information to extract first image feature information for predicting the user's expression.
Specifically, besides being used to predict the user's context, the first voice feature information can also be used to predict features of the acquisition environment, such as how noisy it is, to further improve prediction accuracy.
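One assumed heuristic for the "noisiness" feature (the patent names the feature but not a method): estimate the background level as the quiet-frame floor of short-time RMS energy, here using librosa:

```python
import numpy as np
import librosa

def noisiness(wav_path, frame_length=2048, hop_length=512):
    """Estimate ambient noise level as the quiet-frame floor of RMS energy."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    return float(np.percentile(rms, 10))   # low percentile ≈ background noise
```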
In a corresponding preferred embodiment, step S30 includes:
S301: collecting first text information, and preprocessing the first text information to extract first text feature information for prediction;
and preferably also at least one of the following:
S302: collecting first voice information, and preprocessing the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
S303: collecting first image information, and preprocessing the first image information to extract first image feature information for predicting the user's expression.
In a preferred embodiment, preprocessing the first text information includes performing text normalization and prosody prediction on the first text information.
Preprocessing the first voice information includes performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information.
Preprocessing the first image information includes performing face recognition on the first image information and extracting the relevant color, texture, shape and spatial relationship features.
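A sketch of the three named preprocessing steps using common open-source tools; librosa for MFCCs, OpenCV's stock Haar cascade standing in for face recognition, and a color histogram standing in for the color/texture/shape/spatial-relationship features — all library and feature choices are assumptions, not specified by the patent:

```python
import re
import cv2
import librosa

def text_preprocess(text):
    """Trivial stand-in for text normalization; real systems also expand
    numbers and abbreviations and run prosody prediction, omitted here."""
    return re.sub(r"\s+", " ", text.strip().lower())

def speech_preprocess(wav_path):
    """MFCC feature extraction, as named in the text."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def image_preprocess(img_path):
    """Face detection plus a simple color descriptor (assumes at least
    one face is detected in the image)."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    x, y, w, h = cascade.detectMultiScale(gray)[0]
    face = img[y:y + h, x:x + w]
    hist = cv2.calcHist([face], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return hist.flatten()
```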
In a preferred embodiment, the text feature extraction subunit 101 is further configured to collect second text information and to preprocess the second text information to extract second text feature information for training the prediction model.
The speech feature extraction subunit 103 is further configured to collect second voice information and to preprocess the second voice information to extract second voice feature information for training the prediction model.
The image feature extraction subunit 105 is further configured to collect second image information and to preprocess the second image information to extract second image feature information for training the prediction model.
In a corresponding preferred embodiment, step S10 includes:
S101: collecting second text information, and preprocessing the second text information to extract second text feature information for training the prediction model;
and preferably also at least one of the following:
S103: collecting second voice information, and preprocessing the second voice information to extract second voice feature information for training the prediction model;
S105: collecting second image information, and preprocessing the second image information to extract second image feature information for training the prediction model.
In some embodiments, the prediction model is a logistic regression model; in other embodiments, the prediction model is a deep neural network model.
The flow charts and block diagrams in the accompanying drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the invention. In this regard, each block in a flow chart or block diagram may represent a module, program segment or portion of code, which comprises one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be arranged in a processor; for example, the prediction unit 30 may be a software program provided in a computer or smart device, or a separate hardware device that performs prediction. The names of these units or modules do not in any case limit the units or modules themselves; for example, the prediction unit 30 may also be described as "a comparison unit for matching and scoring feature information against a model to generate parameters".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus of the above embodiments, or a stand-alone computer-readable storage medium not fitted into any device. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the methods described in the present application.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention involved in the application is not limited to technical solutions formed by the particular combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed herein.
Claims (12)
1. A speech synthesis system, characterised in that the system comprises:
a feature extraction unit, configured to collect several pieces of synthesis material information and to preprocess each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
a prediction unit, configured to predict each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
a synthesis unit, configured to generate voice synthesis result information according to the acoustic parameter information.
2. The speech synthesis system according to claim 1, characterised in that the feature extraction unit is further configured to collect several pieces of training material information and to preprocess each piece of training material information to extract training feature information, wherein the training material information comprises text information, and at least one of voice information and image information;
the system further comprises:
a model training unit, configured to train the prediction model according to each piece of training feature information.
3. The speech synthesis system according to claim 1 or 2, characterised in that the feature extraction unit comprises:
a text feature extraction subunit, configured to collect first text information and to preprocess the first text information to extract first text feature information for prediction;
and preferably further comprises at least one of the following:
a speech feature extraction subunit, configured to collect first voice information and to preprocess the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
an image feature extraction subunit, configured to collect first image information and to preprocess the first image information to extract first image feature information for predicting the user's expression.
4. The speech synthesis system according to claim 3, characterised in that preprocessing the first text information comprises performing text normalization and prosody prediction on the first text information;
preprocessing the first voice information comprises performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information;
preprocessing the first image information comprises performing face recognition on the first image information and extracting the relevant color, texture, shape and spatial relationship features.
5. The speech synthesis system according to claim 3, characterised in that the text feature extraction subunit is further configured to collect second text information and to preprocess the second text information to extract second text feature information for training the prediction model;
the speech feature extraction subunit is further configured to collect second voice information and to preprocess the second voice information to extract second voice feature information for training the prediction model;
the image feature extraction subunit is further configured to collect second image information and to preprocess the second image information to extract second image feature information for training the prediction model.
6. The speech synthesis system according to claim 1, characterised in that the prediction model is a logistic regression model or a deep neural network model.
7. A speech synthesis method, characterised in that the method comprises:
collecting several pieces of synthesis material information, and preprocessing each piece of synthesis material information to extract synthesis feature information, wherein the synthesis material information comprises text information, and at least one of voice information and image information;
predicting each piece of synthesis feature information through a prediction model to generate acoustic parameter information;
generating voice synthesis result information according to the acoustic parameter information.
8. The speech synthesis method according to claim 7, characterised in that, before collecting the several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract the synthesis feature information, the method further comprises:
collecting several pieces of training material information, and preprocessing each piece of training material information to extract training feature information, wherein the training material information comprises text information, and at least one of voice information and image information;
training the prediction model according to each piece of training feature information.
9. The speech synthesis method according to claim 7 or 8, characterised in that collecting the several pieces of synthesis material information and preprocessing each piece of synthesis material information to extract the synthesis feature information comprises:
collecting first text information, and preprocessing the first text information to extract first text feature information for prediction;
and preferably further comprises at least one of the following:
collecting first voice information, and preprocessing the first voice information to extract first voice feature information for predicting the acquisition environment and the user's context;
collecting first image information, and preprocessing the first image information to extract first image feature information for predicting the user's expression.
10. The speech synthesis method according to claim 9, characterised in that preprocessing the first text information comprises performing text normalization and prosody prediction on the first text information;
preprocessing the first voice information comprises performing mel-frequency cepstral coefficient (MFCC) feature extraction and digitization on the first voice information;
preprocessing the first image information comprises performing face recognition on the first image information and extracting the relevant color, texture, shape and spatial relationship features.
11. The speech synthesis method according to claim 8, characterised in that collecting the several pieces of training material information and preprocessing each piece of training material information to extract the training feature information comprises:
collecting second text information, and preprocessing the second text information to extract second text feature information for training the prediction model;
and preferably further comprises at least one of the following:
collecting second voice information, and preprocessing the second voice information to extract second voice feature information for training the prediction model;
collecting second image information, and preprocessing the second image information to extract second image feature information for training the prediction model.
12. The speech synthesis method according to claim 7, characterised in that the prediction model is a logistic regression model or a deep neural network model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236400.1A CN105931631A (en) | 2016-04-15 | 2016-04-15 | Voice synthesis system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610236400.1A CN105931631A (en) | 2016-04-15 | 2016-04-15 | Voice synthesis system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105931631A true CN105931631A (en) | 2016-09-07 |
Family
ID=56838218
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610236400.1A Pending CN105931631A (en) | 2016-04-15 | 2016-04-15 | Voice synthesis system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105931631A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5704007A (en) * | 1994-03-11 | 1997-12-30 | Apple Computer, Inc. | Utilization of multiple voice sources in a speech synthesizer |
US5890115A (en) * | 1997-03-07 | 1999-03-30 | Advanced Micro Devices, Inc. | Speech synthesizer utilizing wavetable synthesis |
CN1460232A (en) * | 2001-03-29 | 2003-12-03 | 皇家菲利浦电子有限公司 | Text to visual speech system and method incorporating facial emotions |
WO2002084643A1 (en) * | 2001-04-11 | 2002-10-24 | International Business Machines Corporation | Speech-to-speech generation system and method |
CN1379392A (en) * | 2001-04-11 | 2002-11-13 | 国际商业机器公司 | Feeling speech sound and speech sound translation system and method |
CN101064104A (en) * | 2006-04-24 | 2007-10-31 | 中国科学院自动化研究所 | Emotion voice creating method based on voice conversion |
CN101474481A (en) * | 2009-01-12 | 2009-07-08 | 北京科技大学 | Emotional robot system |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106328139A (en) * | 2016-09-14 | 2017-01-11 | 努比亚技术有限公司 | Voice interaction method and voice interaction system |
CN107437413A (en) * | 2017-07-05 | 2017-12-05 | 百度在线网络技术(北京)有限公司 | voice broadcast method and device |
WO2019007308A1 (en) * | 2017-07-05 | 2019-01-10 | 百度在线网络技术(北京)有限公司 | Voice broadcasting method and device |
CN110288077A (en) * | 2018-11-14 | 2019-09-27 | 腾讯科技(深圳)有限公司 | A kind of synthesis based on artificial intelligence is spoken the method and relevant apparatus of expression |
CN110288077B (en) * | 2018-11-14 | 2022-12-16 | 腾讯科技(深圳)有限公司 | Method and related device for synthesizing speaking expression based on artificial intelligence |
CN110047462B (en) * | 2019-01-31 | 2021-08-13 | 北京捷通华声科技股份有限公司 | Voice synthesis method and device and electronic equipment |
CN110047462A (en) * | 2019-01-31 | 2019-07-23 | 北京捷通华声科技股份有限公司 | A kind of phoneme synthesizing method, device and electronic equipment |
WO2021004113A1 (en) * | 2019-07-05 | 2021-01-14 | 深圳壹账通智能科技有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
CN111312210A (en) * | 2020-03-05 | 2020-06-19 | 云知声智能科技股份有限公司 | Text-text fused voice synthesis method and device |
WO2022048405A1 (en) * | 2020-09-01 | 2022-03-10 | 魔珐(上海)信息科技有限公司 | Text-based virtual object animation generation method, apparatus, storage medium, and terminal |
US11908451B2 (en) | 2020-09-01 | 2024-02-20 | Mofa (Shanghai) Information Technology Co., Ltd. | Text-based virtual object animation generation method, apparatus, storage medium, and terminal |
CN114945110A (en) * | 2022-05-31 | 2022-08-26 | 深圳市优必选科技股份有限公司 | Speaking head video synthesis method and device, terminal equipment and readable storage medium |
CN114945110B (en) * | 2022-05-31 | 2023-10-24 | 深圳市优必选科技股份有限公司 | Method and device for synthesizing voice head video, terminal equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20160907 |