CN102568472A

CN102568472A - Voice synthesis system with speaker selection and realization method thereof

Info

Publication number: CN102568472A
Application number: CN2010105891201A
Authority: CN
Inventors: 吴悦
Original assignee: Shengle Information Technology Shanghai Co Ltd
Current assignee: Shengle Information Technology Shanghai Co Ltd
Priority date: 2010-12-15
Filing date: 2010-12-15
Publication date: 2012-07-11

Abstract

The invention discloses a speech synthesis system with selectable speakers and a method of implementing it. The system comprises a target speaker data extraction device, a model adaptation device, and a target speaker speech synthesis device. The method comprises the following steps: (A) the target speaker data extraction device collects speech data from a target speaker; (B) the model adaptation device generates a target speaker model from that speech data and stores it in a target speaker model library; and (C) after the user activates the system, the target speaker speech synthesis device performs speech synthesis. In its embedded mobile phone version, the system can read SMS messages and texts stored on the phone in the voice of a target speaker chosen by the user, which extends the phone's functions and makes obtaining information on the phone more enjoyable and interactive. The system can also be applied to platforms other than mobile phones.

Description

Speaker-selectable speech synthesis system and implementation method thereof

Technical field

The present invention relates to a speech synthesis system, in particular a speech synthesis system with selectable speakers. The invention also relates to a method of implementing this speech synthesis system.

Background technology

Current mobile phone platforms generally present the content of SMS messages and texts as written text only, a single format that is neither engaging nor very interactive. Speech synthesis can address this to some extent by converting the text into audio and reading it aloud to the user. However, most existing speech synthesis systems are simplistic: a system typically includes only one or two speakers and cannot satisfy users' diverse emotional needs. A user who dislikes the built-in voice may even come to resent using the system.

The prior art addresses this problem to some extent. For example, Chinese patent No. 200480010899.X, entitled "Source-dependent text-to-speech system", describes a method of generating speech from a text message. The method comprises determining a speech feature vector for the voice associated with the source of the text message and comparing that speech feature vector with a plurality of speaker models. Its shortcoming is that the speaker models are preset and fixed by the system, so it adapts poorly to user requirements.

Chinese patent No. 01116305.4, entitled "Method for generating personalized speech from text", introduces a specific method of generating an adapted model, but it does not set forth a concrete method of obtaining the target speaker's speech data.

Moreover, apart from the mobile phone platform, other platforms also currently lack a speech synthesis system that offers a good user experience.

Summary of the invention

The technical problem to be solved by the present invention is to provide a speech synthesis system with selectable speakers that is engaging and expressive. It not only makes communication between users more enjoyable (for example, exchanging SMS messages between mobile phone users) but also improves the user's reading experience.

To solve the above technical problem, the speaker-selectable speech synthesis system of the present invention comprises:

A target speaker data extraction device, used to extract the target speaker's speech data, which comprises audio data and corresponding text data. This device comprises: a recording module, used to record the target speaker's voice; a phoneme-feature text library, which supplies text for the target speaker to read aloud; and a speech recognition module, used to convert the recorded speech (audio data) of the target speaker into corresponding text data. The sound sources from which the recording module records the target speaker's voice include ambient sound and telephone call speech;

A model adaptation device, used to generate and select a designated target speaker model. This device comprises: a speaker conversion module, used to generate a target speaker model from the target speaker's speech data; and a target speaker model library, used to store target speaker models;

A target speaker speech synthesis device, used to generate synthesized speech of the target speaker reading a text aloud. This device comprises: a text analysis module, used to analyze the text to be read aloud; and a speech synthesis module, used to generate synthesized speech of the designated target speaker reading the specified text.

The speaker-selectable speech synthesis system of the present invention can be applied to speech synthesis on platforms including mobile phone platforms, e-mail platforms, and voice broadcast platforms.

Another technical problem to be solved by the present invention is to provide a method of implementing the above speech synthesis system.

To solve this technical problem, the method of implementing the speaker-selectable speech synthesis system of the present invention comprises the steps of:

(A) the target speaker data extraction device collects the target speaker's speech data;

(B) the model adaptation device generates a target speaker model from the target speaker's speech data and stores it in the target speaker model library;

(C) after the user activates the speech synthesis system, the target speaker speech synthesis device performs speech synthesis according to the following steps:

(1) the user specifies a text and a name;

In a speech synthesis system applied to a mobile phone platform, the user can specify the text and the name in either of the following ways:

1. the target speaker models in the speech synthesis system are bound to names in the phone's address book; an SMS message whose sender is a bound name serves as the specified text, and the sender's name serves as the specified name;

2. a text stored on the phone serves as the specified text, and the user specifies the name manually;

(2) the text analysis module analyzes the text;

(3) the speech synthesis module retrieves the corresponding model from the target speaker model library according to the name and, based on the analysis result of the text analysis module, generates synthesized speech of the target speaker reading the text aloud;

(4) the synthesized speech is played.
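Sub-steps (1) to (4) of step (C) can be sketched roughly as follows. This is only an illustrative outline under stated assumptions: the names `SpeakerModel`, `analyze_text`, `synthesize`, and `read_aloud` are hypothetical, the text analysis is reduced to splitting on punctuation, and the "audio" is a list of labeled strings standing in for real waveforms.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpeakerModel:
    """Placeholder for an adapted target speaker model (contents are hypothetical)."""
    name: str


def analyze_text(text: str) -> List[str]:
    """Stand-in for the text analysis module: split the text at punctuation."""
    units: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in ",.!?;，。！？；":
            units.append(current.strip())
            current = ""
    if current.strip():
        units.append(current.strip())
    return units


def synthesize(model: SpeakerModel, units: List[str]) -> List[str]:
    """Stand-in for the speech synthesis module: tag each unit with the speaker."""
    return [f"<{model.name}> {unit}" for unit in units]


def read_aloud(
    text: str,
    name: str,
    library: Dict[str, SpeakerModel],
    play: Callable[[str], None],
) -> List[str]:
    """Run sub-steps (2) to (4) for a text and name already chosen in sub-step (1)."""
    model = library[name]             # (3) fetch the model by name
    units = analyze_text(text)        # (2) analyze the text
    audio = synthesize(model, units)  # (3) generate the synthesized speech
    for chunk in audio:               # (4) play it
        play(chunk)
    return audio
```

In this sketch a missing name raises `KeyError`; a real system would more likely prompt the user to pick another speaker.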

In step (A), the target speaker data extraction device can extract the target speaker's speech data in any one of the following modes, chosen by the user:

(1) the target speaker reads aloud a phoneme-feature text specified by the target speaker data extraction device while the recording module records; the specified phoneme-feature text serves as the text data and the recorded speech as the audio data;

The Chinese characters in the specified phoneme-feature text should cover all syllables;

(2) the target speaker reads aloud any free text while the recording module records, and the speech recognition module then converts the recorded speech into text; that text serves as the text data and the recorded speech as the audio data;

(3) the recording module records the target speaker's speech during a telephone call, and the speech recognition module then converts the recorded speech into text; that text serves as the text data and the recorded speech as the audio data.

In modes (2) and (3), the recording duration must meet the duration specified by the target speaker data extraction device. If a single recording is too short, multiple recordings are needed so that the total audio duration meets the requirement, and the combined audio that meets the requirement serves as the target speaker's audio data.
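The duration rule for modes (2) and (3) amounts to accumulating recordings until their total length reaches the required duration. A minimal sketch follows; the function name `collect_until_enough` and the 120-second figure in the usage note are assumptions, since the patent does not state the required duration.

```python
from typing import Iterable, List, Tuple


def collect_until_enough(
    durations: Iterable[float], required: float
) -> Tuple[List[float], bool]:
    """Accumulate recording durations (seconds) until the total meets `required`.

    Returns the recordings that were kept and whether the requirement was met.
    When it is not met, the caller should prompt the user to record again.
    """
    kept: List[float] = []
    total = 0.0
    for duration in durations:
        kept.append(duration)
        total += duration
        if total >= required:
            return kept, True
    return kept, False
```

For example, with a hypothetical requirement of 120 seconds, recordings of 40, 50, and 45 seconds are enough together, so a fourth recording would not be needed.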

To improve the quality of the synthesized target speaker voice, that is, to obtain a target speaker model whose parameters match the speaker closely, the speech synthesis system of the present invention includes texts containing complete phoneme features for the target speaker to read aloud and record. If the user dislikes this way of collecting data, the target speaker can instead read a text of any length while being recorded, or the user can record a telephone call with the target speaker; speech recognition then identifies the text content. In both cases the recording must meet the specified duration.

The speech synthesis system of the present invention applied to a mobile phone platform combines two functions: reading SMS messages and reading texts stored on the phone. The user can bind the system to names in the phone's address book so that SMS messages are read in the target speaker's voice, and can also have any text passage on the phone read in a target speaker's voice. Once a model in the system's target speaker model library is bound to a name in the address book, the user can have an SMS message received from that target speaker read in that person's own voice. For other texts stored on the phone, the system lets the user designate a target speaker to read them aloud.

The speech synthesis system of the present invention is therefore engaging and expressive, makes communication between users more enjoyable, and provides a varied reading experience. It can also be applied to platforms other than mobile phones, such as e-mail platforms and voice broadcast platforms.

Description of drawings

The present invention is described in further detail below with reference to the drawings and an embodiment:

Fig. 1 is a module diagram of the speech synthesis system of the present invention;

Fig. 2 is a schematic diagram of the operating flow of the system of the present invention;

Fig. 3 is a schematic flowchart of how the present invention collects target speaker data.

Embodiment

For a more concrete understanding of the technical content, features, and effects of the present invention, the speaker-selectable speech synthesis system for a mobile phone platform is taken as an example and described in detail below with reference to the illustrated embodiment:

The speaker-selectable speech synthesis system for a mobile phone platform of the present invention is an embedded version developed on a mobile phone operating system. It can synthesize the target speaker's voice to read that speaker's SMS messages aloud, or read a specified text on the phone in a target speaker's voice. The system comprises a target speaker data extraction device, a model adaptation device, and a target speaker speech synthesis device. Fig. 1 shows the module diagram of the system.

The target speaker data extraction device is used to extract the target speaker's speech data, which comprises audio data and corresponding text data. The device comprises:

a recording module, used to record the target speaker's voice; this module can record from ambient sound or from telephone call speech;

a phoneme-feature text library, which supplies text for the target speaker to read aloud;

a speech recognition module, used to convert the recorded speech of the target speaker into corresponding text data.

To accommodate different user preferences, the user can choose any one of the following three modes for the target speaker data extraction device to extract the target speaker's speech data (see Fig. 3):

(1) the target speaker reads aloud a phoneme-feature text that the target speaker data extraction device takes from the phoneme-feature text library while the recording module records; the specified phoneme-feature text serves as the text data and the recorded speech as the audio data;

The Chinese characters in the specified phoneme-feature text cover all syllables;

(2) if the user finds reading a specified text tedious, the target speaker can read aloud any free text while the recording module records, and the speech recognition module then converts the recorded speech into text; that text serves as the text data and the recorded speech as the audio data;

(3) the user can use the recording module to record a mobile phone call with the target speaker, and the speech recognition module then converts the recorded speech into text; that text serves as the text data and the recorded speech as the audio data. This mode is not limited by distance, but the audio quality is further reduced.

Mode (1) yields the highest data quality. In the latter two modes, (2) and (3), the recording duration must meet the duration specified by the target speaker data extraction device. If a single recording is too short, multiple recordings are needed so that the total audio duration meets the requirement, and the combined audio that meets the requirement serves as the target speaker's audio data.
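The requirement in mode (1) that the prompt text cover all syllables can be checked mechanically. The sketch below is illustrative only: `missing_syllables`, the character-to-syllable lexicon, and the three-syllable inventory in the usage note are hypothetical, and the lookup assumes one reading per character, which ignores polyphonic characters.

```python
from typing import Dict, Set


def missing_syllables(
    text: str, char_to_syllable: Dict[str, str], inventory: Set[str]
) -> Set[str]:
    """Return the syllables from `inventory` that `text` does not cover.

    `char_to_syllable` maps each Chinese character to a single syllable
    (a simplification: polyphonic characters are not handled).
    Characters missing from the lexicon, such as punctuation, are skipped.
    """
    covered = {char_to_syllable[ch] for ch in text if ch in char_to_syllable}
    return inventory - covered
```

With a toy lexicon mapping 你 to "ni", 好 to "hao", and 吗 to "ma", the text "你好" leaves "ma" uncovered, while "你好吗" covers the whole toy inventory. A prompt text would be acceptable for mode (1) only when the returned set is empty.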

The model adaptation device is used to generate and select a designated target speaker model. The device comprises:

a speaker conversion module, used to generate a target speaker model from the target speaker's speech data;

a target speaker model library, used to store target speaker models.

After the speaker conversion module of the model adaptation device obtains the target speaker data, it can use an existing adaptation technique to map the model parameters of a source speaker model using the target speaker data, which yields the target speaker model; the resulting model is then stored in the target speaker model library. If the user wishes, a model in the library can be bound to a name in the phone's address book and used to read SMS messages sent by that speaker.
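The patent does not name the adaptation technique it relies on. Purely as an illustration, the sketch below uses maximum a posteriori (MAP) adaptation of a single Gaussian mean, a widely used technique that is swapped in here as an assumption rather than taken from the source. The function name `map_adapt_mean`, the prior weight `tau`, and the toy feature vectors are hypothetical.

```python
from typing import List, Sequence


def map_adapt_mean(
    source_mean: Sequence[float],
    frames: Sequence[Sequence[float]],
    tau: float = 10.0,
) -> List[float]:
    """MAP-adapt one Gaussian mean vector toward the target speaker's frames.

    adapted = (tau * source_mean + sum(frames)) / (tau + n)

    `tau` (> 0) weights the source speaker model: with few target frames the
    result stays near the source mean, with many it approaches the target mean.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    dim = len(source_mean)
    sums = [0.0] * dim
    for frame in frames:
        if len(frame) != dim:
            raise ValueError("frame dimension does not match the source mean")
        for i, value in enumerate(frame):
            sums[i] += value
    n = len(frames)
    return [(tau * source_mean[i] + sums[i]) / (tau + n) for i in range(dim)]
```

A real speaker model has many such parameters, and other adaptation schemes exist; the point is only that a source model is shifted toward the target data rather than trained from scratch, which is why a limited amount of recorded speech can suffice.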

The target speaker speech synthesis device is used to generate synthesized speech of the target speaker reading the text specified by the user. The device comprises:

a front-end text analysis module, used to analyze the text to be read aloud, for example how each character in the text is pronounced and where pauses fall;

a back-end speech synthesis module, used to generate synthesized speech of the designated target speaker reading the specified text.

In the target speaker speech synthesis device, the user can specify the text either by receiving an SMS message from a target speaker or by selecting an existing text on the phone. The target speaker is determined correspondingly: in the first case the speaker bound in the phone's address book is selected automatically, and in the second the user designates the speaker manually. The analysis result from the front-end text analysis module is passed to the back-end speech synthesis module, which generates the synthesized speech for the text.

The method of implementing the speech synthesis system of the present invention is described in further detail below. As shown in Fig. 2, its steps comprise:

(A) The target speaker data extraction device collects the target speaker's speech data:

The user invites the desired target speaker to take part in data collection. The target speaker can choose whichever of the three modes described above (see Fig. 3) he or she prefers. Note that in the modes that do not involve reading the text specified by the system, the recording duration must meet the duration the system requires.

For example, user Zhang San wants Li Si's speech data in order to prepare for model adaptation. If Li Si is present, Zhang San invites Li Si to read the phoneme-feature text provided by the target speaker data extraction device while the recording module records; when recording ends, the system saves the audio data and uses the specified phoneme-feature text as the text data. If Li Si does not want to read the text, Zhang San asks him to say a few casual words, that is, arbitrary text, and records him with the recording module. When the recording duration does not meet the duration specified by the target speaker data extraction device, the device gives a corresponding prompt, and Zhang San can record Li Si one or more further times. After recording ends, the speech recognition module recognizes the corresponding text from the audio as the text data. If Li Si is not present, Zhang San calls Li Si and uses the recording module to record what Li Si says during the call. When the recording duration does not meet the duration specified by the target speaker data extraction device, the device gives a corresponding prompt, and Zhang San can call Li Si and record one or more further times. After recording ends, the speech recognition module recognizes the corresponding text from the audio as the text data.
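The Zhang San and Li Si scenario reduces to a choice among the three modes followed by pairing the audio with its text. The sketch below is one possible reading, not the patent's implementation: `Mode`, `SpeechData`, `choose_mode`, and `build_speech_data` are hypothetical names, and `recognize` stands in for the speech recognition module.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    PROMPTED_TEXT = 1  # mode (1): read the phoneme-feature text
    FREE_TEXT = 2      # mode (2): say anything, text comes from recognition
    PHONE_CALL = 3     # mode (3): record a call, text comes from recognition


@dataclass
class SpeechData:
    audio: bytes
    text: str


def choose_mode(speaker_present: bool, willing_to_read_prompt: bool) -> Mode:
    """Pick a mode the way the example does: a call if absent, otherwise by preference."""
    if not speaker_present:
        return Mode.PHONE_CALL
    return Mode.PROMPTED_TEXT if willing_to_read_prompt else Mode.FREE_TEXT


def build_speech_data(
    mode: Mode,
    audio: bytes,
    prompt_text: Optional[str],
    recognize: Callable[[bytes], str],
) -> SpeechData:
    """Pair the recorded audio with its text data according to the mode."""
    if mode is Mode.PROMPTED_TEXT:
        if prompt_text is None:
            raise ValueError("mode (1) needs the prompt text that was read")
        return SpeechData(audio=audio, text=prompt_text)
    # Modes (2) and (3): the text is whatever the recognizer returns.
    return SpeechData(audio=audio, text=recognize(audio))
```

In modes (2) and (3) the text data is only as accurate as the recognizer, which is consistent with the statement that mode (1) gives the highest data quality.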

(B) The model adaptation device generates a target speaker model from the target speaker's speech data and stores it in the target speaker model library:

Once the target speaker data meets the requirements of the speech synthesis system, the system asks the user whether to perform model adaptation immediately. If the user chooses yes, the model adaptation device starts the speaker conversion module and begins model adaptation. If the user chooses no, adaptation can be started later, or the user can collect more speaker data before performing adaptation.

The resulting target speaker model is stored in the target speaker model library. After obtaining a model, the user can also train a new model with new data and overwrite the model in the library. After a target speaker model is saved, the system asks the user whether to bind it to the phone's address book. If the user chooses yes, the system opens the address book for the user to select a name, and the binding is completed once the user selects the corresponding name.

For example, suppose Zhang San has obtained Li Si's speech data. He selects Li Si's data in the system, uses the speaker conversion module to adapt a speech model for Li Si, and saves it in the target speaker model library. The system then asks whether to bind the model to a name in the phone's address book; Zhang San chooses yes and selects Li Si in the address book to complete the binding. Some time later Zhang San obtains a new segment of Li Si's speech data. He adapts a new speech model for Li Si with the new data, overwrites Li Si's original speech model in the target speaker model library, and binds it again to the name "Li Si" in the phone's address book.
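Storing, overwriting, and binding models as in this example could be organized along the following lines. The class `TargetSpeakerModelLibrary` and its methods are hypothetical, and models are represented by arbitrary Python objects.

```python
from typing import Dict, Optional


class TargetSpeakerModelLibrary:
    """Toy model library with optional bindings to address book names."""

    def __init__(self) -> None:
        self._models: Dict[str, object] = {}
        self._bindings: Dict[str, str] = {}  # contact name -> model key

    def store(self, key: str, model: object) -> None:
        """Save a model; an existing model under the same key is overwritten."""
        self._models[key] = model

    def bind(self, contact: str, key: str) -> None:
        """Bind an address book name to a stored model."""
        if key not in self._models:
            raise KeyError(key)
        self._bindings[contact] = key

    def model_for_contact(self, contact: str) -> Optional[object]:
        """Return the model bound to a contact, or None if nothing is bound."""
        key = self._bindings.get(contact)
        return self._models.get(key) if key is not None else None

    def model_by_key(self, key: str) -> Optional[object]:
        """Manual selection: fetch a model whether or not it is bound."""
        return self._models.get(key)
```

One design difference from the example: because a binding here points at a key rather than at a particular model object, overwriting Li Si's model keeps the binding valid without the explicit rebinding step the description mentions.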

(C) After the user activates the speech synthesis system, the target speaker speech synthesis device performs speech synthesis:

Once the user has activated the speech synthesis system and bound a model to a name, whenever an SMS message from that person is subsequently received the system asks whether to read it aloud. If the user chooses yes, the target speaker speech synthesis device synthesizes and plays the speech of that message being read aloud.

For example, Zhang San has bound Li Si's speech model to "Li Si" in the address book. When Zhang San receives an SMS message from Li Si, the system asks whether to read it aloud; Zhang San chooses yes, and the target speaker speech synthesis device synthesizes and plays speech of Li Si reading the message.

In addition, whether or not a target speaker model is bound to the phone's address book, it can be used to read other texts on the phone. The user opens the system, selects and opens a text document at a given path, and then manually selects a model from the target speaker model library. After confirmation, the target speaker speech synthesis device synthesizes and plays the target speaker's reading of the current page of the document.

For example, Zhang San wants a text document read aloud in Wang Wu's voice. Wang Wu's speech model is not bound to the phone's address book, but Zhang San can still manually select Wang Wu's speech model in the system and, after confirming, have the target speaker speech synthesis device synthesize speech of Wang Wu reading the text.

The embedded mobile phone version of the speech synthesis system described above can read SMS messages and texts on the phone in the voice of a target speaker chosen according to the user's preference. This extends the phone's functions and makes obtaining information on the phone more enjoyable and interactive.

Although only an example for a mobile phone platform has been described, it follows from the above that the present invention can equally be applied to other platforms, such as e-mail platforms and voice broadcast platforms.

The speaker-selectable speech synthesis system of the present invention and its implementation method generate the target speaker's speech model by adaptation on the target speaker's speech data, so the model library is dynamic and matches the speakers the user wants more closely. The invention also adopts concrete speech data collection methods that suit different scenarios while striving to make the speech data contain more complete phoneme features, which makes it possible to obtain a target speaker model whose parameters match more closely. The present invention therefore makes obtaining information more enjoyable and interactive for the user.

Claims

1. A speaker-selectable speech synthesis system, characterized in that the speech synthesis system comprises:

a target speaker data extraction device, used to extract the target speaker's speech data;

a model adaptation device, used to generate and select a designated target speaker model;

a target speaker speech synthesis device, used to generate synthesized speech of the target speaker reading a text aloud.

2. The speaker-selectable speech synthesis system of claim 1, characterized in that the target speaker data extraction device comprises: a recording module, used to record the target speaker's voice; a phoneme-feature text library, which supplies text for the target speaker to read aloud; and a speech recognition module, used to convert the recorded speech of the target speaker into corresponding text data;

the model adaptation device comprises: a speaker conversion module, used to generate a target speaker model from the target speaker's speech data; and a target speaker model library, used to store target speaker models;

the target speaker speech synthesis device comprises: a text analysis module, used to analyze the text to be read aloud; and a speech synthesis module, used to generate synthesized speech of the designated target speaker reading the specified text.

3. The speaker-selectable speech synthesis system of claim 1, characterized in that the speech synthesis system is applied to platforms including mobile phone platforms, e-mail platforms, and voice broadcast platforms.

4. The speaker-selectable speech synthesis system of claim 1, characterized in that, in the target speaker data extraction device, the target speaker's speech data comprises audio data and corresponding text data.

5. The speaker-selectable speech synthesis system of claim 2, characterized in that the sound sources from which the recording module records the target speaker's voice include ambient sound and telephone call speech.

6. A method of implementing the speaker-selectable speech synthesis system of any one of claims 1 to 5, comprising the steps of:

(C) after the user activates the speech synthesis system, the target speaker speech synthesis device performs speech synthesis.

7. The method of implementing the speaker-selectable speech synthesis system of claim 6, characterized in that, in step (A), the target speaker data extraction device extracts the target speaker's speech data in any one of the following modes:

8. The method of implementing the speaker-selectable speech synthesis system of claim 6, characterized in that, in step (C), the target speaker speech synthesis device performs speech synthesis according to the following steps:

(1) the user specifies a text and a name;

(2) the text analysis module analyzes the text;

(4) the synthesized speech is played.

9. The method of implementing the speaker-selectable speech synthesis system of claim 8, characterized in that, in step (1), in a speech synthesis system applied to a mobile phone platform, the user specifies the text and the name in the following ways:

2. a text stored on the phone serves as the specified text, and the user specifies the name manually.

10. The method of implementing the speaker-selectable speech synthesis system of claim 7, characterized in that the Chinese characters in the specified phoneme-feature text of mode (1) cover all syllables.

11. The method of implementing the speaker-selectable speech synthesis system of claim 7, characterized in that the recording duration in modes (2) and (3) must meet the duration specified by the target speaker data extraction device; if a single recording is too short, multiple recordings are needed so that the total audio duration meets the requirement of the target speaker data extraction device, and the combined audio that meets the requirement serves as the target speaker's audio data.