CN101222542B - Method for implementing Text-To-Speech function - Google Patents

Method for implementing Text-To-Speech function

Info

Publication number
CN101222542B
Authority
CN
China
Prior art keywords
text-to-speech
media resource
text
text string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2007101530700A
Other languages
Chinese (zh)
Other versions
CN101222542A (en)
Inventor
陈诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN2007101530700A priority Critical patent/CN101222542B/en
Publication of CN101222542A publication Critical patent/CN101222542A/en
Application granted granted Critical
Publication of CN101222542B publication Critical patent/CN101222542B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical



Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method for realizing a text-to-speech conversion function, in which a media resource control device controls a media resource processing device through the H.248 protocol to realize text-to-speech conversion. The method comprises the following steps: the media resource control device carries an extension package parameter in an H.248 message to instruct the media resource processing device to execute the text-to-speech conversion processing corresponding to the parameter; the media resource processing device calls a text-to-speech converter according to the parameter in the message to execute the conversion, and feeds the conversion result back to the media resource control device. With the method provided by the invention, users can be provided with service applications related to text-to-speech conversion in the media resource applications of mobile or fixed networks; at the same time, only the text needs to be modified when changes are made, without re-recording, and more personalized prompt tones can be played according to the user's requirements.

Description

Method for implementing a text-to-speech conversion function
Technical field
The present invention relates to a method for implementing a text-to-speech (TTS) conversion function, and in particular to a method that uses the H.248 protocol as the control protocol to implement the TTS conversion function.
Background art
Text-to-speech (TTS) conversion is a core speech technology. It converts text information into machine-synthesized speech and provides a convenient, friendly human-machine interface. Simply put, it converts a text string into speech: for the input text "hello", after TTS processing the spoken words "hello" are output.
In existing network systems, an application server usually has two methods for playing announcements to a user:
The first method is to play a pre-recorded announcement. For example, when a user calls another user and the call fails, the system may prompt the caller with "the subscriber you dialed is not available"; this prompt tone is recorded in advance and stored on the server device. A mature method for this already exists in the H.248 protocol family, for example H.248.9.
The second method is to use a TTS conversion function. When the call fails, the system converts the text "the subscriber you dialed is not available" into speech and outputs it to the user.
The benefits of using TTS conversion are:
(1) it is easy to modify: only the text needs to be changed, and no re-recording is needed;
(2) more personalized prompt tones can be played according to the user's requirements, for example using a male, female, or neutral voice.
The second method above is not defined in the H.248 protocol, yet media resource application environments need to use the TTS conversion function. In view of this, the present invention proposes a method for implementing the TTS conversion function through the H.248 protocol.
Summary of the invention
The present invention provides a method in which a media resource control device instructs, through the H.248 protocol, a media resource processing device to implement a text-to-speech (TTS) conversion function.
The method for implementing the TTS conversion function according to the present invention comprises the following steps:
Step 1: the media resource processing device receives an H.248 message carrying a TTS conversion indication sent by the media resource control device; the H.248 message carries an extension package parameter defined by an H.248 protocol extension package, instructing the media resource processing device to perform the TTS conversion processing corresponding to the parameter; and
Step 2: the media resource processing device invokes a TTS converter according to the parameter in the message to perform the TTS conversion processing, and feeds the TTS conversion result back to the media resource control device.
The extension package parameter carries information about a text string, and the media resource processing device invokes the TTS converter to perform the TTS conversion according to this text string information.
The text string information may be the text string itself, embedded in the H.248 message as a character string to be pronounced; after receiving it, the media resource processing device directly extracts the text string and invokes the TTS converter to perform the TTS conversion.
When the text string is stored in advance on the media resource processing device or on an external server, the text string information may comprise the identifier and storage location information of the text string file; after receiving them, the media resource processing device reads the text string from local storage or from the external server according to the storage location information, places it in a cache, and invokes the TTS converter to perform the TTS conversion.
The text string information may also comprise a text string and a text file of another text string, the text file comprising the identifier and storage location information of the other text string; the identifier of the text file and the text string are combined into one continuous text string, and a keyword is added before the file identifier to indicate that the combination is text to be pronounced. After receiving this combination, the media resource processing device first reads the text string from local storage or from the external server, concatenates it with the pronounceable text string carried in the H.248 message, places the result in a cache, and then invokes the TTS converter to perform the TTS conversion.
The text string information may also comprise a combination of a text string and a recorded audio file, with a keyword added before the text string to indicate that the combination is a voice file. After receiving this combination, the media resource processing device first invokes the TTS converter to convert the text string, and then combines the speech output of the conversion with the recorded audio file into one speech segment.
The text string information may also comprise a combination of a text file and a recorded audio file, the text file comprising the identifier and storage location information of another text string, with a keyword added before the identifier to indicate that the combination is a voice file. After receiving this combination, the media resource processing device first reads the text string from local storage or from the external server according to the storage location information and places it in a cache, then invokes the TTS converter to convert the retrieved text string, and combines the speech output of the conversion with the recorded audio file into one speech segment.
In the above method, the H.248 message further carries parameters related to the attributes of the speech output by the TTS conversion. These parameters comprise: pronunciation language, speaker gender, speaker age, and pauses, and may further include at least one of: speech rate, volume, pitch, pronunciation of special characters, stress, and whether to abort the TTS conversion when the user provides input. After receiving these parameters, the media resource processing device invokes the TTS converter and sets the corresponding attributes for the output speech.
While the media resource processing device invokes the TTS converter to perform the TTS conversion in step 2, the method further comprises:
Step 21: the media resource control device instructs the media resource processing device to detect abnormal events occurring during the speech recognition process.
When an abnormal event is detected, the media resource processing device feeds back an error code representing the abnormal event to the media resource control device.
Further, while the media resource processing device invokes the TTS converter to perform the TTS conversion in step 2, the method also comprises:
Step 22: the media resource control device controls the TTS conversion process.
In step 22, the control of the TTS conversion process by the media resource control device may comprise temporarily stopping the playing of the TTS-converted speech to the user, and resuming the playing state from that paused state.
In step 22, the control may comprise fast-forwarding or rewinding the playback; the fast-forward may skip a number of words, sentences, or paragraphs, or a number of seconds, and the rewind may likewise go back a number of words, sentences, or paragraphs, or a number of seconds.
In step 22, the control may comprise restarting the TTS conversion.
In step 22, the control comprises the user aborting the TTS conversion.
In step 22, the control comprises repeating the playback of the current sentence, the current paragraph, or the full text, and may further comprise cancelling such repeated playback.
With the method provided by the invention, service applications related to TTS conversion can be provided to users in the media resource applications of mobile or fixed networks, for example converting the content of a web page into speech and reading it to the user. At the same time, only the text needs to be modified when changes are made, without re-recording, and more personalized prompt tones can be played according to the user's requirements.
Description of drawings
Fig. 1 shows the network architecture for handling media resource services in a WCDMA IMS network.
Fig. 2 shows the network architecture for handling media resource services in a fixed softswitch network.
Fig. 3 is a flowchart of the method for implementing the TTS conversion function according to the present invention.
Embodiment
Fig. 1 shows the network architecture for handling media resource services in a WCDMA IMS network. The application server 1 handles various services, for example playing announcements to users, digit collection, conferencing, and recording. The serving call session control device 2 handles routing: it forwards messages sent by the application server to the media resource control device 3, and routes messages sent by the media resource control device 3 to the application server 1. The media resource control device 3 controls media resources: according to the requirements of the application server 1, it selects a corresponding media resource processing device 4 and controls the processing of the media resources. The media resource processing device 4 processes media resources and, under the control of the media resource control device 3, completes the media resource operations issued by the application server 1.
The interfaces between the application server 1, the serving call session control device 2, and the media resource control device 3 use the SIP protocol together with the XML protocol, or SIP together with an XML-like protocol (for example VXML). The interface between the media resource control device 3 and the media resource processing device 4 is Mp, which uses the H.248 protocol. The external interface of the media resource processing device 4 is Mb, which generally uses the RTP protocol to carry the user media stream.
Fig. 2 shows the network architecture for handling media resource services in a fixed softswitch network. The Media Resource Server (MRS) is equivalent in function to the media resource control device 3 and the media resource processing device 4 in the WCDMA IMS network, the application server is equivalent in function to the application server 1 and the serving call session control device 2 in the WCDMA IMS network, and the softswitch functions roughly the same as the application server 1.
The method provided by the present invention for implementing the TTS conversion function through the H.248 protocol can be applied to media resource handling in the WCDMA IMS network shown in Fig. 1 and in the fixed softswitch network shown in Fig. 2. It can equally be applied to other networks, such as CDMA networks and fixed IMS networks, whose media resource application architecture and service flow are basically identical to those of the WCDMA IMS network described above; likewise, the media resource application architecture and service flow of WCDMA and CDMA circuit softswitch networks are basically identical to those of the fixed softswitch network. In short, the present invention can be applied to any case in which a media resource device is controlled through the H.248 protocol to implement the TTS conversion function.
In the following, the method provided by the present invention for implementing the TTS conversion function through the H.248 protocol is described with reference to the accompanying drawings, taking its application to WCDMA IMS as an example.
Because the present invention only concerns the processing between the media resource control device 3 and the media resource processing device 4 shown in Fig. 1, and the other processes are identical to those in an existing WCDMA IMS network, for simplicity only the processing between the media resource control device 3 and the media resource processing device 4 is described.
Fig. 3 shows the flow of the control and processing of media resources performed by the media resource control device 3 and the media resource processing device 4.
Step 1: the media resource control device 3 sends an indication to the media resource processing device 4 to perform TTS conversion.
Specifically, the media resource control device 3 carries an extension package parameter in an H.248 message by defining an H.248 protocol extension package, thereby instructing the media resource processing device 4 to perform the TTS conversion. The H.248 protocol package is defined as follows:
Package Name: TTS Package
PackageID: ttsp (0x?)
Description: omitted here; see the explanation in the scheme that follows
Version: 1
Extends: none
1. Properties:
None
2. Events:
See the definitions in the subsequent "Events" section.
3. Signals:
See the definitions in the subsequent "Signals" section.
4. Statistics:
None
5. Procedures:
Described in the flow below.
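For illustration only, the following Python sketch assembles the text-encoded H.248 Signals descriptor for the "Play TTS File" signal of this package and wraps it in a MODIFY transaction. The package, signal, and parameter mnemonics (ttsp, ptf, tf, lt, ge) follow the definitions given later in this description; the message header address, transaction and context identifiers, termination name, and parameter values are invented for the example and are not part of the package definition.

# Illustrative sketch only: mnemonics follow the ttsp package defined in this
# description; the address, context, termination, and values are hypothetical.
def build_signals_descriptor(package, signal, params):
    rendered = ",".join(
        '{}="{}"'.format(k, v) if isinstance(v, str) else "{}={}".format(k, v)
        for k, v in params.items())
    return "Signals{%s/%s{%s}}" % (package, signal, rendered)

def build_modify(termination, context_id, descriptor):
    # Text encoding of an H.248 transaction carrying the Signals descriptor.
    return ("MEGACO/1 [192.0.2.1]:2944\n"
            "Transaction = 1 {\n"
            "  Context = %d {\n"
            "    Modify = %s { %s }\n"
            "  }\n"
            "}" % (context_id, termination, descriptor))

if __name__ == "__main__":
    descriptor = build_signals_descriptor(
        "ttsp", "ptf",
        {"tf": "http://huawei/welcome.txt",  # TTS file name and location
         "lt": "en-US",                      # pronunciation language (RFC 3066)
         "ge": "female"})                    # speaker gender
    print(build_modify("rtp/1", 1000, descriptor))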
In step 1, the information about the text string can be carried in the parameters of the H.248 message in several ways:
(1) Carrying the text string itself in the parameter of the H.248 message
The text string is simply embedded in the message as a character string; the functional entities that process the H.248 protocol do not parse its content. After receiving this parameter, the media resource processing device 4 can directly extract the text string and hand it to the TTS converter for processing.
(2) Carrying the text string file identifier and storage location information in the H.248 message parameter
The text string can be stored in advance on the media resource processing device 4 or on an external server; the H.248 message then carries the identifier and storage location information of the text string file.
The identifier of the text string file can be any character string that complies with file naming conventions.
The storage location information of the text string file can take three forms:
I. a locally accessible file, for example welcome.txt;
II. a file accessed via file://, for example file://huawei/welcome.txt;
III. a file accessed via http://, for example http://huawei/welcome.txt.
After receiving this parameter, the media resource processing device reads the text file from the remote server or from local storage according to the storage location of the text string file, places it in a cache, and then invokes the TTS converter for processing.
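The following Python sketch illustrates one way the media resource processing device could resolve the three storage forms above and cache the text before invoking the TTS converter; the function name, the cache policy, and the use of urllib are implementation assumptions rather than part of the H.248 package.

# Illustrative sketch: resolve a text-string file reference (local name,
# file:// URL, or http:// URL), read it, and keep it in a simple cache.
import time
import urllib.request

_cache = {}  # reference -> (text, fetch time)

def fetch_text(reference, cache_seconds=60):
    cached = _cache.get(reference)
    if cached and time.time() - cached[1] < cache_seconds:
        return cached[0]
    if reference.startswith(("http://", "https://", "file://")):
        with urllib.request.urlopen(reference) as resp:
            text = resp.read().decode("utf-8")
    else:
        # plain name: a locally accessible file such as welcome.txt
        with open(reference, encoding="utf-8") as f:
            text = f.read()
    _cache[reference] = (text, time.time())
    return text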
(3) Carrying a text string and a text file simultaneously in the H.248 message parameter, and executing the combination of the text string and the file
The text file identifier and the text string are combined into one continuous text string, and a special keyword is added in front of the file identifier to indicate that a pronunciation text file is being imported, rather than the file name itself being converted, for example:
<importtextfile http://huawei/welcome.txt>
Do you want to play a game?
After receiving the combined command of the pronounceable text string and the text string file, the media resource processing device 4 first performs pre-processing: it reads the text string file from the external server or locally, concatenates it with the pronounceable text string carried in the message into one string, places the result in a cache, and then invokes the TTS converter for processing.
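A minimal sketch of this pre-processing step: it expands <importtextfile ...> markers by fetching the referenced file and concatenating it with the in-message text. The regular expression and function names are illustrative assumptions; the fetch callable can be a helper such as the fetch_text() sketched earlier.

# Illustrative sketch: expand <importtextfile URL> markers by fetching the
# referenced file and concatenating it with the in-message text.
import re

IMPORT_TEXT = re.compile(r"<importtextfile\s+(\S+)>")

def expand_text_imports(combined, fetch):
    # fetch(reference) -> str, e.g. the fetch_text() helper sketched earlier.
    parts, last = [], 0
    for match in IMPORT_TEXT.finditer(combined):
        parts.append(combined[last:match.start()])
        parts.append(fetch(match.group(1)))   # read the file locally or remotely
        last = match.end()
    parts.append(combined[last:])
    return "".join(parts)                     # one continuous pronounceable string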
(4) Indicating that a text string or text file is to be TTS-converted and then combined with a recorded fragment into another voice segment
A special keyword is added in front of the voice file identifier to indicate that a voice file is being imported, rather than the file name itself being converted, for example:
<importaudiofile http://huawei/welcome.g711>
Do you want to play a game?
After receiving the combined command of TTS-converted speech and recorded file, the media resource processing device 4 first performs pre-processing: it reads the file from the remote server or locally and places it in a cache; it then invokes the TTS converter to process the text string, and combines the output speech of the TTS conversion with the voice file into one speech segment.
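The sketch below illustrates the final combination step, assuming that the TTS converter and the recorded file both yield raw audio in the same format (for example G.711 byte streams); the synthesize() and read_audio() callables are hypothetical interfaces, not part of the H.248 package.

# Illustrative sketch: synthesize the text part and splice it with the
# imported recording into one speech segment (same codec/format assumed).
import re

IMPORT_AUDIO = re.compile(r"<importaudiofile\s+(\S+)>")

def build_speech_segment(combined, synthesize, read_audio):
    # synthesize(text) -> bytes of speech; read_audio(reference) -> bytes of the
    # recorded file; both are supplied by the caller (hypothetical interfaces).
    pieces, last = [], 0
    for match in IMPORT_AUDIO.finditer(combined):
        text = combined[last:match.start()].strip()
        if text:
            pieces.append(synthesize(text))        # TTS output for the text part
        pieces.append(read_audio(match.group(1)))  # the imported recording
        last = match.end()
    tail = combined[last:].strip()
    if tail:
        pieces.append(synthesize(tail))
    return b"".join(pieces)                        # one combined speech segment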
In addition, in step 1 the H.248 message can further carry parameters for the attributes of the speech output by the TTS conversion (a sketch of applying these parameters follows the list). When instructing the media resource processing device to perform the TTS conversion, the pronunciation-related parameters that can be carried are:
(1) Pronunciation language
Different language categories can be used, following the definitions of RFC 3066.
(2) Speaker gender
A male, female, or neutral voice;
(3) Speaker age
A child, adult, or elderly voice;
(4) Speech rate
The speech rate can be faster or slower than normal, expressed as a percentage; -20% means 20% slower than normal.
(5) Volume
The volume can be higher or lower than normal, expressed as a percentage; -20% means 20% lower than normal.
(6) Pitch
The pitch can be higher or lower than normal, expressed as a percentage; -20% means 20% lower than normal.
(7) Pronunciation of special characters
Specifies how special words in the text string are pronounced; for example, "2005/10/01" is pronounced "October 1st, 2005".
(8) Whether to pause, the pause duration, and the pause position
The purpose of pausing is to match natural speaking habits; the pause duration is a time value greater than 0, and the pause position can take several values: pause after each sentence, or pause after each paragraph.
(9) Whether to apply stress, the stress level, and the stress position
The stress level can be high, medium, or low; the stress position can take several values: stress only at the beginning of the full text, at the beginning of every sentence, at the beginning of every paragraph, and so on.
(10) Whether to pre-fetch the text file
If pre-fetching is indicated, then after receiving the command the device reads the file from the remote server into a local cache; otherwise the file is read only when the command is executed;
(11) File cache duration
How long the cached copy remains valid after the file has been read into local storage.
(12) Whether to abort the TTS conversion when the user inputs DTMF or voice
When the TTS conversion runs at the same time as automatic speech/DTMF recognition, the TTS conversion can be aborted during the conversion process if the user inputs DTMF digits or voice.
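As a purely illustrative sketch of applying these parameters, the following maps the pronunciation-related mnemonics of the package defined below (lt, ge, ag, sp, vo, to, dbi, vbi) onto a generic TTS engine call; the engine object and its synthesize() signature are hypothetical.

# Illustrative sketch: translate the H.248 pronunciation parameters into a
# voice/prosody description for a (hypothetical) TTS engine.
def apply_pronunciation_params(engine, text, params):
    voice = {
        "language": params.get("lt", "en-US"),    # RFC 3066 language tag
        "gender":   params.get("ge", "neutral"),  # male / female / neutral
        "age":      params.get("ag", "adult"),    # child / adult / elderly
    }
    prosody = {
        # percentages relative to the normal value; -20 means 20% slower/lower
        "rate":   params.get("sp", 0),
        "volume": params.get("vo", 0),
        "pitch":  params.get("to", 0),
    }
    barge_in = {"dtmf": params.get("dbi", False),
                "voice": params.get("vbi", False)}
    return engine.synthesize(text, voice=voice, prosody=prosody, barge_in=barge_in)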
Step 2: after receiving the indication from the media resource control device, the media resource processing device acknowledges the indication, feeds the acknowledgement back to the media resource control device, invokes the TTS converter to perform the TTS conversion, and plays the converted speech to the user.
Specifically, the H.248 protocol package defines:
Signals, including: (1) a signal indicating playing of a TTS file; (2) a signal indicating playing of a TTS string; (3) a signal indicating playing of a TTS string, TTS file, and speech segment; (4) a signal indicating setting of stress; (5) a signal indicating setting of pauses; and (6) a signal indicating special words. These signals are described as follows:
(1) Play TTS File, used to indicate that the TTS function is to be performed.
Signal Name: Play TTS File
SignalID: ptf (0x?)
Description: perform the TTS function on a text string file
SignalType: BR
Duration: Not Applicable
Its Additional Parameters include:
I.
Parameter Name: TTS File
Parameter ID: tf (0x?)
Description: TTS file name and storage location
Type: String
Optional: No
Possible Values: a legal file identifier and storage format
Default: none
II.
Parameter Name: Language Type
Parameter ID: lt (0x?)
Description: language type
Type: String
Optional: No
Possible Values: compliant with RFC 3066
Default: none
III.
Parameter Name: Gender
Parameter ID: ge (0x?)
Description: speaker gender
Type: String
Optional: No
Possible Values: male, female, neutral
Default: none
IV.
Parameter Name: Age
Parameter ID: ag (0x?)
Description: speaker age
Type: String
Optional: No
Possible Values: child, adult, elderly
Default: none
V.
Parameter Name: Speed
Parameter ID: sp (0x?)
Description: speech rate
Type: Integer
Optional: Yes
Possible Values: from -100% to 100%
Default: none
VI.
Parameter Name: Volume
Parameter ID: vo (0x?)
Description: volume
Type: Integer
Optional: Yes
Possible Values: from -100% to 100%
Default: none
VII.
Parameter Name: Tone
Parameter ID: to (0x?)
Description: pitch
Type: Integer
Optional: Yes
Possible Values: from -100% to 100%
Default: none
VII.
Parameter Name: Prefetch
Parameter ID: pf (0x?)
Description: pre-fetch the text string file
Type: Enum
Optional: Yes
Possible Values: yes, no
Default: yes
VIII.
Parameter Name: Cache Time
Parameter ID: ct (0x?)
Description: file cache duration
Type: Integer
Optional: Yes
Possible Values: greater than 0 seconds
Default: none
IX.
Parameter Name: DTMF Barge-in
Parameter ID: dbi (0x?)
Description: abort the TTS conversion when the user inputs DTMF
Type: Enum
Optional: Yes
Possible Values: yes, no
Default: none
X.
Parameter Name: Voice Barge-in
Parameter ID: vbi (0x?)
Description: abort the TTS conversion when the user inputs voice
Type: Integer
Optional: Yes
Possible Values: greater than 0 seconds
Default: none
(2) Play TTS String, used to indicate that the TTS function is to be performed on a text string.
Signal Name: Play TTS String
SignalID: pts (0x?)
Description: indicates that the TTS function is performed on a text string
SignalType: BR
Duration: Not Applicable
Its Additional Parameters include:
I.
Parameter Name: TTS String
Parameter ID: ts (0x?)
Description: the text string to be pronounced
Type: String
Optional: No
Possible Values: a pronounceable text string
Default: none
II. The other parameters are identical to parameters II, III, IV, V, VI, IX, and X of the "Play TTS File" signal.
(3) Play TTS string, TTS file, and speech segment
Signal Name: Play Union
SignalID: pu (0x?)
Description: play a combination of a TTS string, a TTS file, and a speech segment file
SignalType: BR
Duration: Not Applicable
Its Additional Parameters include:
I.
Parameter Name: TTS and Speech Segment
Parameter ID: ta (0x?)
Description: the combination of TTS string, TTS file, and speech segment file to play
Type: String
Optional: No
Possible Values: a combination of a TTS string, a TTS file, and a speech segment file
Default: none
II. The other parameters are identical to parameters II, III, IV, V, VI, IX, and X of the "Play TTS File" signal; however, parameters II, III, IV, V, and VI apply only to the TTS conversion process.
(4) Set Accentuation, used to indicate the stress level and position for TTS.
Signal Name: Set Accentuation
SignalID: sa (0x?)
Description: indicates the stress level and position for TTS
SignalType: BR
Duration: Not Applicable
Its Additional Parameters include:
I.
Parameter Name: Accentuation Position
Parameter ID: ap (0x?)
Description: the stress position
Type: String
Optional: Yes
Possible Values: beginning of the full text, beginning of each sentence, beginning of each paragraph
Default: none
II.
Parameter Name: Accentuation Grade
Parameter ID: ag (0x?)
Description: the stress level
Type: String
Optional: Yes
Possible Values: high, medium, low
Default: none
(5) Set Break, used to indicate the pause position and duration for TTS.
Signal Name: Set Break
SignalID: sb (0x?)
Description: indicates the pause position and duration for TTS
SignalType: BR
Duration: Not Applicable
Its Additional Parameters include:
I.
Parameter Name: Break Position
Parameter ID: bp (0x?)
Description: the pause position
Type: String
Optional: No
Possible Values: the end of a sentence, the end of a paragraph
Default: none
II.
Parameter Name: Break Time
Parameter ID: bt (0x?)
Description: the pause duration
Type: Integer
Optional: Yes
Possible Values: greater than 0 milliseconds
Default: none
(6) Special Words, used to indicate how TTS pronounces special words.
Signal Name: Special Words
SignalID: sw (0x?)
Description: indicates how TTS pronounces special words
SignalType: BR
Duration: Not Applicable
Its Additional Parameters include:
I.
Parameter Name: Destination Words
Parameter ID: dw (0x?)
Description: the original words in the text string
Type: String
Optional: Yes
Possible Values: any
Default: none
II.
Parameter Name: Say As
Parameter ID: sa (0x?)
Description: the replacement pronunciation
Type: String
Optional: Yes
Possible Values: any
Default: none
Step 3: the media resource control device 3 instructs the media resource processing device to detect the TTS conversion result.
Step 4: after receiving this indication, the media resource processing device 4 acknowledges it and returns an acknowledgement message.
Step 5: the media resource control device 3 controls the TTS conversion process; this control includes the following operations (a dispatch sketch follows the list):
1. Pause: temporarily stop playing the converted speech to the user;
2. Resume: return from the paused state to the playing state;
3. Fast-forward, with several ways of indicating the position to fast-forward to:
(1) fast-forward a number of words;
(2) fast-forward to the beginning of a following sentence;
(3) fast-forward to the beginning of a following paragraph;
(4) fast-forward a number of seconds;
(5) fast-forward a number of voice units (the voice unit is implementation-defined, for example 10 s).
4. Rewind, with several ways of indicating the position to rewind to:
(1) rewind a number of words;
(2) rewind to the beginning of a preceding sentence;
(3) rewind to the beginning of a preceding paragraph;
(4) rewind a number of seconds;
(5) rewind a number of voice units (the voice unit is implementation-defined, for example 10 s).
5. Restart the TTS conversion;
6. End the TTS conversion: the user aborts it;
7. Repeat, with several ways of indicating the scope of repetition:
(1) repeat the current sentence;
(2) repeat the current paragraph;
(3) repeat the full text;
8. Cancel repetition: cancel the repeated playback described above;
9. Reset the TTS conversion parameters, including the pitch, volume, speech rate, speaker gender, speaker age, stress position, and pause position and duration described above.
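As a sketch of how the media resource processing device might dispatch a subset of these controls, the following maps some of the control-signal mnemonics defined below onto hypothetical player operations; the player object and its methods are assumptions for illustration.

# Illustrative sketch: dispatch a received TTS control signal (a subset of the
# package's control signals) onto a hypothetical playback controller.
def handle_tts_control(player, signal_id, params):
    if signal_id == "tp":                                # TTS Pause
        player.pause()
    elif signal_id == "tjw":                             # TTS Jump Words
        player.jump_words(int(params.get("js", 0)))      # negative = backward
    elif signal_id == "tjs":                             # TTS Jump Sentences
        player.jump_sentences(int(params.get("js", 0)))
    elif signal_id == "tjp":                             # TTS Jump Paragraphs
        player.jump_paragraphs(int(params.get("js", 0)))
    elif signal_id == "tre":                             # TTS Repeat
        player.repeat(params.get("pos", "current sentence"))
    elif signal_id == "te":                              # TTS End
        player.stop()
    else:
        raise ValueError("unsupported control signal: %s" % signal_id)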
Specifically, the H.248 protocol package defines:
Signals, including TTS Pause, TTS Resume, TTS Jump Words, TTS Jump Sentences, TTS Jump Paragraphs, TTS Jump Seconds, TTS Jump Voice Unit, TTS Restart, TTS End, and TTS Repeat:
(1) TTS Pause, used to indicate that TTS is to be paused.
Signal Name: TTS Pause
SignalID: tp (0x?)
Description: indicates pausing of TTS
SignalType: BR
Duration: Not Applicable
Additional Parameters: none
(2) TTS Resume, used to indicate that TTS is to be resumed from the pause.
Signal Name: TTS Resume
SignalID: tr (0x?)
Description: indicates resuming TTS from the pause
SignalType: BR
Duration: Not Applicable
Additional Parameters: none
(3) TTS Jump Words, used to indicate that several words are to be skipped before continuing.
Signal Name: TTS Jump Words
SignalID: tjw (0x?)
Description: indicates jumping to a certain position and continuing
SignalType: BR
Duration: Not Applicable
Additional Parameters:
I.
Parameter Name: Jump Size
Parameter ID: js (0x?)
Description: the number of words to skip; a positive value skips forward, a negative value skips backward
Type: Integer
Optional: No
Possible Values: any
Default: none
(4) TTS Jump Sentences, used to indicate that several sentences are to be skipped before continuing.
Signal Name: TTS Jump Sentences
SignalID: tjs (0x?)
Description: indicates that playback continues after skipping several sentences
SignalType: BR
Duration: Not Applicable
Additional Parameters include:
I.
Parameter Name: Jump Size
Parameter ID: js (0x?)
Description: the number of sentences to skip; a positive value skips forward, a negative value skips backward
Type: Integer
Optional: No
Possible Values: any
Default: none
(5) TTS Jump Paragraphs, used to indicate that several paragraphs are to be skipped before continuing.
Signal Name: TTS Jump Paragraphs
SignalID: tjp (0x?)
Description: indicates that playback continues after skipping several paragraphs
SignalType: BR
Duration: Not Applicable
Additional Parameters include:
I.
Parameter Name: Jump Size
Parameter ID: js (0x?)
Description: the number of paragraphs to skip; a positive value skips forward, a negative value skips backward
Type: Integer
Optional: No
Possible Values: any
Default: none
(6) TTS Jump Seconds, used to indicate that several seconds of speech are to be skipped before continuing.
Signal Name: TTS Jump Seconds
SignalID: tjs (0x?)
Description: indicates that playback continues after skipping several seconds of speech
SignalType: BR
Duration: Not Applicable
Additional Parameters include:
I.
Parameter Name: Jump Size
Parameter ID: js (0x?)
Description: the number of seconds to skip; a positive value skips forward, a negative value skips backward
Type: Integer
Optional: No
Possible Values: any
Default: none
(7) TTS Jump Voice Unit, used to indicate that several voice units are to be skipped before continuing.
Signal Name: TTS Jump Voice Unit
SignalID: tjvu (0x?)
Description: indicates that playback continues after skipping several voice units; the size of a voice unit is implementation-defined
SignalType: BR
Duration: Not Applicable
Additional Parameters include:
I.
Parameter Name: Jump Size
Parameter ID: js (0x?)
Description: the number of voice units to skip; a positive value skips forward, a negative value skips backward
Type: Integer
Optional: No
Possible Values: any
Default: none
(8) TTS Restart
Signal Name: TTS Restart
SignalID: tr (0x?)
Description: restarts TTS
SignalType: BR
Duration: Not Applicable
Additional Parameters: none
(9) TTS End
Signal Name: TTS End
SignalID: te (0x?)
Description: ends TTS
SignalType: BR
Duration: Not Applicable
Additional Parameters: none
(10) TTS Repeat, indicates repetition of a certain section of the TTS text.
Signal Name: TTS Repeat
SignalID: tre (0x?)
Description: repeats a certain section of the TTS text
SignalType: BR
Duration: Not Applicable
Additional Parameters include:
I.
Parameter Name: Repeat Position
Parameter ID: pos (0x?)
Description: the position to repeat
Type: String
Optional: No
Possible Values: the current sentence, the current paragraph, the entire content
Default: none
Optional: Yes
Possible Values: greater than 0 seconds
Step 6: after receiving this indication, the media resource processing device 4 acknowledges it and returns an acknowledgement message.
Step 7: the media resource processing device 4 feeds back to the media resource control device 3 the events detected during the TTS conversion process, such as normal completion and timeout.
The events detected during the TTS conversion process include: the error codes under abnormal conditions and the parameters describing the result when the conversion completes normally.
1. Error codes for the execution of the TTS conversion function
If an abnormality occurs while the media resource processing device performs the TTS conversion, it returns a specific error code to the media resource control device. The specific values of the error codes are assigned uniformly by a standards organization; they include:
(1) a word or phrase that cannot be recognized;
(2) an unpronounceable word;
(3) the text string file does not exist;
(4) an error reading the text string file;
(5) a parameter is not supported or is erroneous;
(6) a TTS conversion control is not supported or is erroneous;
(7) a hardware error of the media resource processing device;
(8) a software error of the media resource processing device;
(9) other errors.
2. Parameters describing the result returned after the TTS conversion completes normally
When the TTS conversion completes normally, the following information can be returned:
(1) the TTS conversion process completed normally;
(2) user input triggered abortion of the TTS conversion: the user pressed an abort key, the user input DTMF, or the user input voice;
(3) statistical information: the duration of the TTS-converted speech played to the user.
The specifics are as follows:
Events:
(1) TTS Failure
Event Name: TTS Failure
EventID: ttsfail (0x?)
Description: the TTS conversion failed; an error code is returned
EventDescriptor Parameters: none
ObservedEventDescriptor Parameters include:
I.
Parameter Name: Error Return Code
Parameter ID: erc (0x?)
Description: the error code parameter
Type: Integer
Optional: No
Possible Values: the error codes defined in the scheme above
Default: none
(2) TTS Success
Event Name: TTS Success
EventID: ttssuss (0x?)
Description: the TTS conversion completed successfully; the result is returned
EventDescriptor Parameters: none
ObservedEventDescriptor Parameters include:
I.
Parameter Name: End Cause
Parameter ID: ec (0x?)
Description: the reason that triggered the end of the TTS conversion
Type: Integer
Optional: Yes
Possible Values: conversion completed, user input DTMF, user input voice
Default: none
II.
Parameter Name: TTS Time
Parameter ID: tt (0x?)
Description: the duration of the TTS conversion
Type: Integer
Optional: Yes
Possible Values: greater than 0 seconds
Default: none
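For illustration, the sketch below formats a text-encoded H.248 NOTIFY that reports either the TTS Success or the TTS Failure event of this package; the event and parameter mnemonics (ttsp/ttssuss with ec and tt, ttsp/ttsfail with erc) follow the definitions above, while the termination name, request identifier, timestamp, address, and exact text encoding are illustrative assumptions.

# Illustrative sketch: build the ObservedEvents content and a NOTIFY
# transaction reporting the outcome of a TTS conversion.
def build_tts_event(success, end_cause="convert", seconds=0, error_code=0):
    if success:
        return 'ttsp/ttssuss{ec="%s",tt=%d}' % (end_cause, seconds)
    return "ttsp/ttsfail{erc=%d}" % error_code

def build_notify(termination, request_id, event, timestamp="20051021T12000000"):
    return ("MEGACO/1 [192.0.2.2]:2944\n"
            "Transaction = 2 {\n"
            "  Context = 1000 {\n"
            "    Notify = %s { ObservedEvents = %d { %s:%s } }\n"
            "  }\n"
            "}" % (termination, request_id, timestamp, event))

if __name__ == "__main__":
    print(build_notify("rtp/1", 5001,
                       build_tts_event(True, end_cause="user input DTMF", seconds=12)))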
Step 8: the media resource control device 3 feeds an acknowledgement message back to the media resource processing device 4, and the TTS conversion ends.
With the method provided by the invention, service applications related to TTS conversion can be provided to users in the media resource applications of mobile or fixed networks, for example converting the content of a web page into speech and reading it to the user. At the same time, only the text needs to be modified when changes are made, without re-recording, and more personalized prompt tones can be played according to the user's requirements.
It is to be understood that the present invention is not limited to the above embodiments; those skilled in the art can make corresponding changes or modifications on the basis of an understanding of the present invention. For example, the media resource control device 3 may send the indications of step 1 and step 3 to the media resource processing device 4 at the same time, and the media resource processing device 4 may perform the operations of step 2 and step 4 at the same time.

Claims (18)

1. A method for implementing a text-to-speech conversion function, characterized in that a media resource control device controls, through the H.248 protocol, a media resource processing device to implement text-to-speech (TTS) conversion, the method comprising the following steps:
Step 1: the media resource processing device receives an H.248 message carrying a TTS conversion indication sent by the media resource control device, the H.248 message carrying an extension package parameter defined by an H.248 protocol extension package that instructs the media resource processing device to perform the TTS conversion processing corresponding to the parameter; and
Step 2: the media resource processing device invokes a TTS converter according to the parameter in the message to perform the TTS conversion processing, and feeds the TTS conversion result back to the media resource control device;
wherein the extension package parameter carries information about a text string, and the media resource processing device invokes the TTS converter to perform the TTS conversion according to the text string information;
and wherein the text string information is the text string itself, embedded in the H.248 message as a character string to be pronounced; after receiving the text string, the media resource processing device directly extracts it and invokes the TTS converter to perform the TTS conversion.
2. The method of claim 1, characterized in that, when the text string is stored in advance on the media resource processing device or on an external server, the text string information comprises the identifier and storage location information of the text string file; after receiving them, the media resource processing device reads the text string from local storage or from the external server according to the storage location information, places it in a cache, and invokes the TTS converter to perform the TTS conversion.
3. The method of claim 1, characterized in that the text string information comprises a text string and a text file of another text string, the text file comprising the identifier and storage location information of the other text string; the identifier of the text file and the text string are combined into one continuous text string, and a keyword is added before the file identifier to indicate that the combination is text to be pronounced; after receiving the combination, the media resource processing device first reads the text string from local storage or from the external server, concatenates it with the pronounceable text string carried in the H.248 message, places the result in a cache, and then invokes the TTS converter to perform the TTS conversion.
4. The method of claim 1, characterized in that the text string information comprises a combination of a text string and a recorded audio file, with a keyword added before the text string to indicate that the combination is a voice file; after receiving the combination, the media resource processing device first invokes the TTS converter to perform the TTS conversion on the text string, and then combines the speech output of the conversion with the recorded audio file into one speech segment.
5. The method of claim 1, characterized in that the text string information comprises a combination of a text file and a recorded audio file, the text file comprising the identifier and storage location information of a text string, with a keyword added before the identifier to indicate that the combination is a voice file; after receiving the combination, the media resource processing device first reads the text string from local storage or from the external server according to the storage location information and places it in a cache, then invokes the TTS converter to perform the TTS conversion on the retrieved text string, and combines the speech output of the conversion with the recorded audio file into one speech segment.
6. The method of claim 1, characterized in that the H.248 message further carries parameters related to the attributes of the speech output by the TTS conversion, the parameters comprising: pronunciation language, speaker gender, speaker age, and pauses; after receiving these parameters, the media resource processing device invokes the TTS converter and sets the corresponding attributes for the output speech.
7. The method of claim 6, characterized in that the H.248 message further carries parameters related to the attributes of the speech output by the TTS conversion, the parameters comprising at least one of: speech rate, volume, pitch, pronunciation of special characters, stress, and whether to abort the TTS conversion when the user provides input; after receiving these parameters, the media resource processing device invokes the TTS converter and sets the corresponding attributes for the output speech.
8. The method of any one of claims 1 to 7, characterized in that, while the media resource processing device invokes the TTS converter to perform the TTS conversion in step 2, the method further comprises:
Step 21: the media resource control device instructs the media resource processing device to detect abnormal events occurring during the speech recognition process.
9. The method of claim 8, characterized in that, when an abnormal event is detected, the media resource processing device feeds back an error code representing the abnormal event to the media resource control device.
10. The method of claim 8, characterized in that, while the media resource processing device invokes the TTS converter to perform the TTS conversion in step 2, the method further comprises:
Step 22: the media resource control device controls the TTS conversion process.
11. The method of claim 10, characterized in that the control of the TTS conversion process by the media resource control device comprises temporarily stopping the playing of the TTS-converted speech to the user.
12. The method of claim 11, characterized in that the control of the TTS conversion process by the media resource control device further comprises resuming the playing state from the paused state.
13. The method of claim 10, characterized in that the control of the TTS conversion process by the media resource control device comprises fast-forwarding or rewinding the playback, wherein the fast-forward comprises fast-forwarding a number of words, sentences, or paragraphs, or fast-forwarding a number of seconds, and the rewind comprises rewinding a number of words, sentences, or paragraphs, or rewinding a number of seconds.
14. The method of claim 10, characterized in that the control of the TTS conversion process by the media resource control device comprises restarting the TTS conversion.
15. The method of claim 10, characterized in that the control of the TTS conversion process by the media resource control device comprises the user aborting the TTS conversion.
16. The method of claim 10, characterized in that the control of the TTS conversion process by the media resource control device comprises repeating the playback of the current sentence, the current paragraph, or the full text.
17. The method of claim 16, characterized in that the control of the TTS conversion process by the media resource control device further comprises cancelling the repeated playback of the current sentence, the current paragraph, or the full text.
18. The method of claim 10, characterized in that the control of the TTS conversion process by the media resource control device comprises stopping the playing of the TTS-converted speech to the user.
CN2007101530700A 2005-10-21 2005-10-21 Method for implementing Text-To-Speech function Expired - Fee Related CN101222542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101530700A CN101222542B (en) 2005-10-21 2005-10-21 Method for implementing Text-To-Speech function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101530700A CN101222542B (en) 2005-10-21 2005-10-21 Method for implementing Text-To-Speech function

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101142778A Division CN100487788C (en) 2005-10-21 2005-10-21 A method to realize the function of text-to-speech convert

Publications (2)

Publication Number Publication Date
CN101222542A CN101222542A (en) 2008-07-16
CN101222542B true CN101222542B (en) 2011-09-14

Family

ID=39632106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101530700A Expired - Fee Related CN101222542B (en) 2005-10-21 2005-10-21 Method for implementing Text-To-Speech function

Country Status (1)

Country Link
CN (1) CN101222542B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013187610A1 (en) * 2012-06-15 2013-12-19 Samsung Electronics Co., Ltd. Terminal apparatus and control method thereof
US10429823B2 (en) * 2014-12-29 2019-10-01 Abb Schweiz Ag Method for identifying a sequence of events associated with a condition in a process plant
CN105827516B (en) * 2016-05-09 2019-06-21 腾讯科技(深圳)有限公司 Message treatment method and device
CN107770382A (en) * 2017-10-30 2018-03-06 江西博瑞彤芸科技有限公司 The method for playing text information

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1516846A (en) * 2002-04-11 2004-07-28 株式会社Ntt都科摩 Service providing system and service providing method
CN1575574A (en) * 2000-12-28 2005-02-02 英特尔公司 Enhanced media gateway control protocol

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1575574A (en) * 2000-12-28 2005-02-02 英特尔公司 Enhanced media gateway control protocol
CN1516846A (en) * 2002-04-11 2004-07-28 株式会社Ntt都科摩 Service providing system and service providing method

Also Published As

Publication number Publication date
CN101222542A (en) 2008-07-16

Similar Documents

Publication Publication Date Title
CN100487788C (en) A method to realize the function of text-to-speech convert
US7194071B2 (en) Enhanced media gateway control protocol
US7657563B2 (en) System, method and storage medium for providing a multimedia contents service based on user&#39;s preferences
JP3936718B2 (en) System and method for accessing Internet content
TWI249729B (en) Voice browser dialog enabler for a communication system
CN1145927C (en) Speech interface for simultaneous use of facility and application
JP2003520983A5 (en)
EP1311102A1 (en) Streaming audio under voice control
CN101322408B (en) Triggerless interactive television
CA2537741A1 (en) Dynamic video generation in interactive voice response systems
EP2273754A2 (en) A conversational portal for providing conversational browsing and multimedia broadcast on demand
US6724864B1 (en) Active prompts
US8005199B2 (en) Intelligent media stream recovery
CN1329739A (en) Voice control of a user interface to service applications
CN101222542B (en) Method for implementing Text-To-Speech function
CN109243450A (en) A kind of audio recognition method and system of interactive mode
CN100426377C (en) A method for speech recognition
CN111629110A (en) Voice interaction method and voice interaction system
CN1953447B (en) A method for processing media resource
US20230130386A1 (en) Audio Assistance During Trick Play Operations
CN113114860B (en) Web-based audio and video response system and use method thereof
KR20090075334A (en) The technology and method for interactive voice and video response platform and services on 3g cdma networks
CN101222541A (en) Method for implementing speech recognition function
JP4082249B2 (en) Content distribution system
AU2003274048A1 (en) Text-to-speech streaming via a network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110914

Termination date: 20121021