CN102088505B

CN102088505B - Voice message leaving and conveying system and method

Info

Publication number: CN102088505B
Application number: CN200910247193XA
Authority: CN
Inventors: 郭志忠; 简世杰; 邱中人; 张信常
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 2009-12-02
Filing date: 2009-12-02
Publication date: 2013-11-27
Anticipated expiration: 2029-12-02
Also published as: CN102088505A

Abstract

The invention discloses a system and a method for leaving and conveying voice messages. The system automatically parses the input speech of at least one message leaver, extracts multiple items of information, and conveys the message to at least one message receiver according to the conveying conditions desired by the message leaver. A command or message parser parses, from the input speech, multiple items of information that comprise at least the identity of the at least one message leaver, at least one message transmitting command, and at least one message speech. The multiple items of information are output to a message composer, which composes a transmitted message speech. A transmitting controller controls a device switch according to the identity of the at least one message leaver and the at least one message transmitting command, so that the transmitted message speech is sent to the at least one message receiver through one of at least one message transmitting device.

Description

System and method for leaving and conveying voice messages

Technical field

The present invention relates to a system and method for leaving and conveying voice messages (leaving and transmitting speech messages).

Background art

Leaving and conveying messages is a common activity in daily life. Common forms include written notes, e-mail, telephone messages, and voice recorders; in these uses, the message leaver and the recipient are usually different people. Another form, such as a calendar or an electronic calendar, is used mainly for self-reminders, that is, leaving a message for oneself. In either case, the message content is usually not acted on immediately. The recipient may therefore forget what the message said or, because of where he or she is, may fail to receive it in time. A better solution is thus to improve the effectiveness of message leaving and conveying by delivering the message to the recipient at a suitable time and through a suitable channel.

Message leaving and conveying can also be applied to proactive care or home care, for example reminding elderly family members to take their medicine on time or urging schoolchildren to finish their homework. It is especially useful on mobile household robots: combined with the robot's mobility, a voice message can be conveyed to family members in a more suitable manner, which serves the purpose of proactive care.

There are many prior-art techniques for leaving and conveying voice messages. For example, U.S. Patent No. US6324261 discloses a hardware architecture for recording and playing announcements that operates together with a sensor. It does not parse or recompose the message; it is operated through hardware buttons and does not play messages proactively. U.S. Patent No. 7327834 discloses a message conveying system with a two-way communication function, whose operation requires the user to explicitly define items such as the receiver, the date and time, the event information, and the way the message is to be transmitted.

U.S. Patent No. 7394405 discloses a location-based notification system, "System For Providing Location-Based Notifications". As shown in the example of Fig. 1, in a vehicle 102 equipped with this system, the user must enter header information 104, which defines the notification type, expiration date, and importance, together with a speech recording 106. A location detection device, such as GPS, determines the current geographic location of the device on which the notification message is entered. When that location comes within a threshold range 108 of the location 110 at which the notification message is to be conveyed, the notification message is conveyed.

Chinese Patent Application No. 2006101242963 discloses a smart-home voice recording and reminder system based on speech recognition. As shown in the example of Fig. 2, the system comprises three parts: a speech receiving module 210, a system control module 220, and a speech output module 230. According to predefined rules, the system recognizes the speech signal uttered by the user, determines whether it is control speech or information speech, and then personalizes the speech data and conveys it to the user. Functions such as automatic message recording, diary keeping, and appointment reminders are thus controlled directly by voice. Two control utterances are defined for operation, "start message" and "end message"; whatever is spoken between them is the information speech.

Taiwan Patent No. I242977 discloses a voice calendar system. As shown in the example of Fig. 3, the voice calendar system 300 comprises an Internet server 311, a computer-telephony integration server 312, and a speech synthesis server 313, all connected to a telecommunication network 31. The system handles message transfer between the Internet and the communication network. The Internet server 311 is connected to the Internet 32 and handles communication between Internet users 34 and the system 300, such as sending and receiving e-mail. Such an e-mail contains a calendar event, which comprises a notification message and a set time. The notification message may be message text or a prerecorded voice file; message text is converted into a voice file by speech synthesis, and the voice file is played over the communication network 33. The computer-telephony integration server 312 is connected to the communication network 33 and handles call responses between telecommunication network users 35 and the system 300.

Taken together, the above and other prior-art documents show two main approaches. In the most common one, the user enters the message and the information to be conveyed according to predefined rules, including the receiver, the date and time, the event information, and the way the message is to be transmitted. In the next most common one, speech recognition is used to enter voice information, again according to predefined rules.

Summary of the invention

Embodiments of the present invention can provide a system and method for leaving and conveying voice messages.

In one embodiment, a system for leaving and conveying voice messages is disclosed. The system comprises a command or message parser, a transmitting controller, a message composer, and at least one message transmitting device; the command or message parser is connected to both the transmitting controller and the message composer. From the input speech of at least one message leaver (reminder), the command or message parser parses out multiple items of information (tag information), comprising at least the identity of the at least one message leaver (reminder ID), at least one message transmitting command (transmitted command), and at least one message speech (speech message). The message composer composes a transmitted message speech from these items of information. The transmitting controller controls a device switch according to the identity of the at least one message leaver and the at least one message transmitting command, so that the transmitted message speech is sent, via one of the at least one message transmitting device, to at least one message receiver.

In another embodiment, a method for leaving and conveying voice messages is disclosed. The method comprises: parsing multiple items of information from the input speech of at least one message leaver, the multiple items of information comprising at least the identity of the at least one message leaver, at least one message transmitting command, and at least one message speech; composing a transmitted message speech from the multiple items of information; and controlling a device switch according to the identity of the at least one message leaver and the at least one message transmitting command, so that the transmitted message speech is sent, via one of at least one message transmitting device, to at least one message receiver.

The above and other advantages of the present invention are described in detail below with reference to the accompanying drawings, the detailed description of the embodiments, and the claims.

Brief description of the drawings

Fig. 1 is a schematic diagram of an example of a location-based notification system.

Fig. 2 is a schematic diagram of an example of a smart-home voice recording and reminder system based on speech recognition.

Fig. 3 is a schematic diagram of an example of a voice calendar system.

Fig. 4 is a schematic diagram of an example of a system for leaving and conveying voice messages, consistent with certain disclosed embodiments.

Fig. 5 uses a working example to illustrate the two operating stages, message leaving and conveying, consistent with certain disclosed embodiments.

Figs. 6A to 6D show several examples of conveying and feedback operation, consistent with certain disclosed embodiments.

Fig. 7 is a schematic diagram illustrating the structure of the command or message parser, consistent with certain disclosed embodiments.

Figs. 8A to 8C show three example architectures for implementing the speech content extractor, consistent with certain disclosed embodiments.

Fig. 9 is a schematic diagram of an example of the data structure of mix-type text, consistent with certain disclosed embodiments.

Fig. 10 shows an example architecture of the text content analyzer, consistent with certain disclosed embodiments.

Fig. 11 is a schematic diagram that uses an example of mix-type text to illustrate how the concept sequence restructure module restructures and analyzes the content of mix-type text, consistent with certain disclosed embodiments.

Fig. 12 is a schematic diagram illustrating how the concept sequence selection module computes concept scores for concept sequences, consistent with certain disclosed embodiments.

Figs. 13A to 13C are schematic diagrams of several examples of the outputs and inputs of the confirmation interface, consistent with certain disclosed embodiments.

Fig. 14 uses a working example to illustrate the operation of the transmitting controller, consistent with certain disclosed embodiments.

Fig. 15 continues the example of Fig. 14 and illustrates the operation of the transmitting controller when the conveying condition is not met, consistent with certain disclosed embodiments.

Fig. 16 is a schematic diagram of an example of the message composer, consistent with certain disclosed embodiments.

Fig. 17 is a schematic diagram illustrating the operation of the message composer when the conveying condition is not met and conveying cannot be completed in the manner specified by the message, consistent with certain disclosed embodiments.

Fig. 18 is a schematic diagram of an example in which the message composer composes a sentence after multiple message leavers have input voice messages to be conveyed to a single recipient, consistent with certain disclosed embodiments.

Fig. 19 is an example flowchart illustrating the method for leaving and conveying voice messages, consistent with certain disclosed embodiments.

Description of reference numerals

102 vehicle
104 header information
106 speech recording
108 threshold range
110 location at which the notification message is conveyed
210 speech receiving module
220 system control module
230 speech output module
300 voice calendar system
311 Internet server
312 computer-telephony integration server
313 speech synthesis server
31 telecommunication network
32 Internet
33 communication network
34 Internet user
35 telecommunication network user
400 message leaving and conveying system
402 message leaver
404 input speech
410 command or message parser
412 message leaver identity
414 message transmitting command
416 message speech
420 transmitting controller
430 message composer
432 transmitted message speech
432a transmitted message
432b feedback message
440 message transmitting device
450 device switch
512 mother
514 message speech
516 multiple items of information
522 timer
532 microphone
534 image capture device
536 fingerprint detection device
538 RFID tag
540 father
542 mobile phone
552 equipment switch
710 speech content extractor
712 mix-type text
720 text content analyzer
730 confirmation interface
812 speaker identification module
814 speech recognition module
816 confidence measure module
818 speaker speech database
822 acoustic model selection
824 selected acoustic model
826 speaker-specific acoustic model
828 acoustic model plus adaptation parameters
830 speaker-dependent speech recognition module
834 speech recognition vocabulary
836 grammar
838 maximum likelihood score
842 search space
846 speaker-specific acoustic model
848 speaker-specific acoustic model plus adaptation parameters
1010 concept sequence restructure module
1012 concept composer grammar
1014 example concept sequence corpus
1016 concept sequences
1018 confidence values
1020 concept sequence selection module
1022 n-gram concept scores
1024 "message or garbage" grammar
1026 best concept sequence composed of semantic frames
1110 mix-type text example
1112 concept composer grammar example
1114 example concept sequence corpus example
1116 concept sequences and confidence values
1118 concept sequence and confidence value example
1210 concept sequences and corresponding total scores, example
1218 best concept sequence and corresponding total score, example
1220 concept table
1310 semantic frame
1410 message database
1420 voice message record
1430 sensing device
1432 video camera
1434 radio-frequency identification device
1436 timer device
1520 feedback message
1540 other transmitting device
1610 language generator
1620 language generation synthesis template database
1622 synthesis template database example
1630 speech synthesizer
1632 transmitted message speech
1720 language generation synthesis template database
1722, 1724 synthesis templates for feedback messages
1742, 1744 feedback messages
1812, 1814, 1816 three voice message records
1842 transmitted message speech example
1910 parse and output multiple items of information from the input speech of at least one message leaver, the multiple items of information comprising at least the identity of the at least one message leaver, at least one message transmitting command, and at least one message speech
1920 compose a transmitted message speech from the multiple items of information
1930 control a device switch according to the identity of the at least one message leaver and the at least one message transmitting command, so that the transmitted message speech is sent, via one of at least one message transmitting device, to at least one message receiver

Detailed description of the embodiments

Embodiments of the present invention can provide a system and method for leaving and conveying voice messages. In these embodiments, the message leaver speaks the message to the system in continuous natural language. The system automatically parses the message speech and extracts multiple items of information, such as the recipient, the time, and the event information. It then conveys the voice message to the recipient according to the conditions the message leaver wants, such as a specified time range and conveying method.

Fig. 4 is a schematic diagram of an example of a system for leaving and conveying voice messages, consistent with certain disclosed embodiments. In the example of Fig. 4, the message leaving and conveying system 400 comprises a command or message parser 410, a transmitting controller 420, a message composer 430, and at least one message transmitting device 440. The command or message parser 410 is connected to both the transmitting controller 420 and the message composer 430.

Order or message parser 410, from least one writer's 402 input voice 404, dissect out multinomial information, comprise that at least at least one writer's identity 412, at least one message transmit an order 414 and at least one voice mail message voice 416.This multinomial information is output to message synthesizer 430, to synthesize the voice 432 that get the message across.

The transmitting controller 420 controls a device switch 450 according to the message leaver identity 412 and the message transmitting command 414, so that the transmitted message speech 432 is sent to a message receiver via one of the at least one message transmitting device 440, for example one of message transmitting devices 1 to 3. If the transmitted message speech 432 is a transmitted message 432a, it is sent to the recipient 442; if it is a feedback message 432b, it is fed back to the message leaver 402.

When the command or message parser 410 recognizes the input speech 404 of the at least one message leaver 402, it can identify the message leaver identity 412. Over the whole input speech segment, the parser 410 can use a given grammar and speech confidence measures to pick out command word segments and phonetic filler segments, that is, filler segments carrying phonetic transcriptions. It then divides the filler segments into message filler segments and garbage filler segments. From the command word segments, the parser 410 can identify the various message transmitting commands 414. Based on the message filler segments, it can extract at least one message speech 416 from the input speech 404.

The operation of the message leaving and conveying system 400 can be divided into two stages: message leaving and conveying. Fig. 5 illustrates these two stages with a working example, consistent with certain embodiments of the present invention.

In the message-leaving stage, the message leaver speaks a message to the system 400. In the example of Fig. 5, the mother 512 inputs message speech 514: "It's time to take out the garbage; remember to tell Father before 6 p.m." The message speech 514 is received by the command or message parser 410, which parses multiple items of information 516 from it. These items include: (a) the message leaver identity (denoted Who), here "mother"; (b) the recipient identity (denoted Whom), here "father"; (c) the voice message that the message leaver wants to leave for the recipient (denoted What, hereinafter "the voice message"), here "It's time to take out the garbage"; (d) when the voice message is to be conveyed to the recipient (denoted When), here "before 6 p.m."; and (e) the conveying method by which the voice message is conveyed to the recipient (denoted How), here "broadcast", which is a system default. Items (d) and (e) are optional; the system can automatically assign predefined values to optional items. Over the whole input speech segment, Who, Whom, When, and How are the recognized command word segments, and What, the voice message, is the recognized message filler segment.
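The five items Who, Whom, What, When, and How can be pictured as a small record. The following is a minimal sketch, not the patent's data format: the class name `TagInformation`, the field names, the constant `DEFAULT_HOW`, and the use of plain strings in place of audio are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed system default for the optional How item, following the Fig. 5 example.
DEFAULT_HOW = "broadcast"


@dataclass
class TagInformation:
    """Hypothetical container for the items parsed from one message speech."""
    who: str                    # (a) identity of the message leaver
    whom: str                   # (b) identity of the recipient
    what: str                   # (c) the voice message (text stands in for audio)
    when: Optional[str] = None  # (d) optional conveying time condition
    how: str = DEFAULT_HOW      # (e) optional conveying method, defaulted by the system


# The Fig. 5 example: How is not spoken, so the default is filled in.
info = TagInformation(
    who="mother",
    whom="father",
    what="It's time to take out the garbage",
    when="before 6 p.m.",
)
```

Giving the optional items default values in the record mirrors the statement that the system assigns predefined values when the message leaver does not state them.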

After the command or message parser 410 has parsed the message into the multiple items of information 516, these items are passed to the transmitting controller 420, which completes the message-leaving stage. Before passing them on, the parser 410 may first perform a confirmation action to ensure their accuracy, for example by reporting the items back and requesting an acknowledgement.

In the conveying stage, after the transmitting controller 420 receives the multiple items of information 516 from the command or message parser 410, it first determines whether the conditions of items (b) and (d) above are met. In the example, this means determining whether the message content can be conveyed to "father" "before 6 p.m." by "broadcast". Whom (father) and When (before 6 p.m.) are the two conditions that the transmitting controller 420 must satisfy first; once both are met, the voice message is conveyed by How (broadcast). Whether the two conditions are met can be determined by internal sensing devices or by a control circuit connected to external sensing devices.

In the above example, the sensing device for the time condition is, for example, a timer 522, which can be used to determine whether the conveying time condition "before 6 p.m." is met. Sensing devices that can detect whether a person is the recipient "father" include, for example, a microphone 532, an image capture device 534, a fingerprint detection device 536, and an RFID tag 538. The microphone 532 can sense surrounding speech; the image capture device 534 can capture surrounding images; the user can actively press the fingerprint detection device 536 so that the system captures the user's fingerprint; and the user can carry an RFID tag 538 that lets the system identify him or her. These sensed data can be used to determine whether the person is "father". The transmitting controller 420 can therefore learn, through internal sensing devices or a control circuit connected to external sensing devices, whether the Whom and When conveying conditions are met.

When the transmitting controller 420 learns that the conveying conditions are met, that is, the recipient "father" has been detected and the time is "before 6 p.m.", it sends the information Who (mother), Whom (father), and What (the mother's message speech, "It's time to take out the garbage") to the message composer 430. It also controls a device switch 450 according to the How condition (broadcast), for example by turning on a corresponding equipment switch 552, so that the transmitted message speech 432 composed by the message composer 430 can be conveyed to the recipient, "father" 540, via a corresponding one of the at least one message transmitting device 440, for example a mobile phone 542.
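The two-condition check described above (Whom detected, When not yet passed) can be sketched as follows. This is only an illustration under stated assumptions: the function name `conveying_conditions_met` is hypothetical, the sensing result is reduced to a single detected name, and "before 6 p.m." is modeled as a strict comparison against 18:00.

```python
from datetime import time
from typing import Optional


def conveying_conditions_met(detected_person: Optional[str], now: time,
                             whom: str, deadline: time) -> bool:
    """Return True when the Whom and When conditions both hold.

    detected_person: name reported by a sensing device, or None if nobody is sensed.
    now:             current time reported by a timer.
    whom:            recipient named in the message.
    deadline:        latest time at which the message may be conveyed.
    """
    recipient_present = detected_person == whom   # Whom condition
    in_time = now < deadline                      # When condition ("before" the deadline)
    return recipient_present and in_time


# Father is sensed at 17:30 and the message must be conveyed before 18:00.
ready = conveying_conditions_met("father", time(17, 30), "father", time(18, 0))
```

Only when this check succeeds would the controller hand Who, Whom, and What to the message composer and open the device selected by How.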

In the above example, after the message composer 430 receives Who (mother), Whom (father), and What ("It's time to take out the garbage"), it can select one of several synthesis templates to compose the message speech. One possible transmitted message speech 432 composed by the message composer 430 is: "Father, here is a message from Mother for you: it's time to take out the garbage." This synthesized speech is played through the equipment switch 552 turned on by the transmitting controller 420, via the corresponding message transmitting device, for example the mobile phone 542. Because the transmitting controller 420 has detected the recipient (father), the recipient receives the mother's voice message, which completes the conveying stage.
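Template-based composition of this kind can be sketched with ordinary string formatting. The template wordings, the list name `TEMPLATES`, and the function `compose_message` below are assumptions for illustration; in the described system the composed sentence would be passed on to speech synthesis rather than returned as text.

```python
# Hypothetical synthesis templates; the placeholders are Who, Whom, and What.
TEMPLATES = [
    "{whom}, here is a message from {who} for you: {what}",
    "{whom}, {who} asked me to tell you: {what}",
]


def compose_message(who: str, whom: str, what: str, template_index: int = 0) -> str:
    """Fill the selected template with the parsed items."""
    return TEMPLATES[template_index].format(who=who, whom=whom, what=what)


sentence = compose_message("Mother", "Father", "it's time to take out the garbage")
```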

Besides the single-leaver, single-recipient operation described above, the message leaving and conveying of the present invention can also be applied to one-way or multi-party conveying and feedback. An example with a single message leaver and multiple recipients is the mother leaving the following voice message for all family members: "Wake everyone up at 6 tomorrow morning"; here the recipients (Whom) are all family members. Figs. 6A to 6D show several examples of conveying and feedback operation, consistent with certain disclosed embodiments. Fig. 6A is a one-to-one conveying example, in which a single message leaver inputs a voice message that is conveyed to a single recipient. Fig. 6B is a many-to-one conveying example, in which multiple message leavers input voice messages that are conveyed to a single recipient. Fig. 6C is a one-to-many conveying example, in which a single message leaver inputs a voice message that is conveyed to multiple recipients. Fig. 6D is a one-to-one conveying and feedback example, in which, after a single message leaver inputs a voice message, the transmitted message speech is a feedback message and is therefore fed back directly to that message leaver.

The structure and operation of each module of the message leaving and conveying system 400 are described in detail below.

Fig. 7 is a schematic diagram illustrating the structure of the command or message parser, consistent with certain disclosed embodiments. Referring to the example of Fig. 7, the command or message parser 410 comprises a speech content extractor 710 and a text content analyzer 720. The speech content extractor 710 receives the input speech 404 of the message leaver 402 and extracts from it the message leaver identity 412, the message speech 416, and text in which the words corresponding to the input speech are mixed with phonetic transcriptions (mix-type text) 712.

After the mix-type text 712 is passed to the text content analyzer 720, the analyzer extracts from it the message transmitting commands 414 such as Whom, When, and How described above (When and How may be optional). The message leaver identity 412, the message speech 416, and the extracted message transmitting commands 414 can be passed to the transmitting controller 420 either directly or after confirmation, for conveying control. The confirmation action is optional; it verifies the accuracy of the information being passed, for example by requesting an acknowledgement through a confirmation interface 730.

The speech content extractor 710 disclosed here can be implemented with many architectures. For example, as shown in Fig. 8A, it can be implemented with a speaker identification module 812, an automatic speech recognition (ASR) module 814, and a confidence measure (CM) module 816. The speaker identification module 812 and the speech recognition module 814 each receive the input speech 404 of the message leaver. Module 812 compares the input speech 404 with the data in a pre-trained speaker speech database 818 and finds the entry closest to the input speech 404, thereby identifying the message leaver identity 412. Module 814 recognizes the input speech 404 to produce the mix-type text 712. The confidence measure module 816 then verifies the input speech against the mix-type text 712 to produce a confidence value for each item of mix-type text, from which the message speech 416 is extracted.
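The "find the closest entry in the speaker speech database" step can be illustrated with a nearest-match search. The sketch below is an assumption-laden toy: the text does not say which features or distance measure are used, so fixed-length feature vectors and cosine similarity are stand-ins, and the names `SPEAKER_DB`, `cosine_similarity`, and `identify_speaker` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(features, database):
    """Return the enrolled speaker whose stored vector is closest to `features`."""
    return max(database, key=lambda name: cosine_similarity(features, database[name]))


# Hypothetical pre-trained database: one illustrative feature vector per speaker.
SPEAKER_DB = {
    "mother": [0.9, 0.1, 0.3],
    "father": [0.2, 0.8, 0.5],
}

speaker = identify_speaker([0.8, 0.2, 0.3], SPEAKER_DB)
```

A real speaker identification module would work on acoustic features and statistical speaker models; only the closest-match selection is shown here.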

The example difference of the example of Fig. 8 B and Fig. 8 A is, language person identification module 812 first carries out language person identification to writer's input voice 404, the language person who identifies is except direct output, also can be used to select the corresponding acoustic model of this language person or acoustic model adds the adjustment parameter, for example carrying out acoustic model selects in 822, from the corresponding acoustic model of language person (acoustic model) 826 or acoustic model, add and adjust parameter (adaptation parameters) 828, pick out acoustic model 824, to offer follow-up voice identification module 814, use, allow the speech recognition rate improve.

The example of Fig. 8 C is to use the voice identification module (Speaker-dependentASR) 830 that a language person is correlated with to process with confidence value measurement module 816.Wherein, the voice identification module 830 that the language person is correlated with, in the search space (Search Space) the 842nd of carrying out speech recognition and using, adds by acoustic model 846 corresponding to the language person of speech recognition vocabulary 834, the syntax 836 and training in advance or acoustic model database institute construction such as adjusting parameter 848 and forms.Then, in search space 842, find out the path of (the maximum likelihood score) 838 that have the maximum similarity mark, path 838 be can follow and corresponding mixed state word 712 and corresponding writer gone to obtain, mother for example, again by confidence value measurement module 816, voice and the mixed state word 712 of leaving a message verified, to produce the mixed corresponding confidence value of state word 712, and then captures message voice 416.

Fig. 9 is a schematic diagram of an example of the data structure of mix-type text, consistent with certain disclosed embodiments. In the example of Fig. 9, the data structure can contain eight kinds of tag information. _Date_ represents a date, such as Monday, January, or the 1st. _Time_ represents a time, such as one o'clock, ten minutes, or ten seconds. _Cmd_ represents a command, such as tell, say, remind, or notify. _Whom_ represents the recipient, such as father, mother, or elder brother. _How_ represents the conveying method, such as making a phone call, sending mail, or broadcasting. In _F/S_, F stands for a function word, meaning a word that carries no meaning of its own, such as "remember" or "help me"; S stands for a stop word. Stop words fall into two classes. The first class consists of words that are common in web search and that search engines ignore to gain speed. The second class consists of words without meaning of their own, including modal particles, adverbs, prepositions, and conjunctions. In the disclosed examples, S refers to the second class, such as "in a while", "but", "wait a moment", or "roughly". _Filler_ represents a filler, such as a basic syllable, a phone, or a filler word. _Y/N_ represents a confirmation word, such as "yes", "right", "no", or "wrong"; confirmation words are the responses given after the command or message parser 410 performs the confirmation action.
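One way to picture this tagged structure is as a list of (tag, content) pairs. The sketch below is an assumption for illustration only: the names `Tag`, `TaggedSegment`, and `segments_with` are hypothetical, and the actual layout of Fig. 9 is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tag(Enum):
    """The eight kinds of tag information listed for mix-type text."""
    DATE = "_Date_"
    TIME = "_Time_"
    CMD = "_Cmd_"
    WHOM = "_Whom_"
    HOW = "_How_"
    FS = "_F/S_"
    FILLER = "_Filler_"
    YN = "_Y/N_"


@dataclass
class TaggedSegment:
    tag: Tag
    content: str  # recognized word(s), or syllable symbols for a filler


def segments_with(segments: List[TaggedSegment], tag: Tag) -> List[str]:
    """Return the contents of all segments carrying the given tag."""
    return [s.content for s in segments if s.tag is tag]


# A short, made-up mix-type text: function word, recipient, command, then a filler.
text = [
    TaggedSegment(Tag.FS, "remember"),
    TaggedSegment(Tag.WHOM, "father"),
    TaggedSegment(Tag.CMD, "tell"),
    TaggedSegment(Tag.FILLER, "S1 S2 S3 S4"),
]
recipients = segments_with(text, Tag.WHOM)
```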

The text content analyzer 720 analyzes the mixed-type text 712 from the speech content extractor 710. The analysis may be based on online or offline training. Using the collected corpus and grammar, it deletes from the mixed-type text the words that are unnecessary for the message leaving and transmitting application, and restructures the text into concept sequences composed of semantic frames. As shown in the example of Fig. 10, the text content analyzer 720 may include a concept sequence restructure module 1010 and a concept sequence selection module 1020.

The concept sequence restructure module 1010 uses the concept composer grammar 1012, the example concept sequence corpus 1014, and the message or garbage grammar ("Message or Garbage" Grammar) 1024 to restructure the mixed-type text extracted by the speech content extractor 710. It produces all concept sequences 1016 that match the example concept sequences and computes the confidence values 1018 of all concepts in the restructured concept sequences. The resulting concept sequences 1016 and confidence values 1018 are sent to the concept sequence selection module 1020. The concept sequence selection module 1020 uses the n-gram concept scores 1022 to select an optimal concept sequence 1026 composed of semantic frames. The optimal concept sequence 1026 and its corresponding confidence value may then be sent to the confirmation interface 730.

Fig. 11 is an example schematic that uses an example of mixed-type text to illustrate how the concept sequence restructure module 1010 restructures and analyzes the content of the mixed-type text, consistent with certain disclosed embodiments. In the example of Fig. 11, the content of the mixed-type text example 1110 from the speech content extractor 710 is "_Filler_ S1 S2 S3 S4 S5 _F/S_ remember _F/S_ _When_ before 6 p.m. _F/S_ with _Whom_ father _Cmd_ tell _Filler_ S8 S9 S10 S11 (take out the garbage)". The concept sequence restructure module 1010 uses example 1112 of the concept composer grammar 1012 and example 1114 of the example concept sequence corpus 1014 to restructure the text, producing a plurality of concept sequences that match the example concept sequences together with their computed confidence values, as shown at label 1116. The symbol <Del*n> denotes performing n deletions on an example in the example concept sequence corpus. For example, the mixed-type text 1110, by way of grammar example 1112 and the entry "(1.5) _Filler_ _When_ _Whom_" in corpus example 1114, is restructured with 4 deletion operations to produce the concept sequence indicated by arrow 1118, namely "(1.5 Del*5) _Filler_ S1 S2 S3 S4 S5 _When_ before 6 p.m. _Whom_ father". The other restructuring operation on the example concept sequence corpus is <Ins*n>, which denotes performing n insertions. Therefore, when the speech content extractor 710 makes recognition errors, the concept composer grammar 1012 and the example concept sequence corpus 1014 can still be used to obtain the same concept sequence as when no recognition error occurs, unaffected by partially misrecognized words or phonetic symbols.

After producing all concept sequences that match the example concept sequences, the concept sequence restructure module 1010 computes the confidence value corresponding to each of these concept sequences. An example formula for this confidence value is as follows.

Score1(edit)

= ∑ log(P(edit | concept does not belong to _Filler_)) + ∑ log(P(edit | _Filler_ belongs to message)) + ∑ log(P(edit | _Filler_ belongs to garbage))

Taking the concept sequence indicated by label 1118 as an example, its confidence value is calculated as follows:

Confidence value

= (-0.756) + (-0.756) + (-0.756) + (-0.309) + (-0.790) = -3.367
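The sum above can be reproduced with a short sketch. It assumes that the five terms are the per-edit log-probabilities that contribute to Score1; the function name `edit_confidence` is hypothetical.

```python
def edit_confidence(log_probs):
    """Score1: sum of the log-probabilities of the edit operations."""
    return sum(log_probs)


# The five log-probability terms quoted for concept sequence 1118.
terms = [-0.756, -0.756, -0.756, -0.309, -0.790]
score1 = edit_confidence(terms)
print(round(score1, 3))  # -3.367
```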

After the concept sequences and their confidence values are sent to the concept sequence selection module 1020, and continuing the above example, Fig. 12 illustrates how the concept sequence selection module computes concept scores for these concept sequences, consistent with certain disclosed embodiments. In Fig. 12, the concept sequence selection module 1020 may, for example, use the n-gram concept scores 1022 together with the message or garbage grammar information to compute the concept scores of these concept sequences. Taking the aforementioned concept sequence "_Filler_ S1 S2 S3 S4 S5 _When_ before 6 p.m. _Whom_ father" as an example, its n-gram concept score is calculated as follows:

Score2(n-gram concept)

= log(P(_Filler_ | null)) + log(P(_When_ | _Filler_, null)) + log(P(_Whom_ | _When_, _Filler_, null))

= log(0.78) + log(0.89) + log(0.98) = -2.015

As shown in concept table 1220, in the concept sequence "_Filler_ S1 S2 S3 S4 S5 _When_ before 6 p.m. _Whom_ father", the concept What is "S1 S2 S3 S4 S5", with a score of 0.78; the concept Whom is "father", with a score of 0.89; and the concept When is "before 6 p.m.", with a score of 0.98.

With these concept sequences and their corresponding concept scores, the total score of each concept sequence can then be computed from its confidence value and its concept score. An example computation of the total score is as follows:

Total score = w1 × Score1(edit) + w2 × Score2(n-gram concept), where w1 + w2 = 1, w1 ≥ 0, and w2 ≥ 0. Taking concept sequence 1118 as an example, its total score is, for example, 0.5 × (-3.367) + 0.5 × (-2.015) = -2.736. With these concept sequences and their corresponding total scores, as in example 1210, the concept sequence selection module 1020 can select at least one optimal concept sequence composed of semantic frames and send it to the confirmation interface 730. The optimal concept sequence is, for example, the one indicated by arrow 1218, which has the highest total score, -2.736.
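The weighted combination and the selection of the best candidate can be sketched as follows. This is an illustration under stated assumptions: the candidate scores below are made-up values rather than figures from the disclosure, and the names `total_score` and `select_best` are hypothetical.

```python
def total_score(score1, score2, w1=0.5):
    """Weighted sum of the edit confidence and the n-gram concept score.

    The weights satisfy w1 + w2 = 1 with w1 >= 0 and w2 >= 0.
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1
    return w1 * score1 + w2 * score2


def select_best(candidates, w1=0.5):
    """Return the candidate concept sequence with the highest total score."""
    return max(candidates,
               key=lambda c: total_score(c["score1"], c["score2"], w1))


# Illustrative values only.
candidates = [
    {"sequence": "_Filler_ _When_ _Whom_", "score1": -3.0, "score2": -2.0},
    {"sequence": "_When_ _Whom_", "score1": -5.0, "score2": -2.5},
]
best = select_best(candidates)
print(best["sequence"], total_score(best["score1"], best["score2"]))
```

Because both scores are sums of log-probabilities, they are negative, and the highest total score is the one closest to zero.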

The confirmation interface 730 confirms whether the semantics obtained by the text content analyzer 720 are unclear (semantic not clear), whether the semantics contain a conflict, and whether the semantics satisfy the requirements of message leaving and transmitting. Figs. 13A to 13C are example schematics of several outputs and inputs of the confirmation interface when these requirements are not met, consistent with certain disclosed embodiments. As shown in the example of Fig. 13A, if the semantics of the semantic frame 1310 received by the confirmation interface 730 are unclear or conflicting, for example when the confidence value lies between a high threshold and a low threshold, the confirmation interface 730 may request a response message 1310 and then supplement the semantics according to the response message 1310 received. Unclear semantics are, for example, semantics that lack a necessary concept: "before 6 p.m. (When), notify father (Whom)" lacks the necessary concept What, namely the message itself. Conflicting semantics are, for example, semantics in which a concept is repeated with different content: in the previous dialogue record the concept When is "before 6 p.m.", but in the current dialogue record the concept When is "before 6:30 p.m.", so the repeated concept When appears with two different values.

After the semantics have been supplemented, for example when they satisfy the message leaving and transmitting conditions (semantic clear), the confirmation interface 730 may, as shown in the example of Fig. 13B, perform a further confirmation 1320 to confirm that the message content is complete and correct. If the confirmation receives a positive response, the confirmation interface 730 records the message information, such as the message leaver identity 412, the message transmission command 414, and the message speech 416, and sends it to the transmission controller 420. If the confirmation receives a negative response, the confirmation interface 730 may, for example, request that the message speech be input again.
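The checks performed by the confirmation interface can be sketched as a single function. This is a simplified illustration: the set of required concepts, the threshold values, and the function name `check_frame` are all assumptions made for the example and are not specified in the disclosure.

```python
# Assumption: these are the concepts treated as necessary in this sketch.
REQUIRED = ("Whom", "When", "What")


def check_frame(frame, previous=None, confidence=1.0, low=0.3, high=0.8):
    """Return a list of issues that would trigger a follow-up question.

    frame      -- concepts from the current dialogue turn, e.g. {"Whom": "father"}
    previous   -- concepts recorded in the previous dialogue turn, if any
    confidence -- confidence value of the frame; low/high are hypothetical thresholds
    """
    issues = []
    # Unclear semantics: a necessary concept is missing.
    for concept in REQUIRED:
        if not frame.get(concept):
            issues.append(f"missing:{concept}")
    # Conflicting semantics: a repeated concept has different content.
    if previous:
        for concept, value in frame.items():
            if concept in previous and previous[concept] != value:
                issues.append(f"conflict:{concept}")
    # Confidence between the low and high thresholds also calls for confirmation.
    if low <= confidence < high:
        issues.append("low_confidence")
    return issues


print(check_frame({"Whom": "father", "When": "before 6 p.m."}))
```

An empty list corresponds to the "semantic clear" case, after which a final confirmation of the complete message can be requested.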

Referring back to the example of Fig. 5, in the transmission stage, after receiving the message and the transmission-related information from the command or message parser 410, the transmission controller 420 first determines whether the transmission conditions are satisfied. The message is then transmitted by the message transmitting device 440. Fig. 14 illustrates the operation of the transmission controller 420 with a working example, consistent with certain disclosed embodiments.

In the example of Fig. 14, the transmission controller 420 may record the message and the transmission-related information sent by the command or message parser 410 in a message database 1410. For example, the transmission controller 420 stores the received message leaver identity "mother (Who)" and the message transmission command, which comprises "father (Whom)", "before 6 o'clock (When)", "broadcast (How)", and "signal 08010530 (What)", as the corresponding message record 1420 in the message database 1410. The sensing devices 1430, such as a camera 1432 or a radio-frequency identification device 1434, confirm whether father has come home. When the timer device 1436 confirms that the transmission condition When (before 6 o'clock) is met, the message leaver identity "mother (Who)", the message recipient "father (Whom)", the message "signal 08010530 (Speech Message)", and related information are sent to the message synthesizer 430, and the corresponding device switch is turned on according to the transmission mode "broadcast (How)".

In a real environment, the transmission condition given in the message leaver's input speech cannot always be satisfied. For example, father may not be home before 6 o'clock, in which case the message cannot be delivered to the message recipient in time. Therefore, as shown in the example of Fig. 15, the transmission controller 420 may, for example, use a system-preset transmission order to set the message transmitting device, so as to avoid the situation in which the message is never delivered to the message recipient. For example, under the preset order of message transmitting devices, when the timer device 1436 confirms that the transmission condition When (before 6 o'clock) is met but the camera 1432 or the radio-frequency identification device 1434 finds that father is not home, the transmission controller 420 feeds back the message record 1420, changes the transmission mode from "broadcast (How)" to the system-preset "voice short message", and turns on the corresponding device switch. The transmitted message speech synthesized by the message synthesizer 430, namely the feedback message 1520, is then transmitted through another, non-broadcast transmitting device (other transmitting device) 1540 in the system-preset "voice short message" mode. The feedback message 1530 may, for example, be fed back to the message leaver or sent to the message recipient "father", to ensure that the transmitted message speech is not missed.

In other words, when the transmission condition is not satisfied and the transmission cannot be completed in the "message-specified" manner, for example when the message cannot be delivered to the message recipient "father" by "broadcast" within the set time, the transmission controller 420 can set the message transmitting device to the "system-preset" transmission mode and transmit through another transmitting device 1540, to ensure that the transmitted message speech is not missed.
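The fallback behavior described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the function name `choose_transmission`, the record layout, and the returned dictionary are hypothetical, and the sensing and timing results are passed in as plain booleans rather than read from real devices.

```python
def choose_transmission(record, time_condition_met, recipient_present,
                        preset_mode="voice short message"):
    """Decide how a stored message record should be transmitted.

    Returns None while the time condition is not yet met, the
    message-specified mode when the recipient is present, and the
    system-preset mode (as a feedback message) otherwise.
    """
    if not time_condition_met:
        return None  # keep waiting; nothing is transmitted yet
    if recipient_present:
        return {"mode": record["How"], "kind": "transmitted message"}
    return {"mode": preset_mode, "kind": "feedback message"}


record = {"Who": "mother", "Whom": "father", "When": "before 6 o'clock",
          "How": "broadcast", "What": "signal 08010530"}

print(choose_transmission(record, True, True))
print(choose_transmission(record, True, False))
```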

After the message synthesizer receives the information 1450 sent by the transmission controller 420, which comprises the message leaver identity (Who), the message recipient (Whom), and the message (What), it reintegrates this information, for example by a language generation technique, to produce a sentence that is faithful, fluent, and elegant. It then converts the generated sentence into the transmitted message speech 432 and passes it to the message transmitting device 440, which sends the transmitted message speech 432 to a message receiver.

Fig. 16 is an example schematic of the message synthesizer, consistent with certain disclosed embodiments. Continuing the example of Fig. 4, the architecture and operation of the message synthesizer 430 are described as follows. The message synthesizer 430 includes at least a language generator 1610 and a speech synthesizer (Speech Synthesis) 1630. The language generator 1610 receives the information 1450 sent by the transmission controller 420, comprising the message leaver identity "mother (Who)", the message recipient "father (Whom)", and the message "signal 08010530 (Speech Message)", and selects a synthesis template from a language generation template (LG Template) database 1620, for example the template database example 1622, to compose a sentence.

For example, when all transmission conditions are satisfied, the language generator 1610 selects the synthesis template "Whom, Who left you the following message, 'What'" and, with the example information 1450, generates the sentence "Father, mother left you the following message, 'What'", which the speech synthesizer 1630 then converts into a speech signal. The speech synthesizer 1630 next concatenates this speech signal with the message (What) "signal 08010530" to produce the transmitted message speech (Transmitted Message) 1632, "Father, mother left you the following message, 'time to take out the garbage'", where "time to take out the garbage" is an example of the content of signal 08010530. The transmitted message speech 1632 is then conveyed by the message transmitting device to the message receiver, for example the message recipient "father (Whom)".
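The template filling and concatenation steps can be sketched as follows. This is an illustration only: the template wording, the function names, and the representation of audio as labeled segments are assumptions; a real system would pass the generated sentence to a text-to-speech engine and concatenate actual waveforms.

```python
# Hypothetical template: the recorded message (What) is appended as audio,
# so only Who and Whom are filled in as text.
TEMPLATE = "{whom}, {who} left you the following message: "


def generate_sentence(who, whom, template=TEMPLATE):
    """Language generation step: fill the selected template."""
    return template.format(who=who, whom=whom)


def compose(sentence, message_signal):
    """Concatenation step, modeled as an ordered list of labeled segments.

    ("tts", text) stands for speech synthesized from text, and
    ("recording", id) stands for the message speech extracted earlier.
    """
    return [("tts", sentence), ("recording", message_signal)]


info = {"Who": "mother", "Whom": "father", "What": "signal 08010530"}
segments = compose(generate_sentence(info["Who"], info["Whom"]), info["What"])
print(segments)
```

Keeping the recorded message as its own segment reflects the point made in the text: the message content is played back as the original speech rather than being recognized and re-synthesized.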

When the transmission condition is not satisfied, for example when the transmission cannot be completed in the "message-specified" manner within the set time, then as shown in the example of Fig. 17 the message synthesizer 430 receives the message record 1420 fed back by the transmission controller 420, selects a feedback message synthesis template 1722 from a language generation template database 1720, and composes a sentence to synthesize a feedback message 1742. If the transmission controller 420 has set the message transmitting device to the "system-preset" transmission mode, for example "voice short message", another feedback message synthesis template 1724 can be selected from the language generation template database 1720 to synthesize a feedback message 1744.

Fig. 18 is an example schematic of how the message synthesizer 430 composes a sentence when several message leavers have input speech messages that are to be transmitted to a single message recipient, consistent with certain disclosed embodiments. Referring to Fig. 18, the message synthesizer 430 receives three parsed message records 1812, 1814, and 1816. The two message leaver identities are "mother" and "younger brother", the message recipient is "father" in all three records, "mother" has two messages, and "younger brother" has one message. The message synthesizer 430 can select a transmitted-message synthesis template from a language generation template database and synthesize the three message records 1812, 1814, and 1816 into one transmitted message speech, for example as shown at label 1842, namely "Father, mother tells you 'message 1-1', and also 'message 1-2'; in addition, younger brother says 'message 2'".
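Merging several records for one recipient can be sketched as a grouping step followed by sentence composition. This is a minimal sketch: the function name `compose_multi`, the record layout, and the exact wording of the output are assumptions made for illustration, and the messages are shown as text placeholders rather than recorded speech.

```python
def compose_multi(records):
    """Combine records addressed to the same recipient into one sentence.

    Messages are grouped by message leaver in order of first appearance.
    """
    by_leaver = {}
    for r in records:
        by_leaver.setdefault(r["Who"], []).append(r["What"])

    whom = records[0]["Whom"]  # assumes all records share one recipient
    parts = []
    for i, (who, messages) in enumerate(by_leaver.items()):
        quoted = ", and also ".join(f'"{m}"' for m in messages)
        lead = "" if i == 0 else "in addition, "
        verb = "tells you" if i == 0 else "says"
        parts.append(f"{lead}{who} {verb} {quoted}")
    return f"{whom}, " + "; ".join(parts)


records = [
    {"Who": "mother", "Whom": "father", "What": "message 1-1"},
    {"Who": "mother", "Whom": "father", "What": "message 1-2"},
    {"Who": "younger brother", "Whom": "father", "What": "message 2"},
]
print(compose_multi(records))
```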

In line with the above, the example flow of Fig. 19 describes the disclosed method for leaving and transmitting speech messages, consistent with certain disclosed embodiments. Referring to Fig. 19, in step 1910, multiple items of information are parsed from the input speech of at least one message leaver and output; the multiple items of information comprise at least one message leaver identity, at least one message transmission command, and at least one message speech. In step 1920, the multiple items of information are synthesized into a transmitted message speech. In step 1930, a device switch module is controlled according to the at least one message leaver identity and the at least one message transmission command, so that the transmitted message speech is sent to at least one message receiver via one of at least one message transmitting device. Before the transmitted message speech is sent, at least one confirmation action may be performed through a confirmation interface to confirm the correctness of the multiple items of information or of the transmitted message speech.

In step 1910, the whole input speech segment may be measured according to a given grammar and a speech confidence measure, to obtain at least one command word segment having high confidence and at least one filler segment containing phonetic symbols; the filler segment may further be classified as a message filler segment or a garbage filler segment. The at least one message transmission command is obtained from the at least one command word segment, and the at least one message speech can be extracted from the input speech according to the message filler segment.

In step 1920, according to the multiple items of information, a synthesis template can be selected from a language generation template database to compose a sentence and thereby synthesize the transmitted message speech. The language generation template database may include, for example, multiple transmitted-message synthesis templates or multiple feedback message synthesis templates.

In step 1930, a suitable message transmitting device is controlled according to the message leaver identity and the message transmission command to transmit the transmitted message speech. For example, when all transmission conditions are satisfied, the "message-specified" manner can be used to transmit the transmitted message speech. When a transmission condition is not satisfied and the transmission cannot be completed in the "message-specified" manner, the message transmitting device can instead be set to the "system-preset" transmission mode and the speech transmitted through another transmitting device, to ensure that the transmitted message speech is not missed.

In summary, the disclosed embodiments can provide a system and method for leaving and transmitting speech messages. In these embodiments, a command or message parser recognizes the input speech of the message leaver to obtain the message leaver identity, and measures the whole input speech segment according to a given grammar and a speech confidence measure to obtain command word segments and filler segments, the filler segments being classified as message filler segments or garbage filler segments. The message transmission commands are obtained from the command word segments, and the message speech is extracted from the input speech according to the message filler segments. A message synthesizer then synthesizes the transmitted message speech, and a suitable message transmitting device is controlled according to the message leaver identity and the message transmission command to transmit the transmitted message speech.

The above describes only embodiments disclosed by the present invention and does not limit the scope of implementation of the invention. All equivalent changes and modifications made according to the claims of the present invention shall remain within the scope covered by this patent.

Claims

1. A system for leaving and transmitting speech messages, the system comprising:

a command or message parser that parses the input speech of at least one message leaver and outputs multiple items of information, the multiple items of information comprising at least one message leaver identity, at least one message transmission command, and at least one message speech;

a message synthesizer, connected to the command or message parser, that synthesizes the multiple items of information into a transmitted message speech;

at least one message transmitting device; and

a transmission controller, connected to the command or message parser, that controls a device switch module according to the at least one message leaver identity and the at least one message transmission command, so that the transmitted message speech is sent to at least one message receiver via one of the at least one message transmitting device,

wherein the command or message parser further comprises a text content analyzer that analyzes a mixed-type text extracted from the input speech, the text content analyzer further comprising:

a concept sequence restructure module that produces a plurality of concept sequences after restructuring the mixed-type text; and

a concept sequence selection module that computes a total score corresponding to each of the plurality of concept sequences and selects therefrom at least one optimal concept sequence composed of semantic frames;

wherein the total score corresponding to each concept sequence is computed according to a confidence value and a concept score corresponding to that concept sequence.

2. The system as claimed in claim 1, wherein the command or message parser identifies the at least one message leaver identity from the input speech of the at least one message leaver and, according to a given grammar and a speech confidence measure, identifies at least one command word segment and at least one filler segment containing phonetic symbols.

3. The system as claimed in claim 2, wherein the at least one filler segment is classified as a message filler segment or a garbage filler segment.

4. The system as claimed in claim 3, wherein the command or message parser identifies the at least one message transmission command from the at least one command word segment and, according to the message filler segment, extracts the at least one message speech from the input speech of the message leaver.

5. The system as claimed in claim 1, wherein the at least one message transmission command comprises the message recipient, when the at least one message speech is to be transmitted to the message recipient, and by which message transmission mode the at least one message speech is to be transmitted to the message recipient.

6. The system as claimed in claim 1, wherein the system is a system providing both single-party or multi-party transmission and feedback.

7. The system as claimed in claim 1, wherein the command or message parser further comprises:

a speech content extractor that receives the input speech of the at least one message leaver and extracts from the input speech the message leaver identity, the mixed-type text, and the information of the message speech, the mixed-type text being text information in which the words corresponding to the input speech are mixed with phonetic symbols; and

wherein the text content analyzer further analyzes the mixed-type text to obtain the at least one message transmission command.

8. The system as claimed in claim 1, wherein the command or message parser further comprises a confirmation interface, the confirmation interface performing a confirmation action to confirm the correctness of the parsed multiple items of information.

9. The system as claimed in claim 1, wherein the message synthesizer selects a synthesis template from a language generation template database to compose a sentence and thereby synthesize the transmitted message speech.

10. The system as claimed in claim 1, wherein, when a transmission condition in the at least one message transmission command is not satisfied and the transmission cannot be completed, the transmission controller sets the at least one message transmitting device to a system-preset transmission mode and transmits the transmitted message speech via another message transmitting device of the at least one message transmitting device.

11. The system as claimed in claim 1, wherein the message synthesizer further comprises:

a language generator that receives the at least one message leaver identity, the at least one message transmission command, and the at least one message speech, selects a synthesis template, and generates a speech signal; and

a speech synthesizer that synthesizes the speech signal and the at least one message speech into the transmitted message speech.

12. The system as claimed in claim 11, wherein the synthesis template is selected from a language generation template database, the language generation template database comprising multiple transmitted-message synthesis templates, multiple feedback message synthesis templates, or both the multiple transmitted-message synthesis templates and the multiple feedback message synthesis templates.

13. A method for leaving and transmitting speech messages, the method comprising:

parsing multiple items of information from the input speech of at least one message leaver, the multiple items of information comprising at least one message leaver identity, at least one message transmission command, and at least one message speech;

synthesizing the multiple items of information into a transmitted message speech;

controlling a device switch module according to the at least one message leaver identity and the at least one message transmission command, so that the transmitted message speech is sent to at least one message receiver via one of at least one message transmitting device; and

analyzing a mixed-type text extracted from the input speech, wherein analyzing the mixed-type text further comprises:

producing a plurality of concept sequences after restructuring the mixed-type text, and computing a confidence value corresponding to each concept sequence; and

computing a concept score of each concept sequence, computing a total score corresponding to each concept sequence according to the confidence value and the concept score corresponding to that concept sequence, and selecting therefrom at least one optimal concept sequence composed of semantic frames.

14. The method as claimed in claim 13, the method further comprising:

identifying the at least one message leaver identity from the input speech of the at least one message leaver;

identifying at least one command word segment and at least one filler segment according to a given grammar and a speech confidence measure; and

obtaining the at least one message transmission command from the at least one command word segment, and obtaining the at least one message speech according to the at least one filler segment.

15. The method as claimed in claim 13, wherein synthesizing the transmitted message speech further comprises:

selecting a synthesis template from a language generation template database according to the at least one message leaver identity, the at least one message transmission command, and the at least one message speech, and generating a speech signal; and

synthesizing the speech signal and the at least one message speech into the transmitted message speech.

16. The method as claimed in claim 13, wherein the at least one message transmission command comprises at least one transmission condition, and when the at least one transmission condition is fully satisfied, a message-specified manner is used to transmit the transmitted message speech.

17. The method as claimed in claim 16, wherein, when a transmission condition of the at least one transmission condition is not satisfied and the transmission fails, a system-preset transmission mode is used to transmit the transmitted message speech.

18. The method as claimed in claim 13, wherein parsing the multiple items of information further comprises:

extracting from the input speech the message leaver identity, the mixed-type text, and the information of the message speech, the mixed-type text being text information in which the words corresponding to the input speech are mixed with phonetic symbols; and

analyzing the mixed-type text to obtain the at least one message transmission command.

19. The method as claimed in claim 13, the method further comprising:

before transmitting the transmitted message speech, performing at least one confirmation action through a confirmation interface to confirm the correctness of the multiple items of information or of the transmitted message speech.

20. The method as claimed in claim 13, wherein the total score corresponding to each concept sequence is the sum of the concept score and the confidence value after each has been weighted.

21. The method as claimed in claim 16, the method further comprising:

determining, by at least one sensing device, whether the at least one transmission condition in the at least one message transmission command is satisfied.

22. The method as claimed in claim 17, wherein the transmitted message speech is a feedback message.