CN1764190B - Distributed speech service - Google Patents

Info

Publication number
CN1764190B
Authority
CN
China
Prior art keywords
server
data
voice
csta
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200510113305
Other languages
Chinese (zh)
Other versions
CN1764190A (en
Inventor
王冠三
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/058,892 (US8396973B2)
Application filed by Microsoft Corp
Publication of CN1764190A
Application granted
Publication of CN1764190B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention relates to establishing a media channel and a signaling channel between a client and a server. The media channel uses a chosen codec and protocol for communication. Through the media channel and signaling channel, an application on the client can utilize speech services on the server.

Description

Distributed speech service
Cross-reference to related application
This application claims priority to U.S. Provisional Patent Application Serial No. 60/621,303, filed October 22, 2004, the content of which is incorporated herein by reference in its entirety.
Background
The present invention relates to methods and systems for defining and handling computer interactions. In particular, the present invention relates to methods and systems for establishing communication protocols between devices in a system, such as a telecommunication system.
Computer Supported Telecommunications Applications (CSTA) is a widely adopted standard suite for global and enterprise communications. In particular, CSTA is a standard that specifies programmatic access to and control of the telecommunication infrastructure. Software can be developed for a wide variety of tasks, ranging from initiating and receiving simple telephone calls to managing large-scale multi-site collaborations via voice and video.
CSTA is standardized in a number of ECMA/ISO standards (ECMA International, Rue du Rhône 114, CH-1204 Geneva, www.ecma-international.org). The core operational model and the semantics of the CSTA objects, services and events are defined in ECMA-269. These CSTA features are defined in an abstract, platform-independent way so that they can be adapted to various programming platforms. In addition, CSTA is accompanied by several standardized programming or protocol syntaxes. Among them, ECMA-323 defines the Extensible Markup Language (XML) binding to CSTA, commonly known as CSTA-XML, and ECMA-348 defines the Web Service Description Language (WSDL) binding. These language bindings, considered part of the CSTA standard suite, ensure maximum interoperability, making CSTA features available to computers running different operating systems through any standard transport protocol, including Transmission Control Protocol (TCP), Session Initiation Protocol (SIP), or Simple Object Access Protocol (SOAP).
Recently, CSTA has seen strong adoption in the area of interactive voice services. This adoption has been advanced by enhanced voice services based on Speech Application Language Tags (SALT), which is further described in the SALT 1.0 specification found at www.saltforum.org. By using SALT, call centers can be further automated to include various speech-related features. However, differences between call control and speech control applications create difficulties in facilitating distributed speech services. Thus, there is a need for establishing protocols that facilitate speech services.
Summary of the invention
The present invention relates to establishing a media channel and a signaling channel between a client and a server. The media channel uses a chosen codec and protocol for communication. Through the media channel and the signaling channel, an application on the client can use speech services on the server.
Brief description of the drawings
Figs. 1-4 show exemplary computing devices for use with the present invention.
Fig. 5 shows an exemplary architecture for distributed speech services.
Fig. 6 shows an exemplary system for implementing distributed speech services.
Fig. 7 shows an exemplary method for establishing channels in a SIP environment.
Fig. 8 shows an exemplary method for establishing channels in a web service environment.
Detailed description
Before describing an architecture for distributed speech services and methods for implementing it, it is useful to describe generally the computing devices that can function in the architecture. Referring now to Fig. 1, an exemplary form of a data management device (PIM, PDA or the like) is illustrated at 30. However, it is contemplated that the present invention can also be practiced using the other computing devices discussed below, in particular those computing devices having limited surface area for input buttons or the like. For example, phones and/or data management devices will also benefit from the present invention. Such devices will have enhanced utility compared to existing portable personal information management devices and other portable electronic devices, and their functions and compact size will more likely encourage the user to carry the device at all times. Accordingly, the scope of the architecture described herein is not intended to be limited by the disclosure of the exemplary data management or PIM device, phone or computer illustrated herein.
An exemplary form of a data management mobile device 30 is shown in Fig. 1. The mobile device 30 includes a housing 32 and has a user interface including a display 34, which uses a contact-sensitive display screen in conjunction with a stylus 33. The stylus 33 is used to press or contact the display 34 at designated coordinates to select a field, to selectively move the starting position of a cursor, or to otherwise provide command information, such as through gestures or handwriting. Alternatively, or in addition, one or more buttons 35 can be included on the device 30 for navigation. Other input mechanisms such as rotatable wheels, rollers or the like can also be provided. However, it should be noted that the invention is not intended to be limited to these forms of input mechanism. For instance, another form of input can include a visual input, such as through computer vision.
Referring now to Fig. 2, a block diagram shows the functional components of the mobile device 30. A central processing unit (CPU) 50 implements the software control functions. CPU 50 is coupled to display 34 so that text and graphic icons generated in accordance with the controlling software appear on the display 34. A speaker 43 is typically coupled to CPU 50 through a digital-to-analog converter 59 to provide audible output. Data that is downloaded or entered by the user into the mobile device 30 is stored in a non-volatile read/write random access memory store 54 that is bi-directionally coupled to the CPU 50.
Random access memory (RAM) 54 provides volatile storage for instructions that are executed by CPU 50, and storage for temporary data, such as register values. Default values for configuration options and other variables are stored in a read-only memory (ROM) 58. ROM 58 can also be used to store the operating system software for the device, which controls the basic functionality of the mobile device 30 and other operating system kernel functions (e.g., the loading of software components into RAM 54).
RAM 54 also serves as storage for code, in a manner analogous to the hard drive on a PC that is used to store application programs. It should be noted that although non-volatile memory is used for storing the code, the code can alternatively be stored in volatile memory that is not used for execution of the code.
Wireless signals can be transmitted and received by the mobile device through a wireless transceiver 52, which is coupled to CPU 50. An optional communication interface 60 can also be provided for downloading data directly from a computer (e.g., a desktop computer) or from a wired network, if desired. Accordingly, interface 60 can comprise various forms of communication devices, for example, an infrared link, a modem, a network card, or the like.
Mobile device 30 includes a microphone 29, an analog-to-digital (A/D) converter 37, and an optional recognition program (speech, DTMF, handwriting, gesture or computer vision) stored in store 54. By way of example, in response to audible information, instructions or commands from a user of device 30, microphone 29 provides speech signals, which are digitized by A/D converter 37. The speech recognition program can perform normalization and/or feature extraction functions on the digitized speech signals to obtain intermediate speech recognition results.
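The client-side preprocessing mentioned above can be illustrated with a minimal sketch. The patent does not specify which normalization or features are used, so the peak normalization and per-frame log energy below are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import math
from typing import List


def normalize(samples: List[int]) -> List[float]:
    """Peak-normalize integer PCM samples into the range [-1.0, 1.0]."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Silence (or empty input): nothing to scale.
        return [0.0] * len(samples)
    return [s / peak for s in samples]


def frame_log_energies(signal: List[float], frame_len: int) -> List[float]:
    """Return one log-energy value per non-overlapping frame.

    Log energy is an assumed stand-in for the unspecified features;
    trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        features.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return features
```

A device could send either the normalized samples or the extracted features to the speech server; which one is sent is a design choice that trades bandwidth against client-side computation.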
Using wireless transceiver 52 or communication interface 60, speech data is transmitted to a remote speech server 204, discussed below and shown in the architecture of Fig. 5. Recognition results are then returned to mobile device 30 for rendering (e.g., visual and/or audible) thereon, and are eventually transmitted to a web server 202 (Fig. 5), wherein the web server 202 and mobile device 30 operate in a client/server relationship.
Similar processing can be used for other forms of input. For example, handwriting input can be digitized with or without preprocessing on device 30. Like the speech data, this form of input can be transmitted to the speech server 204 for recognition, wherein the recognition results are returned to at least one of the device 30 and the web server 202. Likewise, DTMF data, gesture data and visual data can be processed similarly.
Depending on the form of input, device 30 (and the other forms of client discussed below) would include the necessary hardware, such as a camera for visual input.
Fig. 3 is a plan view of an exemplary embodiment of a portable phone 80. The phone 80 includes a display 82 and a keypad 84. Generally, the block diagram of Fig. 2 applies to the phone of Fig. 3, although additional circuitry necessary to perform other functions may be required. For instance, a transceiver necessary to operate as a phone will be required for the embodiment of Fig. 2; however, such circuitry is not pertinent to the present invention.
In addition to the portable or mobile computing devices described above, it should be understood that the present invention can be used with numerous other computing devices, such as a general desktop computer. For instance, the present invention allows a user with limited physical abilities to input or enter text into a computer or other computing device when other conventional input devices, such as a full alphanumeric keyboard, are too difficult to operate.
The invention is also operational with numerous other general purpose or special purpose computing systems, environments or configurations.
Examples of well-known computing systems, environments and/or configurations that may be suitable for use with the invention include, but are not limited to, regular telephones (without any screen), personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, radio frequency identification (RFID) devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The following is a brief description of the general purpose computer 120 shown in Fig. 4. However, the computer 120 is again only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computer 120 be interpreted as having any dependency or requirement relating to any one or combination of the components illustrated therein.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices. Tasks performed by the programs and modules are described below with the aid of the figures. Those skilled in the art can implement the description and figures as computer-executable instructions, which can be written on any form of computer-readable medium.
With reference to Fig. 4, components of computer 120 may include, but are not limited to, a processing unit 140, a system memory 150, and a system bus 141 that couples various system components, including the system memory, to the processing unit 140. The system bus 141 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Universal Serial Bus (USB), Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus (also known as the Mezzanine bus). Computer 120 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computer 120, and include both volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computer 120.
Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared and other wireless media. Combinations of any of the above are also included within the scope of computer-readable media.
The system memory 150 includes computer storage media in the form of volatile and/or non-volatile memory, such as read-only memory (ROM) 151 and random access memory (RAM) 152. A basic input/output system 153 (BIOS), containing the basic routines that help to transfer information between elements within computer 120, such as during start-up, is typically stored in ROM 151. RAM 152 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 140. By way of example, and not limitation, Fig. 4 shows operating system 154, application programs 155, other program modules 156 and program data 157.
The computer 120 may also include other removable/non-removable, volatile/non-volatile computer storage media.
By way of example only, Fig. 4 shows a hard disk drive 161 that reads from or writes to non-removable, non-volatile magnetic media, a magnetic disk drive 171 that reads from or writes to a removable, non-volatile magnetic disk 172, and an optical disk drive 175 that reads from or writes to a removable, non-volatile optical disk 176, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/non-volatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid-state RAM, solid-state ROM, and the like. The hard disk drive 161 is typically connected to the system bus 141 through a non-removable memory interface such as interface 160, and magnetic disk drive 171 and optical disk drive 175 are typically connected to the system bus 141 by a removable memory interface, such as interface 170.
The drives and their associated computer storage media discussed above and shown in Fig. 4 provide storage of computer-readable instructions, data structures, program modules and other data for the computer 120. In Fig. 4, for example, hard disk drive 161 is illustrated as storing operating system 164, application programs 165, other program modules 166 and program data 167. Note that these components can either be the same as or different from operating system 154, application programs 155, other program modules 156 and program data 157. Operating system 164, application programs 165, other program modules 166 and program data 167 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 120 through input devices such as a keyboard 182, a microphone 183, and a pointing device 181, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 140 through a user input interface 180 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or universal serial bus (USB). A monitor 184 or other type of display device is also connected to the system bus 141 via an interface, such as a video interface 185. In addition to the monitor, computers may also include other peripheral output devices, such as speakers 187 and printer 186, which may be connected through an output peripheral interface 188.
The computer 120 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 194. The remote computer 194 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 120. The logical connections depicted in Fig. 4 include a local area network (LAN) 191 and a wide area network (WAN) 193, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 120 is connected to the LAN 191 through a network interface or adapter 190. When used in a WAN networking environment, the computer 120 typically includes a modem 192 or other means for establishing communications over the WAN 193, such as the Internet. The modem 192, which may be internal or external, may be connected to the system bus 141 via the user input interface 180 or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 120, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, Fig. 4 shows remote application programs 195 as residing on remote computer 194. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computers may be used.
Fig. 5 shows an exemplary architecture 200 for distributed speech services as can be embodied in the present invention.
Generally, information stored in a web server 202 can be accessed through mobile device 30 (which herein also represents other forms of computing devices having a display screen, a microphone, a camera, a touch-sensitive panel, etc., as required by the form of input), or through phone 80, wherein information is requested audibly or through tones generated by phone 80 in response to keys depressed, and wherein information from web server 202 is provided back to the user only audibly.
More importantly, architecture 200 is unified in that, whether information is obtained through device 30 or through phone 80 using speech recognition, a single speech server 204 can support either mode of operation. In addition, architecture 200 operates using an extension of well-known markup languages (e.g., HTML, XHTML, cHTML, XML, WML, and the like). Thus, information stored on web server 202 can also be accessed using the well-known GUI methods found in these markup languages. By using an extension of well-known markup languages, authoring on the web server 202 is easier, and existing legacy applications can also be easily modified to include speech recognition.
Generally, device 30 executes HTML+ scripts, or the like, provided by web server 202. When speech recognition is required, by way of example, speech data, which can be digitized audio signals or speech features wherein the audio signals have been preprocessed by device 30 as discussed above, is provided to speech server 204 together with an indication of a grammar or language model to use during speech recognition. The implementation of the speech server 204 can take many forms, one of which is illustrated, but generally includes a recognizer 211. The results of recognition are provided back to device 30 for local rendering if desired or appropriate. Upon compilation of information through recognition and any graphical user interface, if used, device 30 sends the information to web server 202 for further processing and receipt of further HTML scripts, if necessary.
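The exchange described above, in which speech data travels to the speech server together with an indication of the grammar or language model, might be sketched as follows. This is an illustrative sketch only: `RecognitionRequest`, `RecognitionResult`, `SpeechServer` and the injected recognizer callable are hypothetical names, not interfaces defined by the patent or by any real speech server product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecognitionRequest:
    speech_data: bytes   # digitized audio, or features preprocessed on the device
    data_kind: str       # "audio" or "features" (assumed labels)
    grammar_uri: str     # indication of the grammar or language model to use


@dataclass
class RecognitionResult:
    text: str
    confidence: float


class SpeechServer:
    """Hypothetical stand-in for speech server 204 wrapping a recognizer."""

    def __init__(self, recognizer: Callable[[bytes, str], RecognitionResult]):
        # The recognizer (211 in Fig. 5) is injected so it can be replaced
        # or upgraded without touching the client or the web server.
        self._recognizer = recognizer

    def handle(self, request: RecognitionRequest) -> RecognitionResult:
        if request.data_kind not in ("audio", "features"):
            raise ValueError(f"unsupported data kind: {request.data_kind}")
        return self._recognizer(request.speech_data, request.grammar_uri)
```

Keeping the grammar reference inside the request reflects the point made in the text: the server needs no application-specific knowledge beyond what each request names.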
As shown in Fig. 5, device 30, web server 202 and speech server 204 are commonly connected, and separately addressable, through a network 205, herein a wide area network such as the Internet. It is therefore not necessary that any of these devices be physically located adjacent to each other. In particular, it is not necessary that web server 202 include speech server 204. In this manner, authoring at web server 202 can be focused on the intended application without the authors needing to know the intricacies of speech server 204. Rather, speech server 204 can be independently designed and connected to the network 205, and thereby be updated and improved without further changes required at web server 202. In a further embodiment, client 30 can communicate directly with speech server 204, without the need for web server 202. It should further be understood that the web server 202, speech server 204 and client 30 may be combined depending on the capabilities of the implementing machines. For instance, if the client comprises a general purpose computer, such as a personal computer, the client may include the speech server 204. Likewise, if desired, the web server 202 and speech server 204 can be incorporated into a single machine.
Access to web server 202 through phone 80 includes connection of phone 80 to a wired or wireless telephone network 208, which in turn connects phone 80 to a third-party gateway 210. Gateway 210 connects phone 80 to a telephony voice browser 212. Telephony voice browser 212 includes a media server 214 that provides a telephony interface, and a voice browser 216. Like device 30, telephony voice browser 212 receives HTML scripts or the like from web server 202. More importantly, the HTML scripts are similar in form to the HTML scripts provided to device 30. In this manner, web server 202 need not support device 30 and phone 80 separately, or even support standard GUI clients separately. Rather, a common markup language can be used. In addition, as with device 30, audible signals transmitted by phone 80 are provided for speech recognition from voice browser 216 to speech server 204, either through the network 205 or through a dedicated line 207, for example using TCP/IP. Web server 202, speech server 204 and telephony voice browser 212 can be embodied in any suitable computing environment, such as the general purpose desktop computer shown in Fig. 4.
However, it should be noted that if DTMF recognition is employed, this form of recognition would generally be performed at the media server 214, rather than at the speech server 204. In other words, the DTMF grammar would be used by the media server.
Given the devices and architecture described above, the present invention is further described based on a simple client/server environment. As shown in Fig. 6, the present invention pertains to a system 300 comprising a server 302 that provides media services (e.g., speech recognition or text-to-speech synthesis) and a client 304 that executes application-specific code. Communication between the server 302 and client 304 is based on a service model in which the information exchanged can be tagged or can otherwise include identified portions, such as, but not limited to, XML (Extensible Markup Language) documents. The server 302 and/or client 304 can collect and transmit audio in addition to other information. In one embodiment, server 302 can comprise Microsoft Speech Server, developed by Microsoft Corporation of Redmond, Washington, while the client 304 can take any number of the forms discussed above, including but not limited to desktop PCs, mobile devices, etc.
At this point it should be noted that although the server 302 and client 304 communicate with each other based on a service model, an application invoking aspects of the present invention need not be written exclusively according to a service model: declarative and/or procedural applications can be used as long as communication between the server 302 and client 304 is performed in accordance with service model requests. In one embodiment, the client application can be written in C++, Java, C# or other imperative programming languages, which do not require a browser as is the case for the HTML-based applications described with reference to Fig. 5.
An important aspect of CSTA (ECMA-269) Edition 6 is the enhanced voice services based on Speech Application Language Tags (SALT). The newly added features include automatic speech recognition, speech verification, speaker identification, speaker verification and text-to-speech synthesis, all of which can be implemented on system 300. Some or all of these features can be provided in automated call centers. Aspects of the present invention provide a subset of the CSTA services for facilitating network-based speech services. In particular, some aspects of the present invention show how ECMA-348 and uaCSTA (ECMA TR/87) can be applied to facilitate distributed speech services in a web service environment and in a SIP (Session Initiation Protocol) based VoIP (Voice over Internet Protocol) environment, respectively.
The services for Computer Supported Telecommunications Applications (CSTA) are defined by ECMA-269, and their XML and web service protocols are defined by ECMA-323 and ECMA-348, respectively. Recently, ECMA TR/87 (uaCSTA) further described a set of SIP conventions for using ECMA-323 in the VoIP environment.
All these protocols address the full set of CSTA in principle, and hence are applicable to the voice services in particular. In the 6th Edition of ECMA-269, the voice services portion of CSTA was augmented based on technology derived from SALT.
In addition to the existing voice services, the new additions include key technologies that are essential for call center automation and mobile applications, including automatic speech recognition, speech verification, speaker identification, speaker verification and text-to-speech synthesis.
Although tightly integrated CSTA implementations of call control and voice scenarios are desirable to application developers, the core competencies of call control vendors and speech vendors are not necessarily the same. For current deployments and in the foreseeable future, CSTA application developers may need to involve multiple vendors to meet their respective needs in these areas. Fortunately, the CSTA modeling concept, as described in ECMA-269, allows a single application to elicit services from multiple CSTA service providers. It is therefore a valid scenario for a CSTA application to use two implementations of CSTA simultaneously, one for call control and the other for voice services.
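The scenario of one application drawing on two CSTA implementations at once can be pictured as simple composition. The sketch below is a loose illustration under stated assumptions: the class and method names (`CallControlProvider`, `VoiceServiceProvider`, `answer_call`, `recognize`) are hypothetical and are not taken from ECMA-269 or any vendor API, and the stub classes stand in for real providers.

```python
from typing import List, Protocol


class CallControlProvider(Protocol):
    """Hypothetical interface for a vendor supplying call control."""
    def answer_call(self, call_id: str) -> None: ...


class VoiceServiceProvider(Protocol):
    """Hypothetical interface for a vendor supplying voice services."""
    def recognize(self, audio: bytes) -> str: ...


class StubSwitch:
    """Stand-in call control implementation that records answered calls."""
    def __init__(self) -> None:
        self.answered: List[str] = []

    def answer_call(self, call_id: str) -> None:
        self.answered.append(call_id)


class StubSpeechServer:
    """Stand-in voice service implementation returning a fixed transcript."""
    def recognize(self, audio: bytes) -> str:
        return "check balance"


class Application:
    """A single application using two independent providers."""
    def __init__(self, call_control: CallControlProvider,
                 voice: VoiceServiceProvider) -> None:
        self._call_control = call_control
        self._voice = voice

    def handle_incoming(self, call_id: str, audio: bytes) -> str:
        self._call_control.answer_call(call_id)  # call control vendor
        return self._voice.recognize(audio)      # speech vendor
```

Because the application depends only on the two interfaces, either provider can be swapped for another vendor's implementation without changing the other.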
The CSTA profile that is used for voice service is not also as it is accurate calling control field.Some aspect profile of the present invention a kind of CSTA profile of voice service being provided using on the platform self-contained unit of XML of being used for.Though the CSTA profile is exactly a carrier; Its person's character is indefinite; In this example interactive with opposite end, promote-side better of the common application of two voice service profiles: based on the SIP environment of small-sized use CSTA, and based on ECMA-348 based on the web service environment.
Provide in this description that provides the subclass of CSTA voice service is how can be included in the example of being convenient to based in the speech processes of client computer-server.Following ECMA standard is incorporated herein by reference as a whole: ECNA-269 is used for the service of computer support telecommunication application program (CSTA), the stage 3; ECMA-323 is used for the SMLP agreement of computer support telecommunication application program (CSTA), stage 3; And ECMA-348, be used for the web service description language (sdl) (WSDL) of CSTA.In addition, this application program has been described the CSTA voice service and how in the VoIP environment based on SIP that uses uaCSTA to propose, have been carried out.
ECMA-TR/87 can be used as a reference to uaCSTA, and its a copy is incorporated herein by reference.
Speech processes based on client computer-server described here can be handled asymmetric medium type a response/request in the cycle.For example, when speech-recognition services was provided, server returned to client computer with the transfer of data that voice data will convert into after text data also will be changed.In the situation of phonetic synthesis, the client transmission text data, server response is with the voice data after changing.The data of transmission can be sent such as the specific protocol based on the agreement of CSTA according to one.The result is that SIP and web service environment can be expanded and comprise text-audio frequency or audio frequency-text, audio frequency-audio frequency interactive operation.
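The asymmetric request/response cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `recognize` and `synthesize` are hypothetical placeholders for real speech engines, and representing audio as `bytes` and text as `str` is an assumed convention.

```python
from typing import Callable, Union

Payload = Union[bytes, str]  # bytes = audio data, str = text data (assumed convention)


def make_speech_server(
    recognize: Callable[[bytes], str],
    synthesize: Callable[[str], bytes],
) -> Callable[[Payload], Payload]:
    """Return a handler that answers each request with the opposite media type."""

    def handle(request: Payload) -> Payload:
        if isinstance(request, bytes):
            # Speech recognition: audio in, text out.
            return recognize(request)
        if isinstance(request, str):
            # Speech synthesis: text in, audio out.
            return synthesize(request)
        raise TypeError("request must be audio (bytes) or text (str)")

    return handle


# Hypothetical stand-ins for real engines, used only to exercise the handler.
def fake_recognize(audio: bytes) -> str:
    return f"<{len(audio)} bytes of audio>"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


handler = make_speech_server(fake_recognize, fake_synthesize)
```

In this sketch a single entry point serves both directions, which mirrors the way one request/response cycle can carry different media types in each direction.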
ECMA-TR/87 establishes a "signaling channel" 308 transport, as shown in Fig. 6. Signaling channel 308 is used by server 302 and client 304 to exchange information about what each should do with respect to call control. When server 302 comprises a telephone switch, use of signaling channel 308 is sufficient.
However, if server 302 is a speech server and client 304 is requesting a speech service, server 302 must also know where to receive and transmit speech information. For example, server 302 needs to know where to obtain the information to be recognized and where to send synthesized speech.
Thus, in addition to establishing signaling channel 308, a "media channel" 310 protocol must also be established. For example, media channel 310 is used to transmit speech data (audio data) collected by client 304 to server 302. Similarly, in a text-to-speech operation, client 304 can send text data through signaling channel 308, while the synthesized speech data is returned by server 302 to client 304 through media channel 310.
With reference to the architecture of Fig. 5, signaling channel 308 and media channel 310 are established for any communication with speech server 204. It should be noted, however, that use of web application server 202 is optional and that the application can reside on client 30, as shown in Fig. 5.
One aspect of the present invention concerns the steps taken to implement media channel 310. In a first exemplary embodiment, establishing a media channel 310 for CSTA in a SIP environment is discussed. In a further exemplary embodiment, the steps taken to implement media channel 310 for CSTA in a web service based environment are discussed.
It should be noted that semantic information can be transferred between server 302 and client 304, for example by using the Speech Application Description Language (SADL), which can specify the XML schema for results returned by a listener resource (e.g., results returned by server 302 with speech recognition).
Channel establishment in a SIP environment
SIP is a protocol designed to be "chatty", in that server 302 and client 304 frequently exchange small pieces of information. In a SIP environment, media channel 310 is established through the Session Description Protocol (SDP). An exemplary method 400 for accomplishing this task is shown in Fig. 7.
At step 402, client 304 initiates a session with server 302 using a SIP INVITE.
An SDP description is also sent, declaring the IP (Internet Protocol) address and the port on that address to be used for audio. In addition, at step 404, the SDP description can advertise which codec type will be used for the media stream, as well as the communication protocol, such as Transmission Control Protocol (TCP) or Real-time Transport Protocol (RTP).
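The kind of SDP description sent with the invitation can be sketched as below. This is an illustrative assumption, not text from ECMA-TR/87: the function name, the session name, and the placeholder origin fields are hypothetical, and the sketch only covers an RTP offer of G.711 mu-law audio (static RTP payload type 0, PCMU at 8000 Hz).

```python
def build_sdp_offer(ip: str, port: int) -> str:
    """Build a minimal SDP description offering G.711 mu-law audio over RTP."""
    lines = [
        "v=0",                        # SDP protocol version
        f"o=- 0 0 IN IP4 {ip}",       # origin; session id and version are placeholders
        "s=CSTA speech session",      # session name (hypothetical)
        f"c=IN IP4 {ip}",             # IP address to be used for the audio
        "t=0 0",                      # session is not bounded in time
        f"m=audio {port} RTP/AVP 0",  # audio stream, port, transport, payload type
        "a=rtpmap:0 PCMU/8000",       # payload type 0 is G.711 mu-law at 8 kHz
    ]
    return "\r\n".join(lines) + "\r\n"


offer = build_sdp_offer("192.0.2.10", 6060)
```

The `c=` and `m=` lines carry the address and port declared by the client, and the `m=` and `a=rtpmap` lines carry the transport protocol and codec that the server is asked to accept.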
Upon receipt by the server, at step 406 the server can decide whether to accept the SDP description proposed by client 304. If the protocol and codec are accepted, server 302 responds with a SIP OK and its own SDP description listing its IP address and audio port. Method 400 then proceeds to step 408, where a signaling channel is established.
Alternatively, if server 302 does not support the proposed codec and protocol, server 302 can begin negotiating with client 304 over which codec and/or protocol to use. In other words, server 302 can respond to the initial SDP description from client 304 with a counter-proposal offering a different codec and/or protocol. Before a counter-proposal is made, method 400 proceeds to step 410, where a determination is made whether to continue. For example, at step 412, communication can be stopped after a specified number of counter-proposals have been made. At step 414, additional counter-proposals can be made between client 304 and server 302 until an agreement is reached or until it is clear that no agreement can be reached.
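The negotiation of steps 406 through 414 can be sketched as a bounded loop of counter-proposals. This is a simplified model under stated assumptions: the function name, the `(codec, protocol)` tuple representation, and the rule that the rejecting side counter-proposes its first untried option are all hypothetical, and no actual SIP messages are exchanged.

```python
from typing import Optional, Sequence, Tuple

Offer = Tuple[str, str]  # (codec, protocol)


def negotiate(
    client_prefs: Sequence[Offer],
    server_supported: Sequence[Offer],
    max_counter_proposals: int = 3,
) -> Optional[Offer]:
    """Return the agreed (codec, protocol) pair, or None if negotiation stops."""
    proposal = client_prefs[0]  # the client's initial SDP description
    counters = 0
    tried = set()
    while True:
        if proposal in server_supported and proposal in client_prefs:
            return proposal  # acceptable to both sides
        tried.add(proposal)
        if counters >= max_counter_proposals:
            return None  # limit on counter-proposals reached: stop
        # The side that rejected the proposal offers its first untried option.
        pool = server_supported if proposal not in server_supported else client_prefs
        remaining = [offer for offer in pool if offer not in tried]
        if not remaining:
            return None  # it is clear that no agreement can be reached
        proposal = remaining[0]
        counters += 1


agreed = negotiate(
    [("G.729", "RTP"), ("G.711", "RTP")],
    [("G.711", "TCP"), ("G.711", "RTP")],
)
```

In the usage above, the server rejects the initial offer and counter-proposes G.711 over TCP, the client counter-proposes G.711 over RTP, and both sides accept that pair.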
SIP/SDP is a standard approved by the Internet Engineering Task Force (IETF) for establishing audio channels in VoIP. However, SIP/SDP does not describe a method for establishing a signaling channel that implements CSTA. At step 408, signaling channel 308 is established in accordance with ECMA-TR/87. After the signaling channel is established, the application association is considered complete. As a result, distributed speech services can be implemented in system 300.
Channel establishment in a web service environment
In contrast to the "chatty" nature of SIP described above, web services are designed and often optimized for "chunky" communication, so that fewer dialog exchanges are needed between server 302 and client 304.
As a result, features that are negotiated over multiple dialog turns in SIP are usually described and discovered through service descriptions that are published in public directories for web services or obtained dynamically through web service metadata exchange. The web service environment includes the UDDI (Universal Description, Discovery and Integration) standard protocol. Web service providers publish relevant information so that application developers can discover, obtain, and thereby select an appropriate service provider, which allows application developers to integrate web services into their applications dynamically. For example, ECMA-348 specifies the Web Services Description Language (WSDL) for CSTA, so that web services offering CSTA functionality can be uniformly described, discovered, and integrated using standard web service protocols. Establishment of the media channel is an extension to ECMA-348.
Fig. 8 shows an exemplary method 420 for establishing channels in a web service environment. In the present invention, at step 422, the web service provider lists all codecs and protocols supported by the web service as service metadata. At step 424, an application developer can use a web service directory provider to obtain or discover which web services offer codecs and protocols that the application can use. This step can be implemented by searching the metadata provided by each web service for the required codec and protocol. The directory provides a URL (Uniform Resource Locator) address for each web service. Client 304 then establishes a connection to the web service and uses an application with the required codec and protocol to communicate with server 302. Once the connection is established, media channel 310 and signaling channel 308 are established at the same time.
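The directory search of step 424 can be sketched as a filter over published service metadata. This is a minimal sketch, assuming an in-memory directory: `ServiceEntry`, `find_services`, and the example URLs are hypothetical and do not correspond to the UDDI API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ServiceEntry:
    """Metadata a provider publishes for one web service (step 422)."""
    url: str
    codecs: FrozenSet[str]
    protocols: FrozenSet[str]


def find_services(
    directory: Iterable[ServiceEntry], codec: str, protocol: str
) -> List[str]:
    """Return the URLs of services that support the required codec and protocol."""
    return [
        entry.url
        for entry in directory
        if codec in entry.codecs and protocol in entry.protocols
    ]


# Hypothetical directory contents.
directory = [
    ServiceEntry("http://server.acme.com/speech", frozenset({"G.711"}), frozenset({"RTP"})),
    ServiceEntry("http://other.example.com/speech", frozenset({"G.729"}), frozenset({"RTP", "TCP"})),
]
matches = find_services(directory, "G.711", "RTP")
```

Because the selection happens against published metadata, the client can connect to a matching URL without the multi-turn negotiation used in the SIP environment.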
In the web service environment, the present invention addresses how to establish connections through all layers (the application layer and the transport layer) in one exchange by means of a media description extension to WSDL. In a first embodiment, the present invention can be used in conjunction with ECMA-348, which already contains a mechanism for establishing CSTA and its underlying signaling transport protocol. By adding media coding and transport protocol extensions to ECMA-348, CSTA is enhanced to establish the signaling and media channels in a single step.
In another embodiment, the media description is conveyed using the extensibility of the Web Services Addressing (WS-Addressing) protocol, as a step preceding the CSTA application association. WS-Addressing (WSA) is a specification that provides transport-neutral mechanisms for addressing web service endpoints and messages. CSTA switching functions and CSTA applications are both web service endpoints. WS-Addressing introduces a new construct called an endpoint reference, which supports dynamic usage of services that cannot be adequately covered by the <wsdl:service> and <wsdl:port> elements in WSDL.
WS-Addressing defines an XML type (wsa:EndpointReferenceType) to represent an endpoint reference. An XML element, wsa:EndpointReference, is also specified to have this type. Both reside in the XML namespace http://schemas.xmlsoap.org/ws/2004/03/addressing.
A WSA endpoint reference type can contain the following:
[address]: URI (Uniform Resource Identifier), which identifies the endpoint.
[reference properties]: <xs:any/> (0..unbounded), specific properties, one for each entity or resource being conveyed.
[selected port type]: QName (0..1), the name of the primary port type as defined in WSDL for the endpoint.
[service and port]: (QName, NCName (0..1)) (0..1), the service and port, as defined in WSDL, that correspond to the endpoint.
[policy]: optional WS-Policy elements describing the behavior, requirements, and capabilities of the endpoint.
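The fields listed above can be mirrored in a small data structure. This is a sketch only: the class and attribute names are hypothetical, QNames and policies are simplified to plain strings, and the cardinalities follow the list above (a required address, any number of reference properties, and optional remaining fields).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class EndpointReference:
    """Simplified model of the fields of a WSA endpoint reference."""
    address: str                                                          # [address], required
    reference_properties: Dict[str, str] = field(default_factory=dict)    # 0..unbounded
    selected_port_type: Optional[str] = None                              # QName, 0..1
    service_port: Optional[Tuple[str, Optional[str]]] = None              # (service, port), 0..1
    policy: Optional[str] = None                                          # optional policy

    def __post_init__(self) -> None:
        # The address is the only mandatory field in this model.
        if not self.address:
            raise ValueError("an endpoint reference requires an address")


media_endpoint = EndpointReference(
    address="rtp://server.acme.com:6060",
    reference_properties={"Codec": "G.711"},
)
```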
As in the case of SIP, establishing an audio channel is necessary for CSTA speech services. Just as an audio channel can be negotiated through SDP in SIP, a WSA endpoint reference can be used by a speech service provider to declare the media endpoint. The media transport protocol and coding mechanism are among the key items that need to be specified to facilitate speech services. These items are declared as reference properties.
To improve robustness, the media channel in the web service environment is modeled as a lease from the server (the CSTA speech resource provider) to the client (the CSTA application), and the lease expires over time. The server can also designate a lease manager with which the client can cancel or renew the lease.
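The lease model can be sketched as below. This is an illustrative assumption about how such a lease might be tracked, not a mechanism defined by ECMA-348: the `MediaLease` class and its methods are hypothetical, and the lease manager is reduced to the `renew` and `cancel` operations.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class MediaLease:
    """A media channel lease that expires over time unless it is renewed."""

    def __init__(self, subscription_id: str, expires: datetime) -> None:
        self.subscription_id = subscription_id
        self.expires = expires
        self.cancelled = False

    def is_active(self, now: Optional[datetime] = None) -> bool:
        # Default to the current time when the caller does not supply one.
        now = now or datetime.now(timezone.utc)
        return not self.cancelled and now < self.expires

    def renew(self, extension: timedelta) -> None:
        if self.cancelled:
            raise RuntimeError("cannot renew a cancelled lease")
        self.expires += extension

    def cancel(self) -> None:
        self.cancelled = True


pacific = timezone(timedelta(hours=-8))
lease = MediaLease("12345", datetime(2004, 10, 21, 21, 7, tzinfo=pacific))
```

Expiring the channel by default means that a client that disappears without cancelling does not hold server resources indefinitely, which is the robustness benefit the lease model aims at.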
A CSTA media endpoint reference type, with an XML schema, includes one or more WSA endpoint references. For example, a CSTA speech service provider that uses the G.711 protocol over Real-time Transport Protocol (RTP) on port 6060 can describe its media endpoint as follows:
<csta:MediaEndpointReference
    xmlns:csta="http://www.ecma-international.org/TR/xx"
    xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <wsa:Address>rtp://server.acme.com:6060</wsa:Address>
  <wsa:ReferenceProperties>
    <csta:Codec>G.711</csta:Codec>
    <csta:SubscriptionID>12345</csta:SubscriptionID>
    <csta:Expires>2004-10-21T21:07:00.000-08:00</csta:Expires>
  </wsa:ReferenceProperties>
</csta:MediaEndpointReference>
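A client might read such a media endpoint reference as sketched below. This is a hedged illustration using only the Python standard library: the function name and the returned dictionary layout are assumptions, and the sample document repeats the endpoint description given above with its namespace URIs and element names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Union
from urllib.parse import urlparse

NS = {
    "csta": "http://www.ecma-international.org/TR/xx",
    "wsa": "http://schemas.xmlsoap.org/ws/2004/03/addressing",
}

SAMPLE = """<csta:MediaEndpointReference
    xmlns:csta="http://www.ecma-international.org/TR/xx"
    xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <wsa:Address>rtp://server.acme.com:6060</wsa:Address>
  <wsa:ReferenceProperties>
    <csta:Codec>G.711</csta:Codec>
    <csta:SubscriptionID>12345</csta:SubscriptionID>
    <csta:Expires>2004-10-21T21:07:00.000-08:00</csta:Expires>
  </wsa:ReferenceProperties>
</csta:MediaEndpointReference>"""


def parse_media_endpoint(xml_text: str) -> Dict[str, Union[str, int, None]]:
    """Extract transport, host, port, codec, and lease data from the reference."""
    root = ET.fromstring(xml_text)
    address = urlparse(root.findtext("wsa:Address", default="", namespaces=NS).strip())
    props = root.find("wsa:ReferenceProperties", NS)

    def prop(name: str) -> Optional[str]:
        # Reference properties are optional, so tolerate a missing container.
        return None if props is None else props.findtext(f"csta:{name}", namespaces=NS)

    return {
        "protocol": address.scheme,   # media transport protocol, e.g. "rtp"
        "host": address.hostname,
        "port": address.port,
        "codec": prop("Codec"),
        "subscription_id": prop("SubscriptionID"),
        "expires": prop("Expires"),   # kept as the original timestamp string
    }


endpoint = parse_media_endpoint(SAMPLE)
```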
The CSTA media endpoint reference properties include a codec declaration, a subscription identifier, and an optional lease expiration declaration. As in the case of uaCSTA, where the media channel is established together with the signaling channel, the above media endpoint reference must be included before the CSTA application association process in the web service environment is considered complete.
Taking advantage of the extensibility of the WS protocols, a speech session can be established using <wsa:Action>. The media endpoint reference can itself be a reference property in the endpoint reference of a CSTA web service provider. A Simple Object Access Protocol (SOAP) message can be composed by attaching the media endpoint reference immediately after <wsa:To>, as follows:
<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing"
    xmlns:csta="http://www.ecma-international.org/TR/xx">
  <soap:Header>
    <wsa:ReplyTo>
      <wsa:Address>http://example.client.com</wsa:Address>
    </wsa:ReplyTo>
    <wsa:To>http://server.acme.com</wsa:To>
    <csta:MediaEndpointReference>
    </csta:MediaEndpointReference>
    <wsa:Action>
      http://www.ecma-international.org/TR/xx/CreateSession
    </wsa:Action>
    <wsa:MessageID>...</wsa:MessageID>
  </soap:Header>
  <soap:Body>
  </soap:Body>
</soap:Envelope>
Web services are described by metadata such as WS-Policy and WSDL. While WS-Policy describes the general capabilities, requirements, and characteristics of a service, WSDL describes the abstract message operations, the concrete network protocols, and the addresses used to reach the web service. Web Services Metadata Exchange, or WS-MEX or WSX, is a specification that bootstraps the retrieval of metadata. A client can send a WS-MEX request to an endpoint to obtain its metadata. A normative outline of the request using SOAP is as follows:
<soap:Envelope ...>
  <soap:Header ...>
    <wsa:Action>
      http://schemas.xmlsoap.org/ws/2004/09/mex/GetMetadata/Request
    </wsa:Action>
    <wsa:MessageID><xs:anyURI/></wsa:MessageID>
    <wsa:ReplyTo>WS-Addressing endpoint reference</wsa:ReplyTo>
    <wsa:To><xs:anyURI/></wsa:To>
  </soap:Header>
  <soap:Body>
    <wsx:GetMetadata ...>
      [<wsx:Dialect [Identifier='<xs:anyURI/>']? >
        <xs:anyURI/>
      </wsx:Dialect>
      ]*
    </wsx:GetMetadata>
  </soap:Body>
</soap:Envelope>
As shown in the SOAP header, WS-MEX uses WS-Addressing to specify the request for retrieving metadata. The target service is specified as a URI in <wsa:To>, and the reply endpoint is specified using a WS-Addressing endpoint reference in the content of <wsa:ReplyTo>. The types of metadata to be retrieved are specified in the content of <wsx:GetMetadata> in the SOAP body.
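A request following this outline could be assembled as sketched below. This is a minimal sketch using the Python standard library, not a WS-MEX client: the helper names are hypothetical, the WS-Addressing namespace is assumed to be the 2004/08 version, and the dialect URI for the media endpoint is the placeholder used in this description.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SOAP = "http://www.w3.org/2003/05/soap-envelope"
WSA = "http://schemas.xmlsoap.org/ws/2004/08/addressing"  # assumed version
WSX = "http://schemas.xmlsoap.org/ws/2004/09/mex"
MEDIA_DIALECT = "http://www.ecma-international.org/TR/XX/MediaEndpoint"


def q(namespace: str, tag: str) -> str:
    """Return a namespace-qualified tag in ElementTree's {uri}tag form."""
    return f"{{{namespace}}}{tag}"


def build_get_metadata(
    to: str, reply_to: str, message_id: str, dialects: Iterable[str]
) -> ET.Element:
    """Build a SOAP envelope that asks an endpoint for the given metadata dialects."""
    envelope = ET.Element(q(SOAP, "Envelope"))
    header = ET.SubElement(envelope, q(SOAP, "Header"))
    ET.SubElement(header, q(WSA, "Action")).text = WSX + "/GetMetadata/Request"
    ET.SubElement(header, q(WSA, "MessageID")).text = message_id
    reply = ET.SubElement(header, q(WSA, "ReplyTo"))      # reply endpoint reference
    ET.SubElement(reply, q(WSA, "Address")).text = reply_to
    ET.SubElement(header, q(WSA, "To")).text = to          # target service URI
    body = ET.SubElement(envelope, q(SOAP, "Body"))
    get_metadata = ET.SubElement(body, q(WSX, "GetMetadata"))
    for dialect in dialects:                               # one Dialect per metadata type
        ET.SubElement(get_metadata, q(WSX, "Dialect")).text = dialect
    return envelope


request = build_get_metadata(
    "http://server.acme.org",
    "http://client.example.com/MyEndpoint",
    "uuid:12345edf-53c1-4923-ba23-23459cee433e",
    [MEDIA_DIALECT],
)
request_xml = ET.tostring(request, encoding="unicode")
```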
If an endpoint accepts a GetMetadata request, it must reply with a GetMetadata response message. A normative outline of the response in SOAP is as follows:
<soap:Envelope ...>
  <soap:Header ...>
    <wsa:Action>
      http://schemas.xmlsoap.org/ws/2004/09/mex/GetMetadata/Response
    </wsa:Action>
    <wsa:RelatesTo>previous message id</wsa:RelatesTo>
    <wsa:To><xs:anyURI/></wsa:To>
  </soap:Header>
  <soap:Body>
    <wsx:Metadata ...>
      [<wsx:MetadataSection Dialect="dialect URI"
          [Identifier='previous identifier']>
        <xs:any/> <!-- service specific data section -->
        |
        <wsx:MetadataReference>
          WS-Addressing endpoint reference
        </wsx:MetadataReference>
        |
        <wsx:Location><xs:anyURI/></wsx:Location>
        ]
      </wsx:MetadataSection>]*
    </wsx:Metadata>
  </soap:Body>
</soap:Envelope>
The metadata conveyed in the SOAP body can be returned inline as the content of <wsx:Metadata>, or by reference using a WS-Addressing endpoint reference or a simple URI.
The above SOAP messages can have the following WSDL bindings:
<wsdl:message name="GetMetadataMsg">
  <wsdl:part name="body" element="tns:GetMetadata"/>
</wsdl:message>
<wsdl:message name="GetMetadataResponseMsg">
  <wsdl:part name="body" element="tns:Metadata"/>
</wsdl:message>
<wsdl:portType name="MetadataExchange">
  <wsdl:operation name="GetMetadata">
    <wsdl:input message="tns:GetMetadataMsg"
        wsa:Action="http://schemas.xmlsoap.org/ws/2004/09/mex/GetMetadata/Request"/>
    <wsdl:output message="tns:GetMetadataResponseMsg"
        wsa:Action="http://schemas.xmlsoap.org/ws/2004/09/mex/GetMetadata/Response"/>
  </wsdl:operation>
</wsdl:portType>
The CSTA media description is a type of metadata that a CSTA application must obtain from the speech service provider. WS-MEX is particularly suitable here. The following is a sample SOAP message for retrieving a media endpoint reference:
<soap:Envelope
    xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing"
    xmlns:wsx="http://schemas.xmlsoap.org/ws/2004/09/mex"
    xmlns:csta="http://www.ecma-international.org/TR/XX">
  <soap:Header>
    <wsa:Action>
      http://schemas.xmlsoap.org/ws/2004/09/mex/GetMetadata/Request
    </wsa:Action>
    <wsa:MessageID>
      uuid:12345edf-53c1-4923-ba23-23459cee433e
    </wsa:MessageID>
    <wsa:ReplyTo>
      <wsa:Address>http://client.example.com/MyEndpoint</wsa:Address>
    </wsa:ReplyTo>
    <wsa:To>http://server.acme.org</wsa:To>
  </soap:Header>
  <soap:Body>
    <wsx:GetMetadata>
      <wsx:Dialect>
        http://www.ecma-international.org/TR/XX/MediaEndpoint
      </wsx:Dialect>
    </wsx:GetMetadata>
  </soap:Body>
</soap:Envelope>
The example shows a client application located at client.example.com requesting a media endpoint reference from a CSTA speech service provider located at server.acme.org. Because a specific dialect is specified, the server must respond with only the metadata of the requested type. A SOAP response message can be:
<soap:Envelope ...>
  <soap:Header>
    <wsa:Action>
      http://schemas.xmlsoap.org/ws/2004/09/mex/GetMetadata/Response
    </wsa:Action>
    <wsa:RelatesTo>
      uuid:12345edf-53c1-4923-ba23-23459cee433e
    </wsa:RelatesTo>
    <wsa:To>http://client.example.com/MyEndpoint</wsa:To>
  </soap:Header>
  <soap:Body>
    <wsx:Metadata>
      <wsx:MetadataSection Dialect=
          "http://www.ecma-international.org/TR/XX/MediaEndpoint">
        <csta:MediaEndpointReference>
          <wsa:Address>rtp://server.acme.org:6060</wsa:Address>
          <wsa:ReferenceProperties>
            <csta:Codec>G.711</csta:Codec>
            <csta:SubscriptionID>12345</csta:SubscriptionID>
            <csta:Expires>2004-10-21T21:00:00.0-22:00</csta:Expires>
          </wsa:ReferenceProperties>
        </csta:MediaEndpointReference>
      </wsx:MetadataSection>
    </wsx:Metadata>
  </soap:Body>
</soap:Envelope>
A speech application description is another type of metadata that a speech service can provide. Multiple metadata types can be obtained at the same time by populating <wsx:GetMetadata> with their respective URIs through <wsx:Dialect>. The following is an example of a SOAP body for obtaining both the media endpoint and the speech application references:
<wsx:GetMetadata>
  <wsx:Dialect>
    http://www.ecma-international.org/TR/xx/MediaEndpoint
  </wsx:Dialect>
  <wsx:Dialect>
    http://www.ecma-international.org/TR/xx/SpeechApplicationDescription
  </wsx:Dialect>
</wsx:GetMetadata>
The corresponding response in the SOAP body:
<wsx:Metadata>
  <wsx:MetadataSection Dialect=
      "http://www.ecma-international.org/TR/xx/MediaEndpoint">
  </wsx:MetadataSection>
  <wsx:MetadataSection Dialect=
      "http://www.ecma-international.org/TR/xx/SpeechApplicationDescription">
    <csta:resource id="US?AddressRecognition">
      <csta:type>Listener</csta:type>
      <csta:grammar
          uri="urn:acme.com/address/street?number.grxml"
          schema="urn:acme.com/address/street?number.xsd"/>
      <csta:grammar
          uri="urn:acme.com/address/city.grxml">
        <csta:rule id="zip_code"
            schema="urn:acme.com/address/zip.xsd"/>
        <csta:rule id="city_state"
            schema="urn:acme.com/address/city.xsd"/>
      </csta:grammar>
    </csta:resource>
  </wsx:MetadataSection>
</wsx:Metadata>
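A client receiving several metadata types in one response might separate them by dialect as sketched below. This is a minimal sketch using the Python standard library: the function name is hypothetical, and the sample is a shortened, assumed version of the response above in which the resource id `AddressRecognition` is a simplified placeholder.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

WSX = "http://schemas.xmlsoap.org/ws/2004/09/mex"
CSTA = "http://www.ecma-international.org/TR/xx"

SAMPLE = """<wsx:Metadata
    xmlns:wsx="http://schemas.xmlsoap.org/ws/2004/09/mex"
    xmlns:csta="http://www.ecma-international.org/TR/xx">
  <wsx:MetadataSection
      Dialect="http://www.ecma-international.org/TR/xx/MediaEndpoint">
  </wsx:MetadataSection>
  <wsx:MetadataSection
      Dialect="http://www.ecma-international.org/TR/xx/SpeechApplicationDescription">
    <csta:resource id="AddressRecognition">
      <csta:type>Listener</csta:type>
    </csta:resource>
  </wsx:MetadataSection>
</wsx:Metadata>"""


def sections_by_dialect(metadata_xml: str) -> Dict[str, List[ET.Element]]:
    """Group the child elements of each MetadataSection under its Dialect URI."""
    root = ET.fromstring(metadata_xml)
    sections: Dict[str, List[ET.Element]] = {}
    for section in root.iter(f"{{{WSX}}}MetadataSection"):
        # Sections that share a dialect are merged into one list.
        sections.setdefault(section.get("Dialect"), []).extend(list(section))
    return sections


sections = sections_by_dialect(SAMPLE)
speech = sections[CSTA + "/SpeechApplicationDescription"]
```

Keying the result by dialect URI lets the client hand each section to the component that understands that metadata type, for example the media channel setup or the grammar loader.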
Although web services start with a one-way, request-and-reply model, web services often want to receive messages when events occur in other services or applications. Web Services Eventing, or WS-Eventing (WSE), is a specification that facilitates event notification. WS-Eventing defines how a web service can subscribe to events on behalf of another service or application, and allows applications to specify how event messages are delivered. It supports a wide variety of eventing topologies, allowing the event source and the final event sink to be decoupled. These properties are suitable for a wide range of CSTA applications, from call centers to mobile computing. WS-Eventing is used because CSTA speech services need event notification to function.
Although the present invention has been described with reference to particular embodiments, those skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (14)

1. A method of communicating between a client and a server, characterized in that it comprises:
proposing a first codec and a first protocol with the client;
determining, with the server, whether the first codec and the first protocol are acceptable to the server;
accepting the proposed codec and the proposed protocol when they are acceptable to both the client and the server;
establishing, in a web service environment, a media channel and a signaling channel in a single step, wherein the proposed first codec and the proposed first protocol are to be used on the media channel; and
exchanging information between the client and the server through the media channel and the signaling channel.
2. The method of claim 1, characterized in that it further comprises proposing, with the server, a second protocol and a second codec to the client when the proposed first protocol and the proposed first codec are not acceptable to the server.
3. The method of claim 1, characterized in that establishing the media channel comprises declaring an Internet Protocol address and a port associated with the Internet Protocol address.
4. The method of claim 1, characterized in that it further comprises providing a list comprising at least one codec and at least one protocol for establishing the media channel.
5. The method of claim 1, characterized in that exchanging information comprises transmitting audio data over the media channel.
6. A method for providing speech services, characterized in that the method comprises:
receiving signaling information through a signaling channel according to an established signaling protocol;
receiving speech information through a media channel according to an established codec and protocol, wherein the media channel expires over time; and
processing the signaling information and the speech information in a web service environment,
wherein the signaling channel and the media channel are established in a single step.
7. The method of claim 6, characterized in that the method further comprises performing speech recognition on the speech information.
8. The method of claim 6, characterized in that the method further comprises providing a Computer Supported Telecommunications Applications (CSTA) interface.
9. The method of claim 6, characterized in that the method further comprises interpreting Simple Object Access Protocol (SOAP) messages.
10. The method of claim 6, characterized in that the method further comprises processing the speech information to identify semantic information contained therein.
11. The method of claim 6, characterized in that the method further comprises sending information to a particular port associated with an Internet Protocol (IP) address.
12. The method of claim 6, characterized in that the method further comprises sending a Simple Object Access Protocol (SOAP) message.
13. the method for a process information in computer network is characterized in that, comprising:
In the web service environment through setting up media channel in the once exchange between client-server and setting up signaling channel opening relationships between said client computer and said server;
Send data from said client computer to said server according to specific protocol, said data comprise voice data or text data;
If said data are voice datas, convert said data into text data from voice data, if said data are text datas, convert said data into voice data from text data; And
Data after will changing according to said specific protocol send to said client computer from said server.
14., it is characterized in that said specific protocol is based on CSTA's (computer support telecommunication application program) like the said method of claim 13.
CN 200510113305 2004-10-22 2005-09-22 Distributed speech service Expired - Fee Related CN1764190B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US62130304P 2004-10-22 2004-10-22
US60/621,303 2004-10-22
US11/058,892 2005-02-16
US11/058,892 US8396973B2 (en) 2004-10-22 2005-02-16 Distributed speech service

Publications (2)

Publication Number Publication Date
CN1764190A CN1764190A (en) 2006-04-26
CN1764190B true CN1764190B (en) 2012-12-12

Family

ID=36748130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510113305 Expired - Fee Related CN1764190B (en) 2004-10-22 2005-09-22 Distributed speech service

Country Status (2)

Country Link
CN (1) CN1764190B (en)
ZA (1) ZA200507606B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8761132B2 (en) * 2006-10-31 2014-06-24 Cisco Technology, Inc. Enhanced wireless voice services using a signaling protocol
CN103151041B (en) * 2013-01-28 2016-02-10 中兴通讯股份有限公司 A kind of implementation method of automatic speech recognition business, system and media server
RU2658602C2 (en) * 2013-08-29 2018-06-22 Юнифай Гмбх Унд Ко. Кг Maintaining audio communication in an overloaded communication channel
US10069965B2 (en) 2013-08-29 2018-09-04 Unify Gmbh & Co. Kg Maintaining audio communication in a congested communication channel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
M. Handley et al. SDP: Session Description Protocol. RFC 2327. 1998, pp. 21, 26. *

Also Published As

Publication number Publication date
CN1764190A (en) 2006-04-26
ZA200507606B (en) 2007-05-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121212

Termination date: 20140922

EXPY Termination of patent right or utility model