CN102427465A - Voice service proxy method and device and system for integrating voice application through proxy

Info

Publication number: CN102427465A
Application number: CN2011102382026A
Authority: CN
Inventors: 朱敏
Original assignee: Qingdao Hisense Electronics Co Ltd
Current assignee: Qingdao Hisense Electronics Co Ltd
Priority date: 2011-08-18
Filing date: 2011-08-18
Publication date: 2012-04-25
Anticipated expiration: 2031-08-18
Also published as: CN102427465B

Abstract

The invention provides a voice service proxy method and device, and a system for integrating voice applications through a proxy. The voice service proxy method comprises the following steps: receiving a voice request sent by a voice application; identifying the received voice request according to a universal voice service communication protocol; assigning a task for the voice request according to the identification result; acquiring the data corresponding to the voice request according to the task assignment result; sending the acquired data to a voice cloud server, which processes the data according to the voice request, and acquiring the processing result from the voice cloud server once processing is complete; and returning the processing result to the voice application that sent the voice request. With the invention, an application need not be concerned with how the voice technology is implemented: it can add voice functions merely through simple interface calls and message communication with the voice service proxy. Voice processing is thus integrated simply and conveniently into ordinary applications, and no separate voice library needs to be developed for each application.

Description

Voice service proxy method and device, and system for integrating voice applications through a proxy

Technical field

The present invention relates to the field of human-computer interaction technology, and in particular to a voice service proxy method and device, and a system for integrating voice applications through a proxy.

Background art

The rapid development of computer technology has made it possible to miniaturize devices and integrate rich applications into them. The resulting smart devices keep emerging and bring great convenience to daily life. For these smart devices, the convenience and ease of use of the human-computer interaction between user and device has become a major criterion for evaluating a device's capability.

In flat-panel TVs, for example, the trend toward networking and intelligence is increasingly evident, and with massive amounts of audio, video and other information available, human-computer interaction matters more and more. As an important component of human-computer interaction, the input method has a great effect on user experience, yet entering Chinese text on existing TVs is very cumbersome. Speech recognition technology has therefore been introduced so that voice input can serve as a supplementary means of Chinese input, letting users easily experience the appeal of voice technology and improving the user experience of applications. Likewise, with the speech synthesis technology of the prior art, users can not only read news on the TV but also listen to it, which further improves the user experience.

In the prior art, however, developers of speech recognition or speech synthesis technology often focus only on the specific recognition and synthesis techniques, or provide these techniques only in applications they develop themselves, and often do not consider the generality of the speech recognition or speech synthesis technology.

In the course of implementing the present invention, the inventor found at least the following problem in the prior art. New applications are constantly being developed for today's smart devices. The intelligent platforms of existing flat-panel TVs, for example, run not only the Linux operating system but also the Android operating system, and on these different operating systems there are applications developed in different languages, which may be C/C++, Java, JSP and so on. Many of these applications could offer speech recognition or speech synthesis services to improve user experience and convenience. Because of the differences in operating system or development language, however, these separately developed applications are usually incompatible with one another, and the speech recognition or speech synthesis services they provide are usually based on separately developed voice libraries.

If the speech recognition or speech synthesis service of one application's existing voice library is to be integrated into another application, secondary development on that voice library must be carried out for the operating system and development language of the target application. This requires the developer not only to understand the relevant interfaces for voice capture, speech recognition, speech synthesis and voice playback, but also to port them to the different development environments of the different applications, which makes secondary development very difficult, no easier than designing a brand-new application system. Yet designing application software with voice functions from scratch duplicates the work already done in the existing voice library services, which clearly wastes system resources, manpower and time and lowers development efficiency.

Summary of the invention

(1) Technical problem to be solved

In view of the above shortcomings, and to solve the prior-art problem that voice functions cannot be integrated quickly into applications, the present invention provides a voice service proxy method and device and a system for integrating voice applications through a proxy, so that application software developed under different systems and in different languages can conveniently use the same voice library to provide the same speech recognition and/or speech synthesis services.

(2) Technical solution

To solve the above technical problem, in one aspect the present invention provides a voice service proxy method, the method comprising the steps of:

S1: receiving a voice request sent by a voice application;

S2: identifying the received voice request according to a universal voice service communication protocol;

S3: assigning a task for the voice request according to the identification result;

S4: acquiring the data corresponding to the voice request according to the task assignment result; sending the acquired data to a voice cloud server, which processes the data according to the voice request; and, after processing is complete, obtaining the processing result from the voice cloud server;

S5: returning the processing result to the voice application that sent the voice request.

In another aspect, the present invention also provides a voice service proxy device, the device comprising:

a request receiving unit, configured to receive a voice request sent by a voice application;

a request identification unit, configured to identify the received voice request according to a universal voice service communication protocol;

a task assignment unit, configured to assign a task for the voice request according to the identification result;

a task implementation unit, configured to acquire the data corresponding to the voice application according to the task assignment result, to send the acquired data to a voice cloud server, which processes the data according to the voice request, and, after processing is complete, to obtain the processing result from the voice cloud server;

a result feedback unit, configured to return the processing result to the voice application that sent the voice request.

In yet another aspect, the present invention also provides a system for integrating voice applications through a proxy, the system comprising:

at least one voice application device, which receives a user's speech recognition and/or speech synthesis request and sends the request to a voice service proxy device through an integrated unified calling interface;

the voice service proxy device, which identifies the received speech recognition and/or speech synthesis request by means of an integrated universal voice service communication protocol, dispatches the request to the corresponding at least one task implementation unit, and, through the interaction of the task implementation unit with a voice cloud server, obtains the processing result of the request and returns it to the at least one voice application device;

the voice cloud server, which performs speech recognition and/or speech synthesis processing on the data sent by the corresponding at least one task implementation unit and returns the processing result to the voice service proxy device.

(3) Beneficial effects

In the above technical solution of the present invention, all applications can use speech recognition and/or speech synthesis through the voice service proxy. An application need not be concerned with how the voice technology is implemented; it only has to call the simple interface of the voice service proxy to gain voice functions. Voice processing is thus integrated simply and conveniently into ordinary applications, with no need to develop a separate voice library for each application.

Brief description of the drawings

Fig. 1 is a schematic diagram of the general processing flow of the voice service proxy method in an embodiment of the invention;

Fig. 2 is a schematic flowchart of the voice service proxy method processing a speech recognition request and/or a speech synthesis request in an embodiment of the invention;

Fig. 3 is a structural diagram of the units of the voice service proxy device in an embodiment of the invention;

Fig. 4 is an architecture diagram of the system for integrating voice applications through a proxy in an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

In the embodiments of the present invention, applications are given a universal voice service communication protocol over TCP/IP, developed on the Socket interface. Whether an application is written in C/C++ or in Java/JSP, as long as it can communicate with the voice service proxy over TCP/IP using this protocol and exchange messages to obtain the proxy's support, it can use the speech recognition and speech synthesis functions. The application therefore need not consider how the voice functions are implemented or how the interaction with the voice cloud server proceeds; it only has to send and receive TCP/IP messages to and from the voice service proxy as the protocol prescribes. All voice functions are implemented by the voice service proxy: the user only needs to send a message command in the protocol format to tell the proxy what the application wants done, and the proxy automatically identifies the command, executes the corresponding function and returns the result. Relying on this universal protocol, the voice service proxy device identifies and processes voice requests: it identifies the type of a request from the protocol format, processes the request according to its type to obtain a result, and then notifies the application with a TCP/IP message in the agreed format so that the application obtains the result.

Specifically, as shown in Fig. 1, the general steps of the voice service proxy method in the embodiments of the invention are: receiving a voice request sent by a voice application; identifying the received voice request according to the universal voice service communication protocol; assigning a task for the voice request according to the identification result; acquiring the data corresponding to the voice request according to the task assignment result; sending the acquired data to the voice cloud server, which processes the data according to the voice request, and obtaining the processing result from the voice cloud server after processing is complete; and returning the processing result to the voice application that sent the voice request.
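For illustration only, the five general steps can be sketched in C++ as follows. The types VoiceRequest, TaskKind and VoiceServiceProxy, the command name SpeakBegin and the "ERROR" return value are hypothetical names chosen for this sketch and are not defined by the protocol; the voice cloud server is stood in for by a plain callback.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical request and task types; the text does not define them.
enum class TaskKind { Recognition, Synthesis, Unknown };

struct VoiceRequest {
    std::string command;  // e.g. "ListenBegin"
    std::string payload;  // data that accompanies the request
};

// Identify the request. "SpeakBegin" is a placeholder command name
// invented for this sketch.
TaskKind identify(const VoiceRequest& request) {
    if (request.command == "ListenBegin") return TaskKind::Recognition;
    if (request.command == "SpeakBegin") return TaskKind::Synthesis;
    return TaskKind::Unknown;
}

// Stand-in for the voice cloud server: takes data, returns the result.
using CloudCall = std::function<std::string(const std::string&)>;

class VoiceServiceProxy {
public:
    void registerTask(TaskKind kind, CloudCall cloud) {
        tasks_[kind] = std::move(cloud);
    }

    // Receives a request, identifies it, assigns it to a task, lets the
    // task obtain the result from the cloud, and returns that result.
    std::string handle(const VoiceRequest& request) const {
        const TaskKind kind = identify(request);   // identification
        const auto task = tasks_.find(kind);       // task assignment
        if (task == tasks_.end()) return "ERROR";  // no task for this request
        return task->second(request.payload);      // processing and return
    }

private:
    std::map<TaskKind, CloudCall> tasks_;
};
```

An application-side caller would only see handle(); which task implementation runs, and how it talks to the cloud server, stays inside the proxy.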

In the embodiments of the present invention, the user application module and the voice service proxy communicate over a local Socket according to the universal voice service communication protocol. The voice service proxy acts as the TCP server and, once started, waits for connections from user application modules. A user application module keeps a long-lived connection to the voice service proxy. Unless otherwise agreed, the voice service proxy listens on port 20000.

The general message frame format of the universal voice service communication protocol is shown in Table 1. A frame consists of a frame header, frame type, content length, content, check value and frame tail, whose lengths are respectively 1 byte, 1 byte, 2 bytes, n bytes (depending on the actual content), 1 byte and 1 byte:

Name | Frame header | Frame type | Content length | Content | Check | Frame tail
Length (bytes) | 1 | 1 | 2 | n | 1 | 1
Content | 0x16 | 0x01-0x08 | n | xxxxxx | Check value | 0x96

Table 1: Frame format

Specifically, the frame header is normally identified by the character 0x16. There are 8 frame types, represented by the characters 0x01-0x08 (this is merely illustrative; those skilled in the art will understand that frame types can be added or adjusted as the voice functions are extended, so as to give applications more support). The content length field gives the actual length of the content in bytes; the content field carries the actual content of the frame; the check field records the check value of the frame information; and the frame tail is normally identified by the character 0x96.
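A minimal C++ sketch of assembling such a frame is given below, under two stated assumptions. First, the 2-byte content length is written low byte first; this byte order is inferred from the ListenBegin example in this description, where a 24-byte content is preceded by \x18\x00. Second, the check value is passed in by the caller, because the text does not state how it is computed. The names buildFrame, kFrameHeader and kFrameTail are hypothetical.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Field values taken from Table 1.
constexpr std::uint8_t kFrameHeader = 0x16;
constexpr std::uint8_t kFrameTail = 0x96;

// Builds one frame: header, type, 2-byte content length, content, check, tail.
// Assumptions: the length is little-endian, and the check value is supplied
// by the caller because its algorithm is not given in the text.
std::vector<std::uint8_t> buildFrame(std::uint8_t type,
                                     const std::string& content,
                                     std::uint8_t check) {
    if (type < 0x01 || type > 0x08) {  // range listed in Table 1
        throw std::invalid_argument("frame type must be in 0x01-0x08");
    }
    if (content.size() > 0xFFFF) {
        throw std::length_error("content does not fit a 2-byte length field");
    }
    std::vector<std::uint8_t> frame;
    frame.reserve(content.size() + 6);
    frame.push_back(kFrameHeader);
    frame.push_back(type);
    frame.push_back(static_cast<std::uint8_t>(content.size() & 0xFF));         // low byte
    frame.push_back(static_cast<std::uint8_t>((content.size() >> 8) & 0xFF));  // high byte
    frame.insert(frame.end(), content.begin(), content.end());
    frame.push_back(check);
    frame.push_back(kFrameTail);
    return frame;
}
```

Calling buildFrame(0x01, "ListenBegin?hisense_film", 0x03) yields a 30-byte frame whose fields match the ListenBegin example given in this description.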

In the embodiments of the invention, communication messages are provided for four categories of interaction: calls, event reporting, status queries and service stop. Each category has two kinds of message frame, so the protocol defines eight message frames in total, whose specific formats are shown in Table 2:

Table 2: Specific formats of the message frames

In implementing the present invention, the developer need not rewrite the application, much less write a new API for the application's development language and operating system. It suffices to give the application a calling interface that sends and receives the above message frames; by exchanging messages with the voice service proxy in the agreed format, the application obtains voice service support, and the invention thereby supports multiple languages and operating systems. Taking the speech recognition service as an example, the interface through which an application calls the service is as follows:

void ListenBegin(string strScene); /* start recognition */

void ListenCancel(); /* cancel recognition */

string ListenGetHeardWords(int iIndex); /* get the original text of the recognition result */

string ListenGetHeardUri(int iIndex); /* get the description string of the recognition result */

string ListenGetVersion(); /* get the version of the current recognition service */

Further, consider a concrete application scenario: the application sends a start command to the recognition module, specifying the recognition scene "Hisense film and television". The content sent over the socket is then:

{"\x16\x01", "\x18\x00", "ListenBegin?hisense_film", "\x03", "\x96"};

If the function returns normally, the returned result is as follows:

"\x16", "OK", "\x0E", "\x96"; /* call succeeded */

In this way, an application only needs to exchange TCP/IP messages to integrate the voice service. While running, it can query the processing state or stop the service at any time according to the user's needs, and after processing ends it can fetch the result when the user so indicates, which makes operation more flexible.
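The receiving side of such an exchange can be sketched as follows, again only as an illustration. The sketch assumes the Table 1 layout with a little-endian length, returns the check value without verifying it (its algorithm is not given in the text), and assumes that content has the shape "Command?argument" seen in the ListenBegin example. ParsedFrame, parseFrame, Command and splitCommand are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// A decoded frame. The check value is returned unverified because the
// text does not say how it is computed.
struct ParsedFrame {
    std::uint8_t type;
    std::string content;
    std::uint8_t check;
};

// Parses one complete frame laid out as in Table 1 (little-endian length
// assumed). Returns std::nullopt if header, tail or length is inconsistent.
std::optional<ParsedFrame> parseFrame(const std::vector<std::uint8_t>& bytes) {
    constexpr std::size_t kOverhead = 6;  // header + type + length(2) + check + tail
    if (bytes.size() < kOverhead) return std::nullopt;
    if (bytes.front() != 0x16 || bytes.back() != 0x96) return std::nullopt;
    const std::size_t length = bytes[2] | (static_cast<std::size_t>(bytes[3]) << 8);
    if (bytes.size() != length + kOverhead) return std::nullopt;
    ParsedFrame frame;
    frame.type = bytes[1];
    frame.content.assign(bytes.begin() + 4, bytes.begin() + 4 + length);
    frame.check = bytes[4 + length];
    return frame;
}

// Splits "Command?argument" content. Content without '?' is treated as a
// bare command, which is an assumption of this sketch.
struct Command {
    std::string name;
    std::string argument;
};

Command splitCommand(const std::string& content) {
    const std::size_t pos = content.find('?');
    if (pos == std::string::npos) return {content, ""};
    return {content.substr(0, pos), content.substr(pos + 1)};
}
```

Rejecting a frame whose declared length disagrees with the bytes actually received keeps a corrupted or truncated message from being dispatched as a command.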

In the embodiments of the present invention, voice applications specifically include speech recognition applications and/or speech synthesis applications, so that an ordinary smart application can offer both speech recognition services, such as voice input or voice search, and speech synthesis (read-aloud) services, such as reading out news or audio magazines.

The embodiments of the invention are further described below with reference to the drawings. As shown in Fig. 2, the method can act as a service proxy for speech recognition applications and/or speech synthesis applications at the same time, serving different application requests by identifying requests and assigning tasks in a uniform way. Applications need not repeat voice library development or attend to how the voice technology is implemented, so the method is widely applicable, highly flexible and has broad prospects. Specifically:

When a speech recognition application is served, the method steps are as follows:

The voice application sends a "speech recognition begins" request to the voice service proxy;

The voice service proxy identifies, according to the universal voice service communication protocol, that the received request is a speech recognition request;

According to the identification result, the request is assigned as a speech recognition task;

The proxy reports a "recording" message to the application, opens the microphone and starts recording; when recording ends, it reports "recognizing speech" to the application and uploads the recorded voice stream to the voice cloud server for recognition;

The proxy obtains the recognition result from the voice cloud server and reports a "recognition succeeded" message to the voice application that sent the request;

After receiving the "recognition succeeded" message, the voice application sends a "get recognition result" request to the voice service proxy, and the proxy sends the speech recognition result to the application.

If an error occurs at any stage of processing, the voice service proxy must send an error message to the application. During speech recognition, the application can also send a "cancel speech recognition" request to the voice service proxy, which then stops the related recognition process.
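The recognition steps above can be read as a small state machine. The following C++ sketch is one possible rendering, not the patented implementation: the state names, the report strings and the class RecognitionSession are hypothetical, and events that arrive out of order are simply ignored, which is an assumption of the sketch.

```cpp
#include <string>
#include <vector>

// States of one recognition session, following the listed steps.
enum class RecState { Idle, Recording, Recognizing, Succeeded, Cancelled, Failed };

class RecognitionSession {
public:
    // Application request: "speech recognition begins".
    void begin() {
        if (state_ != RecState::Idle) return;
        enter(RecState::Recording, "recording");
    }
    // Recording has ended; the voice stream goes to the cloud server.
    void recordingFinished() {
        if (state_ != RecState::Recording) return;
        enter(RecState::Recognizing, "recognizing");
    }
    // The cloud server has returned the recognized text.
    void cloudResult(const std::string& text) {
        if (state_ != RecState::Recognizing) return;
        result_ = text;
        enter(RecState::Succeeded, "recognition succeeded");
    }
    // Application request: "cancel speech recognition".
    void cancel() {
        if (active()) enter(RecState::Cancelled, "cancelled");
    }
    // A step failed: an error message must reach the application.
    void error() {
        if (active()) enter(RecState::Failed, "error");
    }
    // Application request: "get recognition result"; empty unless succeeded.
    std::string result() const {
        return state_ == RecState::Succeeded ? result_ : std::string();
    }
    RecState state() const { return state_; }
    const std::vector<std::string>& reports() const { return reports_; }

private:
    bool active() const {
        return state_ == RecState::Recording || state_ == RecState::Recognizing;
    }
    // Changes state and records the message reported to the application.
    void enter(RecState next, const std::string& report) {
        state_ = next;
        reports_.push_back(report);
    }
    RecState state_ = RecState::Idle;
    std::string result_;
    std::vector<std::string> reports_;
};
```

Because a cancelled session ignores a late cloud result, the application never receives text for a recognition it has already abandoned.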

When a speech synthesis application is served, the method steps are as follows:

The voice application obtains the text information to be synthesized and sends a "speech synthesis begins" request to the voice service proxy;

After receiving the request, the voice service proxy identifies, according to the universal voice service communication protocol, that the received request is a speech synthesis request;

According to the identification result, the request is assigned as a speech synthesis task;

The proxy reports "synthesizing speech" to the application and begins receiving the text information; it sends the text information to the voice cloud server, which calls the speech synthesis engine to synthesize it and generates the voice stream corresponding to the text information;

The proxy obtains the voice stream from the voice cloud server and calls the corresponding playback interface to play it through the loudspeaker or earphone;

The proxy reports the "playback progress" to the application, for example which sentence or which word is currently being played.

If an error occurs at any stage of processing, the voice service proxy must send an error message to the application. During speech synthesis, the application can also send a "cancel speech synthesis" request to the voice service proxy, which then stops the related synthesis process.
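A hedged C++ sketch of the synthesis flow follows. It assumes, for simplicity, that the text arrives already split into sentences and that each sentence is synthesized and played in turn so that progress can be reported per sentence; the description itself does not prescribe this granularity. The cloud server, the playback device, the message channel and the cancel request are all represented by callbacks, and every name (SynthesisHooks, speak, the report strings) is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using AudioStream = std::vector<unsigned char>;

// Callbacks standing in for the parts the proxy talks to.
struct SynthesisHooks {
    std::function<AudioStream(const std::string&)> synthesize;  // voice cloud server
    std::function<void(const AudioStream&)> play;               // loudspeaker or earphone
    std::function<void(const std::string&)> report;             // message to the application
    std::function<bool()> cancelled;                            // "cancel speech synthesis"
};

// Synthesizes and plays the sentences one by one, reporting progress after
// each. Returns the number of sentences actually played.
std::size_t speak(const std::vector<std::string>& sentences,
                  const SynthesisHooks& hooks) {
    hooks.report("synthesizing");
    std::size_t played = 0;
    for (const std::string& sentence : sentences) {
        if (hooks.cancelled()) {
            hooks.report("cancelled");
            break;
        }
        const AudioStream audio = hooks.synthesize(sentence);
        if (audio.empty()) {  // this sketch treats an empty stream as a failure
            hooks.report("error");
            break;
        }
        hooks.play(audio);
        ++played;
        hooks.report("progress " + std::to_string(played) + "/" +
                     std::to_string(sentences.size()));
    }
    return played;
}
```

Checking the cancel flag between sentences is what lets a "cancel speech synthesis" request take effect without waiting for the whole text to finish.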

To further improve the stability of the method in the embodiments of the invention, the voice application and the voice service proxy communicate over the Socket-based TCP protocol in an asynchronous control mode. This prevents the voice application from blocking, so that it can promptly handle remote-control key events and the messages reported by the voice service proxy and promptly refresh the UI and update prompts. Voice application development becomes more flexible, and the user gets a better control experience.
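One common way to obtain this non-blocking behaviour on the application side is a thread-safe inbox: a socket-reading thread pushes the messages reported by the proxy, and the UI loop polls the inbox without ever waiting. The C++ sketch below shows only that inbox; it is an assumed design, and MessageInbox and its methods are hypothetical names.

```cpp
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <utility>

// Thread-safe inbox for messages reported by the voice service proxy.
class MessageInbox {
public:
    // Called by the socket-reading thread for every message received.
    void push(std::string message) {
        std::lock_guard<std::mutex> lock(mutex_);
        messages_.push(std::move(message));
    }

    // Called by the UI loop. Returns the oldest pending message, or
    // std::nullopt immediately if none is waiting; it never blocks.
    std::optional<std::string> tryPop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (messages_.empty()) return std::nullopt;
        std::string message = std::move(messages_.front());
        messages_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::queue<std::string> messages_;
};
```

Between two calls to tryPop() the UI loop remains free to handle remote-control keys and redraw prompts.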

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium may be a ROM/RAM, a magnetic disk, an optical disc, or the like.

In another aspect, the embodiments of the present invention also provide a voice service proxy device. As shown in Fig. 3, the voice service proxy device specifically comprises:

a request identification unit, configured to identify the received request according to the universal voice service communication protocol;

a speech recognition task implementation unit, which handles the speech recognition task according to the task assignment result: it reports a "recording" message to the application, opens the microphone and starts recording; when recording ends, it reports "recognizing speech" to the application and uploads the recorded voice stream to the voice cloud server for recognition; it then obtains the recognition result from the voice cloud server and reports a "recognition succeeded" message to the application; and/or

a speech synthesis task implementation unit, which handles the speech synthesis task according to the task assignment result: it reports "synthesizing speech" to the application and begins receiving the text information; it sends the text information to the voice cloud server, which calls the speech synthesis engine to synthesize it and generates the voice stream corresponding to the text information; it then obtains the voice stream and calls the corresponding playback interface to play it through the loudspeaker or earphone; and

a result feedback unit, configured to send the processing result to the voice application. Specifically, in speech recognition, after the voice application receives the "recognition succeeded" message it can send a "get recognition result" request to the voice service proxy, and the result feedback unit sends the speech recognition result to the application; in speech synthesis, the unit reports the "playback progress" to the application, for example which sentence or which word is currently being played.

Further, through an interface unit the device provides ordinary applications with the integrated calling interface of the voice service proxy, so that an ordinary application becomes a voice application through simple integration.

In addition, the device exchanges message responses with the voice application in asynchronous control mode through a response unit. A device invocation unit in the device invokes different external devices for different voice requests: specifically, on receiving a request from a speech recognition application, the device invocation unit invokes the microphone to obtain the voice stream data to be recognized; on receiving a request from a speech synthesis application, it invokes the loudspeaker or earphone to play the synthesized voice stream after the voice cloud server has completed the synthesis.

Further, the device sends error messages back to the voice application through an error feedback unit, and, through a processing stop unit, responds at any time to a request from the voice application to stop the processing of any unit.

In yet another aspect, the embodiments of the present invention also provide a system for integrating voice applications through a proxy. As shown in Fig. 4, the system specifically comprises at least one voice application device (a speech recognition application device and/or a speech synthesis application device), a voice service proxy device and a voice cloud server.

The voice application device is a device that provides concrete applications to the user, such as a flat-panel TV, a mobile phone, a portable mobile device or a personal computer. It can receive a user's speech recognition and/or speech synthesis request and send the request to the voice service proxy device through the integrated unified calling interface.

The voice service proxy device is a device that interacts with the at least one voice application device and with the voice cloud server through the universal voice service communication protocol. It may be an independent server in the network, may be integrated with the voice application device, or may be split into several devices that provide the service at different server nodes in the network. The voice service proxy device identifies the received speech recognition and/or speech synthesis request, dispatches the request to the corresponding at least one task implementation unit, and, through the interaction of the task implementation unit with the voice cloud server, obtains the processing result of the request and returns it to the at least one voice application device.

The voice cloud server is the device that holds the voice library and actually performs the voice service processing. It may be a single independent server in the network, or a set of servers providing different voice services; the voice library may be provided by a separate server or integrated with one of the voice servers. Preferably, the voice cloud server consists of multiple server nodes randomly distributed across the network, one of which is selected according to a certain scheduling policy to handle a given voice application, thereby guaranteeing quality of service. The voice cloud server performs speech recognition and/or speech synthesis processing on the data sent by the corresponding at least one task implementation unit and returns the processing result to the voice service proxy device.
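The description leaves the scheduling policy open. Purely as an example of what such a policy could look like, the C++ sketch below picks the online node with the fewest active sessions; the least-loaded rule, the CloudNode fields and the function pickNode are assumptions of this sketch and are not taken from the text.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical description of one voice cloud server node.
struct CloudNode {
    std::string address;
    bool online;
    int activeSessions;
};

// Returns the index of the online node with the fewest active sessions,
// or std::nullopt when no node is online. Ties go to the earliest node.
std::optional<std::size_t> pickNode(const std::vector<CloudNode>& nodes) {
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (!nodes[i].online) continue;
        if (!best || nodes[i].activeSessions < nodes[*best].activeSessions) {
            best = i;
        }
    }
    return best;
}
```

Other policies, such as round-robin or selection by network distance, would fit the same interface.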

In the embodiments of the invention, existing voice libraries and voice services can be fully used without complex secondary development. The application only has to send command requests to the voice service proxy over the standard TCP protocol and then receive and process, over the same protocol, the messages the proxy reports. The voice service proxy identifies requests and assigns tasks in a uniform way, so it can adaptively dispatch and process different types of application request and is highly adaptable and flexible. The parts that pose the greatest difficulty for application development, such as the hardware platform, the development environment and the concrete implementation of speech recognition and speech synthesis, are all handed to the voice service proxy and forwarded to the voice cloud server for unified processing. The voice cloud server can use a single voice library and only has to interact with the outside through the universal voice service communication protocol, which avoids repeated development of voice libraries and services. On the application side, because TCP is a standard protocol independent of platform and language, application development on an embedded platform bypasses the integration barriers of platform, operating system, speech recognition SDK and speech synthesis SDK. Integrating voice functions is then no longer a burden on application development but an efficient way to raise the value of an application.

In this way, the present invention proposes a protocol and interface technique for different application scenarios and application software. With the relevant interface, voice function support can be added to an ordinary application very quickly, with no speech engine to maintain, and extending existing applications with new voice technology also becomes very easy. Specifically, through the service proxy, all applications can use speech recognition and/or speech synthesis. In the present invention, an application need not be concerned at all with how the voice technology is implemented; it only has to call the simple unified interface of the voice service proxy and exchange messages with the proxy to obtain voice service support and gain voice functions. Voice processing is thus integrated simply and conveniently into ordinary applications, with no need to develop a separate voice library for each application, which greatly reduces wasted resources and improves development efficiency.

The above embodiments only illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, whose actual scope of protection is defined by the claims.

Claims

1. A voice service proxy method, characterized in that the method comprises the steps of:

S1: receiving a voice request sent by a voice application;

2. The method according to claim 1, characterized in that a unified calling interface of the voice service proxy is integrated into an ordinary application to form the voice application.

3. The method according to claim 1, characterized in that the interaction with the voice application adopts an asynchronous control mode and uses the Socket-based TCP protocol for message responses.

4. The method according to claim 1, characterized in that, when step S2 identifies the voice request as a speech recognition request, in step S3 the speech recognition request is assigned as a speech recognition task according to the identification result;

in step S4, the microphone is invoked to record and obtain the voice stream data to be recognized; the obtained voice stream data is sent to the voice cloud server for speech recognition processing; and, after processing is complete, the text information recognized from the voice stream is obtained from the voice cloud server;

in step S5, the text information is returned to the voice application that sent the speech recognition request, and the voice application displays the text information or performs the operation corresponding to the text information.

5. The method according to claim 1, characterized in that, when step S2 identifies the voice request as a speech synthesis request, in step S3 the speech synthesis request is assigned as a speech synthesis task according to the identification result;

in step S4, the text information data to be synthesized is obtained from the voice application; the obtained text information data is sent to the voice cloud server for speech synthesis processing; and, after processing is complete, the voice stream into which the text information has been converted is obtained from the voice cloud server and the loudspeaker or earphone is invoked to play the synthesized voice stream;

in step S5, the playback progress is returned to the voice application that sent the speech synthesis request.

6. A voice service proxy device, characterized in that the device comprises:

7. The device according to claim 6, characterized in that the device further comprises an interface unit, configured to provide ordinary applications with the integrated calling interface of the voice service proxy, so that an ordinary application forms the voice application.

8. The device according to claim 6, characterized in that the device further comprises a response unit, configured to exchange message responses with the voice application in an asynchronous control mode using the Socket-based TCP protocol.

9. The device according to claim 6, characterized in that, when the request identification unit identifies the voice request as a speech recognition request, the task assignment unit dispatches the speech recognition request to a speech recognition task implementation unit;

wherein the speech recognition task implementation unit is configured to invoke the microphone to record and obtain the voice stream data to be recognized, to send the obtained voice stream data to the voice cloud server for speech recognition processing, and, after processing is complete, to obtain from the voice cloud server the text information recognized from the voice stream.

10. The device according to claim 6, characterized in that, when the request identification unit identifies the voice request as a speech synthesis request, the task assignment unit dispatches the speech synthesis request to a speech synthesis task implementation unit;

wherein the speech synthesis task implementation unit is configured to obtain the text information data to be synthesized from the voice application, to send the obtained text information data to the voice cloud server for speech synthesis processing, and, after processing is complete, to obtain from the voice cloud server the voice stream synthesized from the text information and to invoke the loudspeaker or earphone to play the synthesized voice stream.

11. A system for integrating voice applications through a proxy, characterized in that the system comprises: