CN102427465B - Voice service proxy method and device and system for integrating voice application through proxy

Info

Publication number: CN102427465B
Application number: CN201110238202.6A
Authority: CN
Inventors: 朱敏
Original assignee: Qingdao Hisense Electronics Co Ltd
Current assignee: Qingdao Hisense Electronics Co Ltd
Priority date: 2011-08-18
Filing date: 2011-08-18
Publication date: 2015-05-13
Anticipated expiration: 2031-08-18
Also published as: CN102427465A

Abstract

The invention provides a voice service proxy method and device, and a system for integrating voice applications through a proxy. The voice service proxy method comprises the following steps: receiving a voice request sent by a voice application; identifying the received voice request according to a general voice service communication protocol; assigning a task for the voice request according to the identification result; acquiring the data corresponding to the voice request according to the task assignment result; sending the acquired data to a voice cloud server, which processes it according to the voice request, and acquiring the processing result from the voice cloud server once processing is complete; and returning the processing result to the voice application that sent the voice request. With the invention, an application need not be concerned with how the voice technology is implemented: it can add voice functions merely by invoking a simple interface and exchanging messages with the voice service proxy. Voice processing is thus integrated into ordinary applications simply and conveniently, and no separate speech library has to be developed for each application.

Description

Voice service proxy method and device, and system for integrating voice applications through a proxy

Technical field

The present invention relates to the technical field of human-computer interaction, and in particular to a voice service proxy method and device, and a system for integrating voice applications through a proxy.

Background art

The rapid development of computer technology has made it possible to miniaturize and integrate devices that carry a rich set of applications. The smart devices that have emerged as a result bring great convenience to daily life. For these devices, the convenience and ease of use of the human-computer interaction between user and device has become a major criterion for evaluating the device.

Take flat-panel televisions as an example. As the trend toward networking and intelligence becomes more pronounced and users face massive amounts of audio, video, and other information, human-computer interaction becomes ever more important. Input methods are an important part of that interaction and matter greatly to the user experience, yet entering Chinese text on existing televisions is very cumbersome. Speech recognition has therefore been introduced, with voice input serving as an auxiliary means of Chinese input; letting users experience voice technology with little effort improves the user experience of an application. The prior art also uses speech synthesis so that users can not only read news on the television but also listen to it, which further improves the experience.

In the prior art, however, developers of speech recognition or speech synthesis technology tend to concentrate on the recognition and synthesis techniques themselves, or offer those techniques only in applications they develop themselves, and seldom consider the generality of the technology.

In the course of making the present invention, the inventor found at least the following problems in the prior art. New applications are constantly being developed for today's smart devices. The smart platform of an existing flat-panel television, for example, may run not only the Linux operating system but also the Android operating system, and each of these operating systems may host applications written in different languages, such as C/C++, Java, or JSP. Many of these applications could offer speech recognition or speech synthesis services to make them more convenient and improve the user experience. Because of the differences in operating system and development language, however, separately developed applications are usually incompatible with one another, and the speech recognition or speech synthesis services they provide are usually built on separately developed speech libraries.

To integrate speech recognition or speech synthesis into one application using a speech library that already exists for another, the library must undergo secondary development for the target application's operating system and development language. Developers must then not only understand the interfaces for voice capture, speech recognition, speech synthesis, and audio playback, but also port them to the different development environment of each application. This makes secondary development very difficult, no easier than designing a brand-new application system. Redesigning the application software from scratch with voice functions, on the other hand, duplicates the work already invested in the existing speech library services, and wastes system resources, manpower, and time while lowering development efficiency.

Summary of the invention

(1) Technical problem to be solved

In view of the above shortcomings, and to solve the prior-art problem that voice functions cannot be integrated into applications quickly, the present invention provides a voice service proxy method and device and a system for integrating voice applications through a proxy, so that application software developed in different languages on different systems can conveniently use the same speech library and be given the same speech recognition and/or speech synthesis services.

(2) Technical solution

To solve the above technical problem, in one aspect the present invention provides a voice service proxy method comprising the steps of:

S1, receiving a voice request sent by a voice application;

S2, identifying the received voice request according to a general voice service communication protocol;

S3, assigning a task for the voice request according to the identification result;

S4, obtaining the data corresponding to the voice request according to the task assignment result; sending the obtained data to a speech cloud server, which processes the data according to the voice request; and obtaining the processing result from the speech cloud server once processing is complete;

S5, returning the processing result to the voice application that sent the voice request.

In another aspect, the present invention also provides a voice service proxy device comprising:

a request receiving unit, for receiving a voice request sent by a voice application;

a request identification unit, for identifying the received voice request according to the general voice service communication protocol;

a task assignment unit, for assigning a task for the voice request according to the identification result;

a task implementation unit, for obtaining the data corresponding to the voice application according to the task assignment result, sending the obtained data to a speech cloud server for processing according to the voice request, and obtaining the processing result from the speech cloud server once processing is complete;

a result feedback unit, for returning the processing result to the voice application that sent the voice request.

In yet another aspect, the present invention also provides a system for integrating voice applications through a proxy, comprising:

at least one voice application device, which receives a user's speech recognition and/or speech synthesis request and sends the request to the voice service proxy device through an integrated unified calling interface;

a voice service proxy device, which identifies the received speech recognition and/or speech synthesis request by means of the integrated general voice service communication protocol, assigns the request to at least one corresponding task implementation unit, interacts with the speech cloud server through that task implementation unit, obtains the processing result of the request, and returns it to the at least one voice application device;

a speech cloud server, which performs speech recognition and/or speech synthesis processing on the data sent by the at least one corresponding task implementation unit and returns the processing result to the voice service proxy device.

(3) Beneficial effects

With the above technical solution, every application can use speech recognition and/or speech synthesis through the voice service proxy. An application need not be concerned with how the voice technology is implemented; it only needs to call the simple interface of the voice service proxy to gain voice functions. Voice processing is thus integrated into ordinary applications simply and conveniently, without developing a separate speech library for each application.

Brief description of the drawings

Fig. 1 is a schematic diagram of the general processing flow of the voice service proxy method in an embodiment of the present invention;

Fig. 2 is a flow diagram of the voice service proxy method processing a speech recognition request and/or a speech synthesis request in an embodiment of the present invention;

Fig. 3 is a unit structure diagram of the voice service proxy device in an embodiment of the present invention;

Fig. 4 is an architecture diagram of the system for integrating voice applications through a proxy in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

In an embodiment of the present invention, applications are provided with a general voice service communication protocol that runs over TCP/IP and is built on the Socket interface. Whether an application is developed in C/C++ or in Java/JSP, it can use speech recognition and speech synthesis as long as it communicates with the voice service proxy over TCP/IP according to this protocol and exchanges messages to obtain the proxy's support. The application therefore need not consider how the voice functions are implemented or how the speech cloud server is reached; it only has to send and receive TCP/IP messages to and from the voice service proxy as the protocol prescribes. All voice functions are carried out by the voice service proxy. The user simply sends a command message in the protocol format to tell the proxy what the application wants done; the proxy identifies the command, performs the corresponding function, and returns the result. This general voice service communication protocol is what allows the voice service proxy device to identify and process voice requests: it determines the type of a request from the protocol format, processes the request according to its type to obtain a result, and then notifies the application of the result through a TCP/IP message defined by the protocol.

Specifically, as shown in Fig. 1, the general steps of the voice service proxy method in the embodiment of the present invention are: receiving a voice request sent by a voice application; identifying the received voice request according to the general voice service communication protocol; assigning a task for the voice request according to the identification result; obtaining the data corresponding to the voice request according to the task assignment result; sending the obtained data to the speech cloud server, which processes it according to the voice request, and obtaining the processing result from the speech cloud server once processing is complete; and returning the processing result to the voice application that sent the voice request.
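The general steps can be sketched as a small dispatch routine. This is only an illustration under stated assumptions, not the implementation described in the patent: the names `handle_request`, `HANDLERS`, `recognition_task`, `synthesis_task`, and the two `fake_cloud_*` functions are hypothetical, and a request is represented here as a plain command string rather than a protocol frame.

```python
# Hypothetical stand-ins for the speech cloud server; a real proxy would
# make network calls here.
def fake_cloud_recognize(audio: bytes) -> str:
    return f"<text recognized from {len(audio)} bytes>"


def fake_cloud_synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are audio


def recognition_task(payload: str) -> str:
    audio = b"\x00" * 16                # obtain data (here: a dummy recording)
    return fake_cloud_recognize(audio)  # let the cloud process it


def synthesis_task(payload: str) -> str:
    audio = fake_cloud_synthesize(payload)
    return f"synthesized {len(audio)} bytes"


# Task assignment table: request kind -> task implementation (assumed names).
HANDLERS = {"recognize": recognition_task, "synthesize": synthesis_task}


def handle_request(request: str) -> str:
    kind, _, payload = request.partition(" ")  # identify the request
    task = HANDLERS.get(kind)                  # assign the task
    if task is None:
        return "ERROR unknown request"
    return task(payload)                       # process and return the result
```

Keeping the assignment in a lookup table means a new request type only needs a new entry, which mirrors the description's point that requests are identified and assigned uniformly.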

In an embodiment of the present invention, the user application module and the voice service proxy communicate over a local Socket according to the general voice service communication protocol. After starting, the voice service proxy acts as the TCP server and waits for the user application module to connect. The user application module and the voice service proxy maintain a long-lived connection; unless otherwise agreed, the voice service proxy listens on port 20000.
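A minimal sketch of this arrangement, assuming Python's standard `socketserver` module: the proxy listens on the default port 20000 and keeps each connection open until the application disconnects. The names `ProxyHandler`, `ProxyServer`, `start_proxy`, and `exchange` are hypothetical, and the `ACK:` reply is a placeholder for real protocol handling.

```python
import socket
import socketserver
import threading
from typing import List

DEFAULT_PORT = 20000  # default listening port stated in the description


class ProxyHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        # Long-lived connection: keep serving until the application disconnects.
        while True:
            data = self.request.recv(4096)
            if not data:
                break
            # Placeholder: a real proxy would parse a protocol frame here.
            self.request.sendall(b"ACK:" + data)


class ProxyServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True


def start_proxy(port: int = DEFAULT_PORT) -> ProxyServer:
    """Start the proxy as a TCP server on the local machine, in the background."""
    server = ProxyServer(("127.0.0.1", port), ProxyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def exchange(port: int, messages: List[bytes]) -> List[bytes]:
    """Application side: send several messages over ONE connection."""
    replies = []
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        for msg in messages:
            conn.sendall(msg)
            replies.append(conn.recv(4096))
    return replies
```

Passing port 0 to `start_proxy` lets the operating system pick a free port, which is convenient when trying the sketch without reserving port 20000.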

The general message frame of the general voice service communication protocol is shown in Table 1. A frame consists of a frame header, frame type, content length, content, check field, and frame trailer. Their lengths are: frame header 1 byte, frame type 1 byte, content length 2 bytes, content n bytes (depending on the actual content), check field 1 byte, and frame trailer 1 byte:

Name            Frame header   Frame type   Content length   Content   Check         Frame trailer
Length (bytes)  1              1            2                n         1             1
Content         0x16           0x01-0x08    n                xxxxxx    Check value   0x96

Table 1 Frame format

Specifically, the frame header is normally identified by the byte 0x16. There are eight frame types, represented by the bytes 0x01 to 0x08. (This is only an example; those skilled in the art will understand that frame types can be added or adjusted as voice functions are extended, so as to give applications more support.) The content length field gives the actual length of the content in bytes; the content field carries the actual content of the frame; the check field carries the check value of the frame; and the frame trailer is normally identified by the byte 0x96.
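The frame layout can be illustrated with a short encoder and decoder. This is a sketch under two assumptions that the description does not state outright. First, the 2-byte content length is treated as little-endian, which matches the sample request in this description, where a 24-byte content is sent as 0x18 0x00. Second, the algorithm for the check value is not specified, so the caller supplies it. The function names `build_frame` and `parse_frame` are hypothetical.

```python
import struct

FRAME_HEAD = 0x16
FRAME_TAIL = 0x96


def build_frame(frame_type: int, content: bytes, check: int) -> bytes:
    """Assemble one frame; the check value is supplied by the caller."""
    if not 0x01 <= frame_type <= 0x08:
        raise ValueError("frame type must be in 0x01-0x08")
    # "<BBH": header (1 byte), type (1 byte), content length (2 bytes,
    # little-endian by assumption).
    header = struct.pack("<BBH", FRAME_HEAD, frame_type, len(content))
    return header + content + bytes([check, FRAME_TAIL])


def parse_frame(frame: bytes):
    """Split one complete frame into (frame_type, content, check)."""
    head, frame_type, length = struct.unpack_from("<BBH", frame, 0)
    if head != FRAME_HEAD or frame[-1] != FRAME_TAIL:
        raise ValueError("bad frame delimiters")
    if len(frame) != 4 + length + 2:
        raise ValueError("length field does not match frame size")
    content = frame[4:4 + length]
    check = frame[4 + length]
    return frame_type, content, check
```

Because the check algorithm is left open here, `parse_frame` returns the check value without validating it; a real implementation would verify it against whatever rule the protocol defines.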

The embodiment of the present invention provides messages for four kinds of interaction: calls, event reports, status queries, and service stops. Each kind has two message frames, and the concrete formats of all eight message frames of the protocol are shown in Table 2:

Table 2 Concrete formats of the message frames

When implementing the present invention, developers do not need to rewrite the application, let alone rewrite an API for the application's development language and the operating system it depends on. They only need to give the application a calling interface that sends and receives the above message frames; by exchanging messages with the voice service proxy in the format the protocol prescribes, the application obtains voice service support. The present invention thereby supports multiple languages and operating systems. Taking the speech recognition service as an example, the interface through which an application calls the service is as follows:

void ListenBegin(string strScene); /* start recognition */

void ListenCancel(); /* cancel recognition */

string ListenGetHeardWords(iIndex); /* get the original text of the recognition result */

string ListenGetHeardUri(iIndex); /* get the descriptor of the recognition result */

string ListenGetVersion(); /* get the current recognition service version */

Further, consider a concrete scenario: the application sends a start command to the recognition module and specifies the recognition scene as "Hisense Film". The content transmitted over the socket is then:

{"\x16\x01", "\x18\x00", "ListenBegin hisense_film", "\x03", "\x96"};

If the function returns normally, the returned result is:

{"\x16\x02\x02\x00", "OK", "\x0E", "\x96"}; /* call succeeded */

In this way, an application integrates the voice service simply by exchanging TCP/IP messages. While it runs, it can query the processing status or stop the service at any time as the user requires, and after processing ends it can fetch the result when the user asks for it, which makes operation quite flexible.
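Because the connection is long-lived, frames may arrive split across reads or back to back in a single read. The following sketch shows one way a receiver could cut the byte stream into frames. It assumes a little-endian length field, as suggested by the sample request, and it does not validate the check value because the description does not define how it is computed. The class name `FrameReader` is hypothetical.

```python
FRAME_HEAD, FRAME_TAIL = 0x16, 0x96


class FrameReader:
    """Accumulates received bytes and yields complete protocol frames."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Add received bytes; return a list of (type, content, check) tuples."""
        self._buf.extend(data)
        frames = []
        while len(self._buf) >= 4:
            if self._buf[0] != FRAME_HEAD:
                raise ValueError("stream out of sync: missing 0x16 frame header")
            length = int.from_bytes(self._buf[2:4], "little")
            total = 4 + length + 2  # header fields + content + check + trailer
            if len(self._buf) < total:
                break  # incomplete frame: wait for more bytes
            if self._buf[total - 1] != FRAME_TAIL:
                raise ValueError("missing 0x96 frame trailer")
            frames.append((self._buf[1],
                           bytes(self._buf[4:4 + length]),
                           self._buf[total - 2]))
            del self._buf[:total]
        return frames
```

Feeding the sample response in two pieces followed by the sample request would first return an empty list and then both frames in order.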

In an embodiment of the present invention, voice applications include speech recognition applications and/or speech synthesis applications. An ordinary smart application can therefore offer speech recognition services such as voice input or voice search alongside speech synthesis (read-aloud) services such as news reading or audio magazines.

The embodiments of the present invention are described further below with reference to the drawings. As shown in Fig. 2, the method can serve as a proxy for speech recognition applications and/or speech synthesis applications at the same time. Requests from different applications are handled through uniform request identification and task assignment, so applications need neither repeat the development of a speech library nor attend to how the voice technology is implemented. The method is therefore widely applicable, highly flexible, and has broad application prospects. Specifically:

For a speech recognition application, the method steps are as follows:

The voice application sends a "speech recognition start" request to the voice service proxy;

The voice service proxy identifies the received request as a speech recognition request according to the general voice service communication protocol;

According to the identification result, the request is assigned as a speech recognition task;

The proxy reports a "recording" message to the application, opens the microphone, and starts recording; when recording ends, it reports "speech recognition in progress" to the application and uploads the recorded voice stream to the speech cloud server for recognition;

The proxy obtains the recognition result from the speech cloud server and reports a "recognition succeeded" message to the voice application that sent the request;

After receiving the "recognition succeeded" message, the voice application sends a "get recognition result" request to the voice service proxy, and the proxy sends the speech recognition result to the application.

If an error occurs at any stage of the process, the voice service proxy must send an error message to the application. During speech recognition, the application can also send a "cancel speech recognition" request to the voice service proxy, which then stops the recognition process.
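The recognition flow, including error reporting and cancellation, can be sketched as a small session object. Everything here is illustrative: the class `RecognitionSession`, its method names, and the literal status strings are assumptions, and the recording and cloud calls are injected as plain callables so the sketch stays self-contained.

```python
from typing import Callable, Optional


class RecognitionSession:
    """Hypothetical proxy-side handler for one speech recognition task."""

    def __init__(self, record: Callable[[], bytes],
                 cloud_recognize: Callable[[bytes], str],
                 report: Callable[[str], None]) -> None:
        self._record = record
        self._cloud_recognize = cloud_recognize
        self._report = report
        self._cancelled = False
        self._result: Optional[str] = None

    def cancel(self) -> None:
        # Corresponds to the application's "cancel speech recognition" request.
        self._cancelled = True

    def run(self) -> None:
        try:
            self._report("recording")
            audio = self._record()
            if self._cancelled:
                self._report("cancelled")
                return
            self._report("recognizing")
            text = self._cloud_recognize(audio)
            if self._cancelled:
                self._report("cancelled")
                return
            self._result = text
            self._report("recognition succeeded")
        except Exception as exc:
            # Any failure is reported to the application as an error message.
            self._report(f"error: {exc}")

    def get_result(self) -> Optional[str]:
        # Corresponds to the application's "get recognition result" request.
        return self._result
```

In this sketch cancellation is only checked between stages; a real proxy would presumably also interrupt a recording or upload that is already in progress.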

For a speech synthesis application, the method steps are as follows:

The voice application obtains the text information to be synthesized and sends a "speech synthesis start" request to the voice service proxy;

After receiving the request, the voice service proxy identifies it as a speech synthesis request according to the general voice service communication protocol;

According to the identification result, the request is assigned as a speech synthesis task;

The proxy reports "speech synthesis in progress" to the application and starts receiving the text information; it sends the text information to the speech cloud server, which calls a speech synthesis engine to generate the voice stream corresponding to the text information;

The proxy obtains the voice stream from the speech cloud server and calls the corresponding playback interface to play it through a loudspeaker or earphones;

The proxy reports the "playback progress" to the application, such as which sentence or which word is currently being played.

If an error occurs at any stage of the process, the voice service proxy must send an error message to the application. During speech synthesis, the application can also send a "cancel speech synthesis" request to the voice service proxy, which then stops the synthesis process.
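A comparable sketch for the synthesis flow is shown below. It assumes, purely for illustration, that the text is synthesized and played sentence by sentence so that progress can be reported per sentence; the description only says that progress such as the current sentence or word is reported. The function name `run_synthesis` and the status strings are hypothetical.

```python
import re
from typing import Callable


def run_synthesis(text: str,
                  cloud_synthesize: Callable[[str], bytes],
                  play: Callable[[bytes], None],
                  report: Callable[[str], None]) -> bool:
    """Synthesize and play text sentence by sentence, reporting progress."""
    report("synthesizing")
    # Split after sentence-ending punctuation (an assumed, simplistic rule).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    try:
        for index, sentence in enumerate(sentences, start=1):
            audio = cloud_synthesize(sentence)  # the cloud returns the voice stream
            play(audio)                         # loudspeaker or earphones
            report(f"progress {index}/{len(sentences)}")
    except Exception as exc:
        report(f"error: {exc}")
        return False
    return True
```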

To further improve the stability of the method in the embodiment of the present invention, the voice application and the voice service proxy communicate over the Socket-based TCP protocol in an asynchronous control mode. This keeps the voice application from blocking, so that it can promptly handle remote-control key messages and the messages reported by the voice service proxy, refresh the UI, and update prompts in time. Voice application development becomes more flexible, and the user gets a better control experience.
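The benefit of asynchronous control can be shown with Python's `asyncio` streams: the application starts listening for proxy reports and remains free to handle other events in the meantime. This sketch simplifies the protocol to newline-terminated text instead of binary frames, and `listen_for_reports`, `fake_proxy`, and `demo` are hypothetical names.

```python
import asyncio


async def listen_for_reports(reader: asyncio.StreamReader, on_report) -> None:
    """Deliver each newline-terminated report from the proxy as it arrives."""
    while True:
        line = await reader.readline()
        if not line:  # the proxy closed the connection
            break
        on_report(line.decode().rstrip("\n"))


async def demo() -> list:
    log = []

    async def fake_proxy(reader, writer):
        # Hypothetical proxy: pushes two status reports with a pause between.
        for status in ("recording", "recognizing"):
            writer.write(status.encode() + b"\n")
            await writer.drain()
            await asyncio.sleep(0.05)
        writer.close()

    server = await asyncio.start_server(fake_proxy, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    listener = asyncio.create_task(
        listen_for_reports(reader, lambda msg: log.append("report:" + msg)))
    # The application is not blocked: it can handle a key press right away.
    log.append("ui:key handled")
    await listener
    writer.close()
    server.close()
    return log
```

Running `asyncio.run(demo())` records the key press before either report, because the listener task only runs once the application yields control.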

A person of ordinary skill in the art will appreciate that all or part of the steps of the method in the above embodiments can be carried out by hardware under the instruction of a program. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method in the above embodiments. The storage medium may be a ROM/RAM, a magnetic disk, an optical disc, or the like.

In another aspect, an embodiment of the present invention also provides a voice service proxy device. As shown in Fig. 3, the voice service proxy device comprises:

a request identification unit, for identifying the received request according to the general voice service communication protocol;

a speech recognition task implementation unit, which processes the speech recognition task according to the task assignment result: it reports a "recording" message to the application, opens the microphone, and starts recording; when recording ends, it reports "speech recognition in progress" to the application and uploads the recorded voice stream to the speech cloud server for recognition; it then obtains the recognition result from the speech cloud server and reports a "recognition succeeded" message to the application; and/or

a speech synthesis task implementation unit, which processes the speech synthesis task according to the task assignment result: it reports "speech synthesis in progress" to the application and starts receiving the text information; it sends the text information to the speech cloud server, which calls a speech synthesis engine to generate the voice stream corresponding to the text information; it then obtains the voice stream and calls the corresponding playback interface to play it through a loudspeaker or earphones; and

a result feedback unit, for sending the processing result to the voice application. Specifically, for speech recognition, after receiving the "recognition succeeded" message the voice application can send a "get recognition result" request to the voice service proxy, and the result feedback unit sends the speech recognition result to the application; for speech synthesis, the unit reports the "playback progress" to the application, such as which sentence or which word is currently being played.

Further, the device provides ordinary applications with the integrated calling interface of the voice service proxy through an interface unit, so that an ordinary application becomes a voice application through simple integration.

In addition, the device exchanges response messages with the voice application in an asynchronous control mode through a response unit. An equipment calling unit in the device calls different external equipment for different voice applications. Specifically, on receiving a request from a speech recognition application, the equipment calling unit calls the microphone to obtain the voice stream data to be recognized; on receiving a request from a speech synthesis application, it calls the loudspeaker or earphones to play the synthesized voice stream after the speech cloud server has completed the synthesis.

Further, the device sends error messages back to the voice application through an error feedback unit, and through a process stopping unit it responds at any time to a request from the voice application to stop the processing of any unit.

In yet another aspect, an embodiment of the present invention also provides a system for integrating voice applications through a proxy. As shown in Fig. 4, the system comprises at least one voice application device (a speech recognition application device and/or a speech synthesis application device), a voice service proxy device, and a speech cloud server.

The voice application device is equipment that provides the user with concrete applications, such as a flat-panel television, mobile phone, portable mobile device, or personal computer. It receives the user's speech recognition and/or speech synthesis request and sends the request to the voice service proxy device through the integrated unified calling interface.

The voice service proxy device is equipment that interacts with the at least one voice application device and with the speech cloud server through the general voice service communication protocol. It may be a standalone server in the network, may be integrated with the voice application device, or may be split across several different server nodes in the network. The voice service proxy device identifies the received speech recognition and/or speech synthesis request, assigns the request to at least one corresponding task implementation unit, interacts with the speech cloud server through that task implementation unit, obtains the processing result of the request, and returns it to the at least one voice application device.

The speech cloud server is the equipment that actually performs the voice processing and holds the speech library. It may be a standalone server in the network or a set of servers providing different voice services; the speech library may be provided by a separate server or integrated with one of the voice servers. Preferably, the speech cloud server consists of multiple server nodes distributed across the network, and one of the nodes is selected according to a scheduling strategy to process a given voice application request, thereby ensuring quality of service. The speech cloud server performs speech recognition and/or speech synthesis processing on the data sent by the at least one corresponding task implementation unit and returns the processing result to the voice service proxy device.
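The description leaves the scheduling strategy open. As one possible illustration, the sketch below picks the node with the fewest active jobs; this least-loaded rule, the `CloudNode` class, and the `pick_node` and `dispatch` functions are assumptions and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CloudNode:
    name: str
    active_jobs: int = 0


def pick_node(nodes: List[CloudNode]) -> CloudNode:
    """Least-loaded strategy: choose the node with the fewest active jobs."""
    if not nodes:
        raise ValueError("no speech cloud nodes available")
    return min(nodes, key=lambda node: node.active_jobs)


def dispatch(nodes: List[CloudNode]) -> str:
    """Assign one job to the chosen node and return that node's name."""
    node = pick_node(nodes)
    node.active_jobs += 1
    return node.name
```

Other strategies, such as round-robin or selection by network latency, would fit the same interface by replacing `pick_node`.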

In the embodiments of the present invention, existing speech libraries and voice services can be fully exploited without complicated secondary development. The application only has to send command requests to the voice service proxy over standard TCP and then receive and handle the messages the proxy reports over TCP. The voice service proxy identifies requests and assigns tasks uniformly, so it can adaptively schedule and process application requests of many different types, which gives it strong adaptability and flexibility. The parts that are hardest for an application to develop, such as the hardware platform, the development environment, and the concrete implementation of speech recognition and speech synthesis, are all handed to the voice service proxy, which forwards them to the speech cloud server for unified processing. The speech cloud server can use a single speech library and interact with the outside through the general voice service communication protocol, which avoids duplicate development of speech libraries and services. For applications, because TCP is a standard protocol independent of platform and language, application development on an embedded platform bypasses the barriers of integrating the platform, the operating system, the speech recognition SDK, and the speech synthesis SDK. Integrating voice functions is then no longer a burden on application development but an efficient way to raise the value of an application.

In this way, the present invention proposes a protocol and interface technology for different application scenarios and application software. With the relevant interface, voice function support can be added to an ordinary application very quickly, with no speech engine to maintain, and extending existing applications with new voice technology also becomes very easy. Specifically, through the service proxy every application can use speech recognition and/or speech synthesis. An application need not be concerned at all with how the voice technology is implemented; it only needs to call the simple, unified interface of the voice service proxy and exchange messages with the proxy to obtain voice service support. Voice processing is thus integrated into ordinary applications simply and conveniently, without developing a separate speech library for each application, which greatly reduces wasted resources and improves development efficiency.

The above embodiments only illustrate the present invention and do not limit it. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention, so all equivalent technical solutions also fall within the scope of the present invention, and the actual scope of protection of the present invention is defined by the claims.

Claims

1. a voice service Proxy Method, is characterized in that, described method comprises step:

S1, receives the voice request that voice application sends;

S5, returns described result to the described voice application sending described voice request;

Wherein, in common application the unified call interface of integrated speech service broker to form described voice application;

And the mutual employing asynchronous controlling mode between described voice application utilizes the Transmission Control Protocol based on Socket to carry out acknowledgement messaging;

Wherein, adopt local Socket communicate according to described universal phonetic service communication protocols between user's application module and voice service agency, the connection waiting for user's application module as the service end of TCP is upon actuation acted on behalf of in voice service; User's application module and voice service are acted on behalf of to maintain to grow and are connected, if do not do extra agreement, voice service is acted on behalf of use 20000 as listening port;

when step S2 identifies the voice request as a speech recognition request, in step S3 the speech recognition request is assigned as a speech recognition task according to the identification result;

in step S4, the microphone is called to record and obtain the voice stream data to be recognized; the obtained voice stream data is sent to the voice cloud server for speech recognition processing; and, after processing is complete, the text information recognized from the voice stream is obtained from the voice cloud server;

in step S5, the text information is returned to the voice application that sent the speech recognition request, and the voice application displays the text information or performs the operation corresponding to the text information;

when step S2 identifies the voice request as a speech synthesis request, in step S3 the speech synthesis request is assigned as a speech synthesis task according to the identification result;

in step S4, the text information data to be synthesized is obtained from the voice application; the obtained text information data is sent to the voice cloud server for speech synthesis processing; and, after processing is complete, the voice stream converted from the text information is obtained from the voice cloud server, and a loudspeaker or earphone is called to play the synthesized voice stream;

in step S5, the playback progress is returned to the voice application that sent the speech synthesis request.

2. A voice service proxy device, characterized in that the device comprises:

a result feedback unit, configured to return the processing result to the voice application that sent the voice request;

wherein the device further comprises an interface unit, configured to provide ordinary applications with an integrated calling interface of the voice service proxy, so that an ordinary application becomes the voice application;

the device further comprises a response unit, configured to exchange message responses with the voice application in an asynchronous control mode using the Socket-based TCP protocol;

when the request identification unit identifies the voice request as a speech recognition request, the task assignment unit assigns the speech recognition request to a speech recognition task implementation unit;

wherein the speech recognition task implementation unit is configured to call the microphone to record and obtain the voice stream data to be recognized, to send the obtained voice stream data to the voice cloud server for speech recognition processing, and, after processing is complete, to obtain from the voice cloud server the text information recognized from the voice stream;

when the request identification unit identifies the voice request as a speech synthesis request, the task assignment unit assigns the speech synthesis request to a speech synthesis task implementation unit;

wherein the speech synthesis task implementation unit is configured to obtain the text information data to be synthesized from the voice application, to send the obtained text information data to the voice cloud server for speech synthesis processing, and, after processing is complete, to obtain from the voice cloud server the voice stream synthesized from the text information and to call a loudspeaker or earphone to play the synthesized voice stream.

3. A system for integrating voice applications through a proxy, characterized in that the system comprises:

a voice service proxy device, which identifies the received speech recognition and/or speech synthesis request by means of the integrated common voice service communication protocol; assigns the speech recognition and/or speech synthesis request to at least one corresponding task implementation unit; interacts with the voice cloud server through the task implementation unit, obtains the processing result of the speech recognition and/or speech synthesis request, and returns it to the at least one voice application device; and, when any of the above task implementation units encounters an error, sends error information to the voice application as feedback;