CN110223694A - Speech processing method, system and device - Google Patents
Speech processing method, system and device
- Publication number
- CN110223694A CN110223694A CN201910563423.7A CN201910563423A CN110223694A CN 110223694 A CN110223694 A CN 110223694A CN 201910563423 A CN201910563423 A CN 201910563423A CN 110223694 A CN110223694 A CN 110223694A
- Authority
- CN
- China
- Prior art keywords
- speech recognition
- voice
- speech
- result
- recognition result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The embodiments of the present application disclose a speech processing method, system and device. One specific embodiment of the method includes: receiving user speech sent by a terminal device, and performing speech recognition on the user speech to obtain a speech recognition result; sending the speech recognition result to a semantic server, and receiving a reply text, returned by the semantic server, for the speech recognition result; sending the reply text to a speech synthesis server, and forwarding the reply speech received from the speech synthesis server to the terminal device. The embodiments of the present application remove the steps in which the terminal device analyzes the results returned by each server and generates the next request, which effectively saves processing time and thus shortens the reaction time of the terminal device when it interacts with a user.
Description
Technical field
The embodiments of the present application relate to the field of computer technology, in particular to the field of Internet technology, and more particularly to a speech processing method, system and device.
Background
In the related art, during a voice interaction between a user and a terminal device, the terminal device generally needs to interact with servers multiple times. Typically, the terminal device sends processing requests to a speech recognition server, a semantic recognition server and a speech synthesis server in turn, so as to interact with each of these servers.

Moreover, before sending each processing request to a server, the terminal device needs to analyze and process the previous response, which slows down its reaction speed during voice interaction with the user. In addition, the repeated communication between the terminal device and the servers also consumes a large amount of time.
Summary of the invention
The embodiments of the present application propose a speech processing method, system and device.
In a first aspect, an embodiment of the present application provides a speech processing method for a speech recognition server. The method includes: receiving user speech sent by a terminal device, and performing speech recognition on the user speech to obtain a speech recognition result; sending the speech recognition result to a semantic server, and receiving a reply text, returned by the semantic server, for the speech recognition result; sending the reply text to a speech synthesis server, and forwarding the reply speech received from the speech synthesis server to the terminal device.
In some embodiments, the speech recognition server, the semantic server and the speech synthesis server are located in the same local area network.
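As an informal sketch only (not the claimed implementation), the server-side relay described in the first aspect could look like the following; all class, function and message names here are hypothetical stand-ins for the three servers:

```python
class SemanticServer:
    """Stand-in for the semantic server: returns a reply text for a
    speech recognition result."""
    def reply(self, text: str) -> str:
        return "reply to: " + text

class SynthesisServer:
    """Stand-in for the speech synthesis server: returns reply speech
    (modeled here as bytes) for a reply text."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

def recognize(audio: bytes) -> str:
    """Stand-in for speech recognition (audio -> text)."""
    return audio.decode("utf-8")

def handle_user_speech(audio: bytes, semantic: SemanticServer,
                       synthesis: SynthesisServer) -> bytes:
    """The speech recognition server relays results between the other
    two servers itself, so the terminal device sends audio once and
    receives the reply speech back, with no intermediate round trips."""
    text = recognize(audio)                   # speech recognition result
    reply_text = semantic.reply(text)         # reply text for that result
    return synthesis.synthesize(reply_text)   # reply speech, forwarded to terminal
```

The point of the sketch is only the control flow: the terminal device never sees the intermediate results unless the server chooses to forward them.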
In some embodiments, the method further includes: in response to obtaining the speech recognition result, sending the speech recognition result to the terminal device; and, in response to receiving the reply text, sending the reply text to the terminal device.
In some embodiments, before the speech recognition result is sent to the semantic server, the method further includes: judging whether the speech recognition result is valid and related to the recognition result of the previous utterance, and generating a first judgment result, where the previous utterance and the user speech belong to the same wake-up interaction session. Sending the speech recognition result to the semantic server includes: sending the speech recognition result to the semantic server, so that the semantic server judges whether the speech recognition result meets a preset session semantic type and generates a second judgment result. Before the speech recognition result is sent to the terminal device, the method further includes: receiving the second judgment result fed back by the semantic server, and determining, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
In some embodiments, sending the speech recognition result to the terminal device includes: in response to determining that the user speech is meaningful speech, sending the speech recognition result to the terminal device.
In some embodiments, determining whether the user speech is meaningful speech based on the first judgment result and the second judgment result includes: in response to determining that at least one of the first judgment result and the second judgment result is affirmative, determining that the user speech is meaningful speech.
In some embodiments, the first judgment result and the second judgment result are expressed as numeric values: the value of the first judgment result characterizes the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizes the probability that the speech recognition result meets a preset session semantic type. Determining whether the user speech is meaningful speech based on the two judgment results includes: determining the sum of the value of the first judgment result and the value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user speech is meaningful speech.
In some embodiments, the value of the second judgment result is the maximum value among multiple candidate values determined by the semantic server using multiple preset session semantic type models.
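Assuming, purely for illustration, that both judgment values are scores on a common scale, the decision described in these embodiments reduces to a few lines (the function names and scales below are invented, not from the source):

```python
def second_judgment_value(candidate_values):
    """The semantic server scores the recognition result with multiple
    preset session semantic type models; the maximum candidate value
    becomes the value of the second judgment result."""
    return max(candidate_values)

def is_meaningful(first_value, candidate_values, threshold):
    """The user speech counts as meaningful speech when the sum of the
    two judgment values reaches the preset threshold."""
    return first_value + second_judgment_value(candidate_values) >= threshold
```

Taking the maximum means only the best-matching semantic type model needs to be confident for the utterance to pass.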
In a second aspect, an embodiment of the present application provides a speech processing apparatus for a speech recognition server. The apparatus includes: a speech recognition unit, configured to receive user speech sent by a terminal device and perform speech recognition on the user speech to obtain a speech recognition result; a text generation unit, configured to send the speech recognition result to a semantic server and receive at least one reply text, returned by the semantic server, for the speech recognition result; and a feedback unit, configured to send a reply text among the at least one reply text to a speech synthesis server, and to forward the reply speech received from the speech synthesis server to the terminal device, where the reply speech is generated based on the reply text sent to the speech synthesis server.
In some embodiments, the speech recognition server, the semantic server and the speech synthesis server are located in the same local area network.
In some embodiments, the apparatus further includes: a first sending unit, configured to send the speech recognition result to the terminal device in response to obtaining the speech recognition result; and a second sending unit, configured to send the reply text to the terminal device in response to receiving the reply text.
In some embodiments, the apparatus further includes: a judging unit, configured to judge, before the speech recognition result is sent to the semantic server, whether the speech recognition result is valid and related to the recognition result of the previous utterance, and to generate a first judgment result, where the previous utterance and the user speech belong to the same wake-up interaction session. The text generation unit includes: a first sending module, configured to send the speech recognition result to the semantic server, so that the semantic server judges whether the speech recognition result meets a preset session semantic type and generates a second judgment result. The apparatus further includes: a receiving unit, configured to receive, before the speech recognition result is sent to the terminal device, the second judgment result fed back by the semantic server, and to determine, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
In some embodiments, the first sending unit includes: a second sending module, configured to send the speech recognition result to the terminal device in response to determining that the user speech is meaningful speech.
In some embodiments, the receiving unit includes a determining module, configured to determine that the user speech is meaningful speech in response to determining that at least one of the first judgment result and the second judgment result is affirmative.
In some embodiments, the first judgment result and the second judgment result are expressed as numeric values: the value of the first judgment result characterizes the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizes the probability that the speech recognition result meets a preset session semantic type. Determining whether the user speech is meaningful speech based on the two judgment results includes: determining the sum of the value of the first judgment result and the value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user speech is meaningful speech.
In some embodiments, the value of the second judgment result is the maximum value among multiple candidate values determined by the semantic server using multiple preset session semantic type models.
In a third aspect, an embodiment of the present application provides a speech processing system, including a speech recognition server, a semantic server and a speech synthesis server. The speech recognition server is configured to: receive user speech sent by a terminal device; perform speech recognition on the user speech to obtain a speech recognition result; send the speech recognition result to the semantic server; send the reply text returned by the semantic server to the speech synthesis server; receive the reply speech, synthesized from the reply text, sent by the speech synthesis server; and send the reply speech to the terminal device.
In some embodiments, the speech recognition server, the semantic server and the speech synthesis server are located in the same local area network.
In some embodiments, the speech recognition server is further configured to send the speech recognition result to the terminal device in response to obtaining the speech recognition result, and to send the reply text to the terminal device in response to receiving the reply text.
In some embodiments, the semantic server is further configured to receive a text generation request, where the text generation request is sent to the semantic server by the terminal device in response to receiving neither the reply text nor the reply speech within a first preset time period. The text generation request includes the speech recognition result, and the first preset time period takes the moment the terminal device receives the speech recognition result as its time origin.
In some embodiments, the speech synthesis server is further configured to receive a speech synthesis request, where the speech synthesis request is sent to the speech synthesis server by the terminal device in response to receiving the reply text but not the reply speech within a second preset time period. The speech synthesis request includes the reply text, and the second preset time period takes the moment the terminal device receives the reply text as its time origin.
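A minimal sketch of the terminal-side fallback logic in these two embodiments, assuming the first period is measured from receipt of the speech recognition result and the second from receipt of the reply text (the parameter and request names are invented for illustration):

```python
from typing import Optional

def choose_fallback(elapsed_since_asr: float,
                    elapsed_since_reply_text: Optional[float],
                    have_reply_voice: bool,
                    t1: float,
                    t2: float) -> Optional[str]:
    """Decide which fallback request the terminal device should issue.

    elapsed_since_asr counts from the moment the terminal received the
    speech recognition result; elapsed_since_reply_text is None until
    the reply text arrives. t1 and t2 are the first and second preset
    time periods."""
    if have_reply_voice:
        return None  # normal path: the reply speech arrived in time
    if elapsed_since_reply_text is None:
        # Neither reply text nor reply speech yet: after t1, ask the
        # semantic server directly via a text generation request.
        return "text_generation_request" if elapsed_since_asr >= t1 else None
    # Reply text arrived but no reply speech: after t2, ask the speech
    # synthesis server directly via a speech synthesis request.
    return "speech_synthesis_request" if elapsed_since_reply_text >= t2 else None
```

The fallbacks let the terminal recover from a stalled relay without waiting indefinitely, at the cost of one extra request.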
In some embodiments, the speech recognition server is further configured to judge, before sending the speech recognition result to the semantic server, whether the speech recognition result is valid and related to the recognition result of the previous utterance, and to generate a first judgment result, where the previous utterance and the user speech belong to the same wake-up interaction session. The speech recognition server is further configured to send the speech recognition result to the semantic server; the semantic server is further configured to judge whether the speech recognition result meets a preset session semantic type and to generate a second judgment result; and the speech recognition server is further configured to receive, before sending the speech recognition result to the terminal device, the second judgment result fed back by the semantic server, and to determine, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
In some embodiments, the speech recognition server is further configured to send the speech recognition result to the terminal device in response to determining that the user speech is meaningful speech.
In some embodiments, the speech recognition server is further configured to determine that the user speech is meaningful speech in response to determining that at least one of the first judgment result and the second judgment result is affirmative.
In some embodiments, the first judgment result and the second judgment result are expressed as numeric values: the value of the first judgment result characterizes the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizes the probability that the speech recognition result meets a preset session semantic type. The speech recognition server is further configured to determine the sum of the value of the first judgment result and the value of the second judgment result, and to determine that the user speech is meaningful speech in response to determining that the sum is greater than or equal to a preset threshold.
In some embodiments, the semantic server is further configured to determine multiple candidate values using multiple preset session semantic type models, and to take the maximum among the candidate values as the value of the second judgment result.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage apparatus for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any embodiment of the speech processing method.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the method of any embodiment of the speech processing method.
According to the speech processing scheme provided by the embodiments of the present application, user speech sent by a terminal device is first received, and speech recognition is performed on it to obtain a speech recognition result. The speech recognition result is then sent to a semantic server, and a reply text, returned by the semantic server, for the speech recognition result is received. Finally, the reply text is sent to a speech synthesis server, and the reply speech received from the speech synthesis server is forwarded to the terminal device. The embodiments of the present application remove the steps in which the terminal device analyzes the results returned by each server and generates the next request, which effectively saves processing time and thus shortens the reaction time of the terminal device when it interacts with a user.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent by reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture diagram to which the present application may be applied;
Fig. 2 is a flowchart of an embodiment of the speech processing method according to the present application;
Fig. 3 is a schematic structural diagram of an embodiment of the speech processing system according to the present application;
Fig. 4 is a schematic structural diagram of an embodiment of the speech processing apparatus according to the present application;
Fig. 5 is a schematic structural diagram of a computer system adapted to implement the electronic device of the embodiments of the present application.
Detailed description of embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts relevant to the related invention are shown in the drawings.

It should be noted that, in the absence of conflict, the embodiments of the present application and the features of the embodiments may be combined with each other. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the speech processing method or speech processing apparatus of the present application may be applied.

As shown in Fig. 1, the system architecture 100 may include a terminal device 101, a network 102 and servers 103, 104, 105. The network 102 serves as a medium providing communication links between the terminal device 101 and the servers 103, 104, 105, and may include various connection types, such as wired links, wireless communication links or fiber-optic cables.
A user may use the terminal device 101 to interact with the servers 103, 104, 105 via the network 102, so as to receive or send messages and the like. Various communication client applications may be installed on the terminal device 101, such as speech processing applications, video applications, live-streaming applications, instant messaging tools, email clients and social platform software.
The terminal device 101 here may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices with a display screen, including but not limited to smartphones, tablet computers, e-book readers, laptop portable computers and desktop computers. When the terminal device 101 is software, it may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, multiple pieces of software or software modules for providing distributed services), or as a single piece of software or software module; no specific limitation is made here.
The servers 103, 104, 105 may be servers providing various services, and may include a speech recognition server, a semantic server and a speech synthesis server. In practice, the servers 103, 104, 105 may be located in the same local area network, for example as background servers providing support for the terminal device 101. A background server may analyze and process data such as the received user speech, and feed the processing result (such as the reply speech) back to the terminal device.
It should be noted that the speech processing method provided by the embodiments of the present application may be executed by the servers 103, 104, 105 or by the terminal device 101; correspondingly, the speech processing apparatus may be provided in the servers 103, 104, 105 or in the terminal device 101.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative; there may be any number of terminal devices, networks and servers according to implementation needs.
With continued reference to Fig. 2, a flow 200 of an embodiment of the speech processing method according to the present application is shown. The speech processing method includes the following steps:

Step 201: receive user speech sent by a terminal device, and perform speech recognition on the user speech to obtain a speech recognition result.
In this embodiment, the execution body of the speech processing method (for example, a server shown in Fig. 1) may receive the user speech sent by the terminal device, and may perform speech recognition on the user speech to obtain a speech recognition result. Specifically, speech recognition is the process of converting speech into corresponding text; the speech recognition result here refers to the text obtained by this conversion.
Step 202: send the speech recognition result to a semantic server, and receive a reply text, returned by the semantic server, for the speech recognition result.
In this embodiment, the execution body may send the obtained speech recognition result to the semantic server and receive the reply text returned by the semantic server. The reply text here is a reply to the above speech recognition result. Specifically, during the interaction with the user, the semantic server may analyze and process the speech recognition result to obtain a reply text used to reply to the user. In general, there is only one reply text.
In some optional implementations of this embodiment, the above method further includes:

in response to obtaining the speech recognition result, sending the speech recognition result to the terminal device; and, in response to receiving the reply text, sending the reply text to the terminal device.

In these optional implementations, the execution body may promptly send the speech recognition result to the terminal device in response to obtaining it. In this way, the terminal device can display the speech recognition result to the user in time and avoid delayed text output.

Likewise, the execution body may promptly send the reply text to the terminal device in response to determining it. In this way, the terminal device can display the reply text while playing the reply speech to the user in time.
In some optional application scenarios of these implementations, before the speech recognition result is sent to the semantic server, the method further includes: judging whether the speech recognition result is valid and related to the recognition result of the previous utterance, and generating a first judgment result, where the previous utterance and the user speech belong to the same wake-up interaction session. Sending the speech recognition result to the semantic server includes: sending the speech recognition result to the semantic server, so that the semantic server judges whether the speech recognition result meets a preset session semantic type and generates a second judgment result. Before the speech recognition result is sent to the terminal device, the method includes: receiving the second judgment result fed back by the semantic server, and determining, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
In these optional application scenarios, the execution body may evaluate the speech recognition result and generate the first judgment result, and send the speech recognition result and the first judgment result to the semantic server, so that the semantic server judges whether the speech recognition result meets a preset session semantic type; the execution body then determines whether the user speech is meaningful speech. Specifically, the execution body needs to judge both whether the speech recognition result is valid and whether it is related to the recognition result of the previous utterance; only when the speech recognition result is judged valid and related to the recognition result of the previous utterance can the first judgment result be determined to be affirmative. The user speech is an utterance issued right after the previous utterance, within the same wake-up interaction session as that utterance.
A speech recognition result being valid may mean that the result has a clear meaning and a conversation can be carried on from it. For example, the speech recognition result "how is the weather today" is valid, while a blank or contentless result is invalid. Being related to the recognition result of the previous utterance means that the semantics of the two utterances, issued one after another, are associated and semantically and logically continuous. For example, if the recognition result of the previous utterance is "how is the weather today" and the recognition result of the user speech is "the weather tomorrow", the recognition results of the two utterances are related. As another example, if the recognition result of the previous utterance is "how is the weather today" and the recognition result of the user speech is "oh", the recognition results of the two utterances are unrelated.
A preset session semantic type is a pre-configured semantic type of a session, which may also be called a vertical category. For example, the preset session semantic types may include a date type, a food type, a navigation type and so on.

The semantic server may judge in various ways whether the speech recognition result meets a preset session semantic type. For example, it may take the keywords of the speech recognition result as target keywords and look up whether the preset keywords corresponding to each preset session semantic type contain the target keywords; if so, the second judgment result is that the speech recognition result meets a preset session semantic type.
In practice, the execution body may receive the second judgment result fed back by the semantic server and, based on the first judgment result and the second judgment result, finally determine whether the user speech is meaningful speech. Meaningful speech means that the speech recognition result of the utterance is valid and related to the recognition result of the previous utterance. Whether the speech recognition result is valid and related here needs to be judged comprehensively using both the first judgment result and the second judgment result.

Specifically, the execution body may determine in various ways, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech. For example, the execution body may determine that the user speech is meaningful speech if it determines that both the first judgment result and the second judgment result are affirmative.
Optionally, sending the speech recognition result to the semantic server may include sending both the speech recognition result and the first judgment result to the semantic server. Correspondingly, the semantic server may judge, based on the first judgment result, whether the speech recognition result meets a preset session semantic type and generate the second judgment result.

For example, a lookup table characterizing the correspondence between first judgment results, speech recognition results and second judgment results may be preset; the semantic server may query this table and find the second judgment result corresponding to the first judgment result and the speech recognition result.
Here, the semantic server may feed back to the execution body not only the second judgment result but also the first judgment result, so that the execution body can promptly determine, based on the two fed-back judgment results, whether the user speech is meaningful speech.

The execution body of these implementations can generate the first judgment result and the second judgment result to determine whether the user speech is meaningful, thereby achieving a better analysis of the user speech.
In some optional cases of these application scenarios, sending the speech recognition result to the terminal device may include: in response to determining that the user speech is meaningful speech, sending the speech recognition result to the terminal device.

In these cases, if the user speech is determined to be meaningful speech, the execution body may send the speech recognition result to the terminal device; if the user speech is not meaningful speech, the execution body may discard the speech recognition result. The execution body in these cases feeds the speech recognition result back to the terminal device only when the user speech is meaningful, so sentences corresponding to meaningless utterances spoken by the user need not be displayed, which reduces invalid processing and improves the intelligence of the device.
Optionally, above-mentioned to be based on the first judging result and the second judging result, determine whether user speech is significant language
Sound, may include: in response to determine at least one of the first judging result and the second judging result be it is yes, determine user speech
For significant voice.
These implementations can flexibly determine whether the user speech is meaningful by combining the judgment result of the speech recognition server with that of the semantic server, thereby avoiding the over-filtering or under-filtering that may occur when either server makes the determination alone. For example, suppose the speech recognition result is "tomorrow" and the previous utterance was "How is the weather today?". When deciding whether this recognition result is related to that of the previous utterance, the speech recognition server may misjudge and produce a negative first judgment result, while the semantic server can still determine that the recognition result matches the weather type among the preset session semantic types.
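The OR-combination described above can be sketched as follows, assuming the two judgment results have been reduced to booleans; the function name and example values are illustrative, not part of the patent.

```python
def is_meaningful(first_judgment: bool, second_judgment: bool) -> bool:
    """User speech counts as meaningful if either server judged it so.

    first_judgment:  the recognition server's check that the result is
                     valid and related to the previous utterance.
    second_judgment: the semantic server's check that the result matches
                     a preset session semantic type.
    """
    return first_judgment or second_judgment

# "Tomorrow" after "How is the weather today?": the recognition server may
# misjudge relatedness (False), but the semantic server matches the weather
# type (True), so the utterance is still treated as meaningful.
assert is_meaningful(False, True) is True
assert is_meaningful(False, False) is False
```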
Optionally, the first judgment result and the second judgment result are expressed as numeric values: the value of the first judgment result characterizes the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizes the probability that the speech recognition result matches a preset session semantic type. Determining whether the user speech is meaningful based on the first judgment result and the second judgment result then includes: determining the sum of the value of the first judgment result and the value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user speech is meaningful speech.
Specifically, the first judgment result and the second judgment result can be presented as numeric values; the larger a value, the higher the corresponding probability, and the larger the sum of the two values. For example, suppose the preset threshold is 15 and, for one utterance of a user, the value of the first judgment result is 5 (out of a full score of 10) while the value of the second judgment result is 10 (out of a full score of 10). The sum of the two values is 15, which equals the preset threshold, so this utterance of the user can be determined to be meaningful speech.
Optionally, a weighted sum of the value of the first judgment result and the value of the second judgment result is determined, and the user speech is determined to be meaningful speech in response to the weighted sum being greater than or equal to a preset weighted threshold.
That is, rather than only summing the judgment results to decide whether the user speech is meaningful, the execution body may also weight the first judgment result and the second judgment result using their respective preset weights, and compare the resulting weighted sum against the preset weighted threshold to determine whether the user speech is meaningful speech.
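The plain-sum and weighted-sum decision rules described above can be sketched as follows; the threshold, weights, and helper names are illustrative stand-ins, since the patent leaves the concrete values as presets.

```python
def is_meaningful_by_sum(score1: float, score2: float,
                         threshold: float) -> bool:
    # Sum of the two judgment-result values against the preset threshold.
    return score1 + score2 >= threshold

def is_meaningful_by_weighted_sum(score1: float, score2: float,
                                  w1: float, w2: float,
                                  weighted_threshold: float) -> bool:
    # Weighted variant: each judgment result has its own preset weight.
    return w1 * score1 + w2 * score2 >= weighted_threshold

# The worked example from the text: first value 5 (out of 10), second
# value 10 (out of 10), preset threshold 15 -> 5 + 10 == 15, meaningful.
assert is_meaningful_by_sum(5, 10, 15)
# Illustrative weights 2 and 1 with weighted threshold 20: 2*5 + 10 == 20.
assert is_meaningful_by_weighted_sum(5, 10, 2, 1, 20)
```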
In practice, when generating the second judgment result, the semantic server may use multiple preset session-semantic-type models to determine multiple candidate values, and select the largest of them as the value of the second judgment result. Each preset session-semantic-type model determines one candidate value for the speech recognition result.
Specifically, a preset session-semantic-type model here may be a vertical-domain model, a mapping table, or the like. For example, a vertical-domain model may be a date model, a navigation model, and so on, and such a vertical-domain model may be a neural network model. If the vertical-domain model is a neural network, the semantic server can feed the first judgment result and the speech recognition result into the vertical-domain model and obtain the second judgment result output by it.
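The max-over-candidates rule for the second judgment result can be sketched as follows; the two stand-in models are hypothetical, since the patent does not specify the internals of the vertical-domain models.

```python
from typing import Callable, Sequence

def second_judgment_value(recognition_result: str,
                          models: Sequence[Callable[[str], float]]) -> float:
    """Each preset session-semantic-type (vertical-domain) model yields one
    candidate value for the recognition result; the largest candidate
    becomes the value of the second judgment result."""
    return max(model(recognition_result) for model in models)

# Hypothetical stand-ins for a "date" model and a "weather" model.
date_model = lambda text: 0.2
weather_model = lambda text: 0.9
assert second_judgment_value("tomorrow", [date_model, weather_model]) == 0.9
```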
Step 203: the reply text is sent to the speech synthesis server, and the reply voice received from the speech synthesis server is forwarded to the terminal device.
In the present embodiment, the execution body can send the received reply text to the speech synthesis server so that the speech synthesis server performs speech synthesis and obtains the reply voice. The execution body can then receive the reply voice sent by the speech synthesis server and forward it to the terminal device. Concretely, the speech synthesis performed by the speech synthesis server may be text-to-speech (TTS) processing of the received reply text, producing a voice that can be played to the user.
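Step 203's relay can be sketched as below; `tts_server` and `terminal` are hypothetical stand-ins for the real network endpoints, and the method names are assumptions for illustration only.

```python
def relay_reply(reply_text: str, tts_server, terminal) -> bytes:
    """The execution body sends the reply text to the speech synthesis
    (TTS) server, receives the synthesized reply voice, and forwards it
    to the terminal device."""
    reply_voice = tts_server.synthesize(reply_text)  # text-to-speech step
    terminal.send(reply_voice)                       # forward to terminal
    return reply_voice
```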
In some optional implementations of the present embodiment, the speech recognition server, the semantic server, and the speech synthesis server are deployed in the same local area network.
In these optional implementations, placing the three servers in the same local area network can speed up communication between the speech recognition server and the semantic server, as well as between the speech recognition server and the speech synthesis server.
In the prior art, after obtaining information, the terminal device needs to generate requests and send them in turn to the speech recognition server, the semantic server, and the speech synthesis server; it must then wait for each server to feed information back before it can proceed, and the whole process consumes considerable time. By contrast, the present embodiment omits this process by passing information directly between the servers, effectively saving processing time and thereby shortening the terminal device's reaction time when interacting with the user.
As shown in Fig. 3, the present application also provides a speech processing system including a speech recognition server 310, a semantic server 320, and a speech synthesis server 330.
The speech recognition server 310 is configured to receive the user speech sent by the terminal device, perform speech recognition on the user speech to obtain a speech recognition result, send the speech recognition result to the semantic server 320, send the reply text returned by the semantic server 320 to the speech synthesis server 330, receive the reply voice for the reply text sent by the speech synthesis server 330, and send the reply voice to the terminal device.
In some optional implementations of the present embodiment, the speech recognition server 310, the semantic server 320, and the speech synthesis server 330 are deployed in the same local area network.
In some optional implementations of the present embodiment, the speech recognition server 310 is further configured to send the speech recognition result to the terminal device in response to obtaining it. In addition, the speech recognition server 310 is further configured to send the reply text to the terminal device in response to receiving it.
In some optional implementations of the present embodiment, the terminal device is further configured to, in response to receiving the reply voice but not receiving at least one of the speech recognition result and the reply text, display and play a preset reply sentence.
Specifically, if the terminal device has received the reply voice but not the speech recognition result and/or the reply text, it can display the text corresponding to the preset reply sentence and play its voice. For example, the preset reply sentence may be "The network is unstable, please try again later." In this way, these embodiments avoid displaying incomplete information and prevent the user from failing to obtain the reply accurately.
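This terminal-side fallback can be sketched as follows; the default sentence is the example from the text, while the function name and the use of `None` for a missing item are illustrative assumptions.

```python
DEFAULT_REPLY = "The network is unstable, please try again later."

def text_to_show(reply_voice, recognition_result, reply_text):
    """If the reply voice arrived but the recognition result or the reply
    text is missing, the terminal shows and plays the preset default
    reply instead of displaying incomplete information."""
    if reply_voice is not None and (recognition_result is None
                                    or reply_text is None):
        return DEFAULT_REPLY
    return reply_text

assert text_to_show(b"v", None, "hi") == DEFAULT_REPLY      # missing result
assert text_to_show(b"v", "rec", "hi") == "hi"              # all received
```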
In some optional implementations of the present embodiment, the semantic server is further configured to receive a text generation request, where the text generation request is sent to the semantic server by the terminal device in response to not receiving the reply text and the reply voice within a first preset time period; the text generation request includes the speech recognition result, and the first preset time period starts at the moment the terminal device receives the speech recognition result.
Specifically, if the terminal device has not received the reply text and the reply voice after receiving the speech recognition result, it can send a text generation request including the speech recognition result to the semantic server 320. The semantic server 320 can then receive the text generation request, process the speech recognition result, and generate the reply text. The request here is information asking the semantic server 320 to generate the reply text. Afterwards, the semantic server 320 can feed the reply text back to the terminal device, which can then send a speech synthesis request including the reply text to the speech synthesis server 330 and receive the reply voice fed back by the speech synthesis server 330.
In these implementations, the semantic server can receive a request sent by the terminal device when the reply text and the reply voice have not been received, ensuring that the voice interaction proceeds smoothly.
In some optional implementations of the present embodiment, the speech synthesis server is further configured to receive a speech synthesis request, where the speech synthesis request is sent to the speech synthesis server by the terminal device in response to receiving the reply text but not the reply voice within a second preset time period; the speech synthesis request includes the reply text, and the second preset time period starts at the moment the terminal device receives the speech recognition result or the reply text.
Specifically, if the terminal device has received the speech recognition result and the reply text but not the reply voice, it can send a speech synthesis request to the speech synthesis server 330, which can then process the reply text, generate the reply voice, and feed the reply voice back to the terminal device.
These implementations allow a request to be sent to the speech synthesis server 330 when the reply voice has not been received, ensuring that the voice interaction proceeds smoothly.
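The two timeout-driven retries described above can be sketched together as follows; the period lengths, the shared time origin, and the return labels are illustrative assumptions (the patent allows the second period to start from either the recognition result or the reply text).

```python
def pending_request(elapsed: float, reply_text, reply_voice,
                    first_period: float = 3.0,
                    second_period: float = 3.0):
    """elapsed: seconds since the terminal received the speech recognition
    result (taken here as the time origin of both preset periods).

    - Neither reply text nor reply voice within the first preset period:
      send a text generation request to the semantic server.
    - Reply text received but no reply voice within the second preset
      period: send a speech synthesis request to the TTS server.
    """
    if reply_text is None and reply_voice is None and elapsed >= first_period:
        return "text_generation_request"
    if reply_text is not None and reply_voice is None \
            and elapsed >= second_period:
        return "speech_synthesis_request"
    return None
```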
In some optional implementations of the present embodiment, before sending the speech recognition result to the semantic server, the speech recognition server is further configured to judge whether the speech recognition result is valid and related to the recognition result of the previous utterance, and to generate a first judgment result, where the previous utterance and the user speech belong to the same wake-up interaction session; the speech recognition server is further configured to send the speech recognition result to the semantic server; the semantic server is further configured to judge whether the speech recognition result matches a preset session semantic type and to generate a second judgment result; and, before sending the speech recognition result to the terminal device, the speech recognition server is further configured to receive the second judgment result fed back by the semantic server and to determine, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
In some optional implementations of the present embodiment, the speech recognition server is further configured to send the speech recognition result to the terminal device in response to determining that the user speech is meaningful speech.
In some optional implementations of the present embodiment, the speech recognition server is further configured to determine that the user speech is meaningful speech in response to determining that at least one of the first judgment result and the second judgment result is yes.
In some optional implementations of the present embodiment, the first judgment result and the second judgment result are expressed as numeric values: the value of the first judgment result characterizes the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizes the probability that the speech recognition result matches a preset session semantic type; the speech recognition server is further configured to determine the sum of the value of the first judgment result and the value of the second judgment result and, in response to determining that the sum is greater than or equal to a preset threshold, to determine that the user speech is meaningful speech.
In some optional implementations of the present embodiment, the semantic server is further configured to determine multiple candidate values using multiple preset session-semantic-type models, and to take the largest of the multiple candidate values as the value of the second judgment result.
As with the method embodiment above, in the prior art the terminal device needs to generate requests after obtaining information, send them in turn to the speech recognition server, the semantic server, and the speech synthesis server, and then wait for each server to feed information back, a process that consumes considerable time. The system of the present embodiment omits this process by passing information directly between the servers, effectively saving processing time and thereby shortening the terminal device's reaction time when interacting with the user.
With further reference to Fig. 4, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a speech processing apparatus; this apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can be applied to various electronic devices.
As shown in Fig. 4, the speech processing apparatus 400 of the present embodiment includes a speech recognition unit 401, a text generation unit 402, and a feedback unit 403. The speech recognition unit 401 is configured to receive the user speech sent by the terminal device, perform speech recognition on the user speech, and obtain a speech recognition result. The text generation unit 402 is configured to send the speech recognition result to the semantic server and to receive at least one reply text returned by the semantic server for the speech recognition result. The feedback unit 403 is configured to send a reply text among the at least one reply text to the speech synthesis server, and to forward the received reply voice sent by the speech synthesis server to the terminal device, where the reply voice is generated based on the reply text sent to the speech synthesis server.
In some embodiments, the speech recognition unit 401 of the speech processing apparatus 400 can receive the user speech sent by the terminal device and perform speech recognition on it to obtain a speech recognition result. Specifically, speech recognition is the process of converting speech into corresponding text, and the speech recognition result here refers to the text obtained by that conversion.
In some embodiments, the text generation unit 402 can send the obtained speech recognition result to the semantic server and receive the reply text returned by the semantic server. The reply text here is the reply for the above speech recognition result. Specifically, the semantic server can analyze and process the speech recognition result to obtain, during interaction with the user, the reply text used to reply to the user.
In some embodiments, the feedback unit 403 can send the received reply text to the speech synthesis server so that the speech synthesis server performs speech synthesis and obtains the reply voice. The apparatus can then receive the reply voice sent by the speech synthesis server and forward it to the terminal device.
In some optional implementations of the present embodiment, the speech recognition server, the semantic server, and the speech synthesis server are deployed in the same local area network.
In some optional implementations of the present embodiment, the apparatus further includes: a first sending unit configured to send the speech recognition result to the terminal device in response to obtaining the speech recognition result; and a second sending unit configured to send the reply text to the terminal device in response to receiving the reply text.
In some optional implementations of the present embodiment, the apparatus further includes: a judging unit configured to, before the speech recognition result is sent to the semantic server, judge whether the speech recognition result is valid and related to the recognition result of the previous utterance, and generate a first judgment result, where the previous utterance and the user speech belong to the same wake-up interaction session. The text generation unit includes: a first sending module configured to send the speech recognition result to the semantic server, so that the semantic server judges whether the speech recognition result matches a preset session semantic type and generates a second judgment result. The apparatus further includes: a receiving unit configured to, before the speech recognition result is sent to the terminal device, receive the second judgment result fed back by the semantic server and determine, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
In some optional implementations of the present embodiment, the first sending unit includes: a second sending module configured to send the speech recognition result to the terminal device in response to determining that the user speech is meaningful speech.
In some optional implementations of the present embodiment, the receiving unit includes: a determining module configured to determine that the user speech is meaningful speech in response to determining that at least one of the first judgment result and the second judgment result is yes.
In some optional implementations of the present embodiment, the first judgment result and the second judgment result are expressed as numeric values: the value of the first judgment result characterizes the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizes the probability that the speech recognition result matches a preset session semantic type. Determining, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech includes: determining the sum of the value of the first judgment result and the value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user speech is meaningful speech.
In some optional implementations of the present embodiment, the value of the second judgment result is the largest of multiple candidate values determined by the semantic server using multiple preset session-semantic-type models.
As shown in Fig. 5, the electronic device 500 may include a processing unit 501 (e.g., a central processing unit or a graphics processor), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data needed for the operation of the electronic device 500. The processing unit 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504, to which an input/output (I/O) interface 505 is also connected.
In general, the following devices can be connected to the I/O interface 505: input devices 506 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 507 such as a liquid crystal display (LCD), speaker, and vibrator; storage devices 508 such as magnetic tape and hard disk; and a communication device 509. The communication device 509 can allow the electronic device 500 to communicate, wired or wirelessly, with other devices to exchange data. Although Fig. 5 shows an electronic device 500 with various devices, it should be understood that implementing or providing all of the devices shown is not required; more or fewer devices may alternatively be implemented or provided. Each box shown in Fig. 5 can represent one device or, as needed, multiple devices.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication device 509, installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing unit 501, the above-described functions defined in the methods of the embodiments of the present disclosure are performed. It should be noted that the computer-readable medium of the embodiments of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device. A computer-readable signal medium, in the embodiments of the present disclosure, may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium can be transmitted by any suitable medium, including but not limited to: a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to the various embodiments of the present application. In this regard, each box in a flowchart or block diagram can represent a module, a program segment, or part of code, which contains one or more executable instructions for implementing the specified logic function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that shown in the drawings; for example, two boxes shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application can be implemented in software or in hardware. The described units can also be provided in a processor; for example, a processor can be described as including a speech recognition unit, a text generation unit, and a feedback unit. The names of these units do not, in certain cases, limit the units themselves; for example, the speech recognition unit can also be described as "a unit that receives the user speech sent by the terminal device, performs speech recognition on the user speech, and obtains a speech recognition result".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: receive user speech sent by a terminal device, and perform speech recognition on the user speech to obtain a speech recognition result; send the speech recognition result to a semantic server, and receive a reply text returned by the semantic server for the speech recognition result; and send the reply text to a speech synthesis server, and forward the received reply voice sent by the speech synthesis server to the terminal device.
The above description is merely a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should appreciate that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, such as technical solutions formed by substituting the above features with technical features of similar functions disclosed in (but not limited to) the present application.
Claims (21)
1. A speech processing method for a speech recognition server, the method comprising:
receiving user speech sent by a terminal device, and performing speech recognition on the user speech to obtain a speech recognition result;
sending the speech recognition result to a semantic server, and receiving a reply text returned by the semantic server for the speech recognition result; and
sending the reply text to a speech synthesis server, and forwarding the received reply voice sent by the speech synthesis server to the terminal device.
2. The method according to claim 1, wherein the speech recognition server, the semantic server, and the speech synthesis server are deployed in the same local area network.
3. The method according to claim 1, wherein the method further comprises:
in response to obtaining the speech recognition result, sending the speech recognition result to the terminal device; and
in response to receiving the reply text, sending the reply text to the terminal device.
4. The method according to claim 3, wherein, before the sending the speech recognition result to the semantic server, the method further comprises:
judging whether the speech recognition result is valid and related to the recognition result of a previous utterance, and generating a first judgment result, wherein the previous utterance and the user speech belong to the same wake-up interaction session;
the sending the speech recognition result to the semantic server comprises:
sending the speech recognition result to the semantic server, so that the semantic server judges whether the speech recognition result matches a preset session semantic type and generates a second judgment result; and
before the sending the speech recognition result to the terminal device, the method further comprises:
receiving the second judgment result fed back by the semantic server, and determining, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech.
5. The method according to claim 4, wherein the sending the speech recognition result to the terminal device comprises:
in response to determining that the user speech is meaningful speech, sending the speech recognition result to the terminal device.
6. The method according to claim 4, wherein the determining, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech comprises:
in response to determining that at least one of the first judgment result and the second judgment result is yes, determining that the user speech is meaningful speech.
7. The method according to claim 4, wherein the first judgment result and the second judgment result are expressed as numeric values, the value of the first judgment result characterizing the probability that the speech recognition result is valid and related to the recognition result of the previous utterance, and the value of the second judgment result characterizing the probability that the speech recognition result matches the preset session semantic type; and
the determining, based on the first judgment result and the second judgment result, whether the user speech is meaningful speech comprises:
determining the sum of the value of the first judgment result and the value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user speech is meaningful speech.
8. The method according to claim 7, wherein the value of the second judgment result is the largest of multiple candidate values determined by the semantic server using multiple preset session-semantic-type models.
9. A speech processing system, comprising a speech recognition server, a semantic server, and a speech synthesis server;
wherein the speech recognition server is configured to receive user speech sent by a terminal device, perform speech recognition on the user speech to obtain a speech recognition result, send the speech recognition result to the semantic server, send the reply text returned by the semantic server to the speech synthesis server, receive the reply voice for the reply text sent by the speech synthesis server, and send the reply voice to the terminal device.
10. The system according to claim 9, wherein the speech recognition server, the semantic server, and the speech synthesis server are deployed in the same local area network.
11. The system according to claim 9, wherein
the speech recognition server is further configured to send the speech recognition result to the terminal device in response to obtaining the speech recognition result; and
the speech recognition server is further configured to send the reply text to the terminal device in response to receiving the reply text.
12. The system according to any one of claims 9-11, wherein
the semantic server is further configured to receive a text generation request, wherein the text generation request is sent to the semantic server by the terminal device in response to receiving neither the reply text nor the reply voice within a first preset time period, the text generation request comprises the speech recognition result, and the first preset time period takes the time at which the terminal device receives the speech recognition result as its starting point.
13. The system according to any one of claims 9-11, wherein
the speech synthesis server is further configured to receive a speech synthesis request, wherein the speech synthesis request is sent to the speech synthesis server by the terminal device in response to receiving the reply text but not the reply voice within a second preset time period, the speech synthesis request comprises the reply text, and the second preset time period takes the time at which the terminal device receives the speech recognition result or receives the reply text as its starting point.
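The terminal-side fallbacks of claims 12-13 amount to two timers started when the speech recognition result arrives. A hedged sketch, with hypothetical period lengths and request names:

```python
def terminal_fallback(now, t_recognition, t_reply_text, t_reply_voice,
                      first_period=2.0, second_period=4.0):
    """Sketch of the claim 12-13 fallbacks. Timestamps are in seconds;
    None means 'not yet received'. Returns the request the terminal
    device should issue directly to the back-end server, if any."""
    # Claim 12: neither reply text nor reply voice arrived within the
    # first preset period (measured from receipt of the recognition
    # result) -> ask the semantic server directly for the reply text.
    if t_reply_text is None and t_reply_voice is None:
        if now - t_recognition >= first_period:
            return "text_generation_request"
    # Claim 13: the reply text arrived but the reply voice did not
    # within the second preset period -> ask the speech synthesis
    # server directly for the reply voice.
    elif t_reply_voice is None:
        if now - t_recognition >= second_period:
            return "speech_synthesis_request"
    return None
```

This lets the terminal device recover from a stalled relay at whichever stage it broke, instead of waiting indefinitely for the speech recognition server.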
14. The system according to claim 11, wherein
the speech recognition server is further configured to, before sending the speech recognition result to the semantic server, judge whether the speech recognition result is valid and related to the recognition result of a previous voice, and generate a first judging result, wherein the previous voice and the user speech belong to the same wake-up interaction session;
the speech recognition server is further configured to send the speech recognition result to the semantic server;
the semantic server is further configured to judge whether the speech recognition result meets a preset conversational semantic type, and generate a second judging result; and
the speech recognition server is further configured to, before sending the speech recognition result to the terminal device, receive the second judging result fed back by the semantic server, and determine, based on the first judging result and the second judging result, whether the user speech is meaningful voice.
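The two-stage gating of claim 14 can be sketched as a flow (illustrative only; the callables and the combination rule are hypothetical — claims 16-17 give two concrete combination rules):

```python
def gated_forward(recognition_result, first_judge, semantic_server,
                  terminal_send, combine):
    """Sketch of claim 14: a local first judgment before contacting the
    semantic server, a second judgment fed back by it, and a combined
    decision before forwarding the result to the terminal device."""
    first = first_judge(recognition_result)                   # local first judging result
    second, reply_text = semantic_server(recognition_result)  # second judging result fed back
    if combine(first, second):                                # user speech is meaningful voice
        terminal_send(recognition_result)                     # forward to the terminal device
        return reply_text
    return None  # meaningless speech: nothing is forwarded
```

With `combine = lambda a, b: a or b` this reproduces the "at least one is affirmative" rule of claim 16; a threshold on the sum of two numerical scores gives the rule of claim 17.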
15. The system according to claim 14, wherein the speech recognition server is further configured to, in response to determining that the user speech is meaningful voice, send the speech recognition result to the terminal device.
16. The system according to claim 14, wherein
the speech recognition server is further configured to, in response to determining that at least one of the first judging result and the second judging result is affirmative, determine that the user speech is meaningful voice.
17. The system according to claim 14, wherein the first judging result and the second judging result are expressed in numerical form, the numerical value of the first judging result is used to characterize a probability that the speech recognition result is valid and related to the recognition result of a previous voice, and the numerical value of the second judging result is used to characterize a probability that the speech recognition result meets a preset conversational semantic type; and
the speech recognition server is further configured to determine a sum of the numerical value of the first judging result and the numerical value of the second judging result, and, in response to determining that the sum is greater than or equal to a preset threshold, determine that the user speech is meaningful voice.
18. The system according to claim 17, wherein
the semantic server is further configured to determine a plurality of candidate values using a plurality of preset conversational semantic type models, and determine the maximum value among the plurality of candidate values as the numerical value of the second judging result.
19. A speech processing apparatus for a speech recognition server, the apparatus comprising:
a speech recognition unit, configured to receive user speech sent by a terminal device and perform speech recognition on the user speech to obtain a speech recognition result;
a text generation unit, configured to send the speech recognition result to a semantic server and receive at least one reply text returned by the semantic server for the speech recognition result; and
a feedback unit, configured to send a reply text of the at least one reply text to a speech synthesis server and forward the received reply voice sent by the speech synthesis server to the terminal device, wherein the reply voice is generated based on the reply text sent to the speech synthesis server.
20. An electronic device, comprising:
one or more processors; and
a storage apparatus for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-8.
21. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111108547.XA CN113823282A (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
CN201910563423.7A CN110223694B (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910563423.7A CN110223694B (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111108547.XA Division CN113823282A (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110223694A true CN110223694A (en) | 2019-09-10 |
CN110223694B CN110223694B (en) | 2021-10-15 |
Family
ID=67814866
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111108547.XA Pending CN113823282A (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
CN201910563423.7A Active CN110223694B (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111108547.XA Pending CN113823282A (en) | 2019-06-26 | 2019-06-26 | Voice processing method, system and device |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN113823282A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477224A (en) * | 2020-03-23 | 2020-07-31 | 一汽奔腾轿车有限公司 | Human-vehicle virtual interaction system |
WO2021135713A1 (en) * | 2019-12-30 | 2021-07-08 | 华为技术有限公司 | Text-to-voice processing method, terminal and server |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9269354B2 (en) * | 2013-03-11 | 2016-02-23 | Nuance Communications, Inc. | Semantic re-ranking of NLU results in conversational dialogue applications |
CN107316643A (en) * | 2017-07-04 | 2017-11-03 | 科大讯飞股份有限公司 | Voice interactive method and device |
CN107943834A (en) * | 2017-10-25 | 2018-04-20 | 百度在线网络技术(北京)有限公司 | Interactive implementation method, device, equipment and storage medium |
CN108877792A (en) * | 2018-05-30 | 2018-11-23 | 北京百度网讯科技有限公司 | For handling method, apparatus, electronic equipment and the computer readable storage medium of voice dialogue |
CN109545185A (en) * | 2018-11-12 | 2019-03-29 | 百度在线网络技术(北京)有限公司 | Interactive system evaluation method, evaluation system, server and computer-readable medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105529028B (en) * | 2015-12-09 | 2019-07-30 | 百度在线网络技术(北京)有限公司 | Speech analysis method and apparatus |
CN106373569B (en) * | 2016-09-06 | 2019-12-20 | 北京地平线机器人技术研发有限公司 | Voice interaction device and method |
EP3561643B1 (en) * | 2017-01-20 | 2023-07-19 | Huawei Technologies Co., Ltd. | Method and terminal for implementing voice control |
CN107146618A (en) * | 2017-06-16 | 2017-09-08 | 北京云知声信息技术有限公司 | Method of speech processing and device |
2019
- 2019-06-26 CN CN202111108547.XA patent/CN113823282A/en active Pending
- 2019-06-26 CN CN201910563423.7A patent/CN110223694B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN113823282A (en) | 2021-12-21 |
CN110223694B (en) | 2021-10-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20211014
Address after: 100176, 101, Floor 1, Building 1, Yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing
Patentee after: Apollo Zhilian (Beijing) Technology Co., Ltd.
Address before: 100085 Baidu Building, No. 10 Shangdi 10th Street, Haidian District, Beijing
Patentee before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co., Ltd.