CN110223694B - Voice processing method, system and device

Info

Publication number
CN110223694B
Authority
CN
China
Prior art keywords
voice
server
speech
recognition result
result
Prior art date
Legal status
Active
Application number
CN201910563423.7A
Other languages
Chinese (zh)
Other versions
CN110223694A
Inventor
陈建哲
欧阳能钧
袁鼎
Current Assignee
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN201910563423.7A
Priority to CN202111108547.XA (published as CN113823282A)
Publication of CN110223694A
Application granted
Publication of CN110223694B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application disclose a voice processing method, system, and device. One embodiment of the method comprises: receiving user voice sent by a terminal device and performing voice recognition on the user voice to obtain a voice recognition result; sending the voice recognition result to a semantic server and receiving a reply text, returned by the semantic server, for the voice recognition result; and sending the reply text to a voice synthesis server and forwarding the received reply voice sent by the voice synthesis server to the terminal device. Because the terminal device no longer needs to parse the results returned by each server or generate the follow-up requests, processing time is saved and the terminal device's reaction time when interacting with the user is shortened.

Description

Voice processing method, system and device
Technical Field
Embodiments of the present application relate to the field of computer technology, in particular to Internet technology, and specifically to a voice processing method, system and device.
Background
In the related art, voice interaction between a user and a terminal device often requires multiple exchanges between the terminal device and several servers. Typically, the terminal device must send processing requests to a voice recognition server, a semantic recognition server, and a voice synthesis server in turn in order to interact with each of them.
Before the terminal device can send each processing request, it must first analyze the previous server's response, which slows its reaction speed during voice interaction with the user. Moreover, the repeated communication between the terminal device and the servers consumes a large amount of time.
Disclosure of Invention
The embodiment of the application provides a voice processing method, a system and a device.
In a first aspect, an embodiment of the present application provides a speech processing method, which is used for a speech recognition server, and the method includes: receiving user voice sent by terminal equipment, and carrying out voice recognition on the user voice to obtain a voice recognition result; sending a voice recognition result to a semantic server, and receiving a reply text aiming at the voice recognition result returned by the semantic server; and sending a reply text to the voice synthesis server, and forwarding the received reply voice sent by the voice synthesis server to the terminal equipment.
In some embodiments, the voice recognition server is disposed in the same local area network as the semantic server and the voice synthesis server.
In some embodiments, the method further comprises: responding to the obtained voice recognition result, and sending the voice recognition result to the terminal equipment; and the method further comprises: and responding to the received reply text, and sending the reply text to the terminal equipment.
In some embodiments, prior to sending the speech recognition result to the semantic server, the method further comprises: judging whether the voice recognition result is valid and related to the recognition result of the previous voice to generate a first judgment result, wherein the previous voice and the user voice are in the same wake-up interaction process; and sending the speech recognition result to the semantic server includes: sending the voice recognition result to the semantic server so that the semantic server judges whether the voice recognition result accords with a preset session semantic type and generates a second judgment result; and before sending the voice recognition result to the terminal device, the method further comprises: receiving the second judgment result fed back by the semantic server, and determining whether the user voice is a meaningful voice based on the first judgment result and the second judgment result.
In some embodiments, sending the speech recognition result to the terminal device includes: and responding to the determination that the user voice is meaningful voice, and sending a voice recognition result to the terminal equipment.
In some embodiments, determining whether the user voice is a meaningful voice based on the first determination result and the second determination result includes: in response to determining that at least one of the first determination result and the second determination result is yes, determining that the user speech is a meaningful speech.
In some embodiments, the first determination result and the second determination result are represented in numerical form, where the numerical value of the first determination result represents a probability that the speech recognition result is valid and related to the recognition result of the previous speech, and the numerical value of the second determination result represents a probability that the speech recognition result conforms to the preset session semantic type; and determining whether the user voice is a meaningful voice based on the first determination result and the second determination result includes: determining the sum of the numerical value of the first determination result and the numerical value of the second determination result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user voice is a meaningful voice.
In some embodiments, the value of the second determination result is the largest value among a plurality of candidate values determined by the semantic server using a plurality of predetermined session semantic type models.
In a second aspect, an embodiment of the present application provides a speech processing apparatus for a speech recognition server, where the apparatus includes: the voice recognition unit is configured to receive the user voice sent by the terminal equipment, and perform voice recognition on the user voice to obtain a voice recognition result; the text generation unit is configured to send the voice recognition result to the semantic server and receive at least one reply text which is returned by the semantic server and aims at the voice recognition result; and the feedback unit is configured to send a reply text in the at least one reply text to the voice synthesis server, and forward the received reply voice sent by the voice synthesis server to the terminal equipment, wherein the reply voice is generated based on the reply text sent by the voice synthesis server.
In some embodiments, the voice recognition server is disposed in the same local area network as the semantic server and the voice synthesis server.
In some embodiments, the apparatus further comprises: a first sending unit configured to send the voice recognition result to the terminal device in response to obtaining the voice recognition result; and a second sending unit configured to send the reply text to the terminal device in response to receiving the reply text.
In some embodiments, the apparatus further comprises: a judging unit configured to judge, before the voice recognition result is sent to the semantic server, whether the voice recognition result is valid and related to the recognition result of the previous voice, and to generate a first judgment result, wherein the previous voice and the user voice are in the same wake-up interaction process; and the text generation unit includes: a first sending module configured to send the voice recognition result to the semantic server so that the semantic server judges whether the voice recognition result accords with a preset session semantic type and generates a second judgment result; and the apparatus further comprises: a receiving unit configured to receive the second judgment result fed back by the semantic server before the voice recognition result is sent to the terminal device, and to determine whether the user voice is a meaningful voice based on the first judgment result and the second judgment result.
In some embodiments, the first sending unit comprises: a second sending module configured to send the voice recognition result to the terminal device in response to determining that the user voice is a meaningful voice.
In some embodiments, the receiving unit comprises: a determination module configured to determine the user speech as meaningful speech in response to determining at least one of the first determination result and the second determination result is yes.
In some embodiments, the first determination result and the second determination result are represented in numerical form, where the numerical value of the first determination result represents a probability that the speech recognition result is valid and related to the recognition result of the previous speech, and the numerical value of the second determination result represents a probability that the speech recognition result conforms to the preset session semantic type; and determining whether the user voice is a meaningful voice based on the first determination result and the second determination result includes: determining the sum of the numerical value of the first determination result and the numerical value of the second determination result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user voice is a meaningful voice.
In some embodiments, the value of the second determination result is the largest value among a plurality of candidate values determined by the semantic server using a plurality of predetermined session semantic type models.
In a third aspect, an embodiment of the present application provides a speech processing system, including a speech recognition server, a semantic server, and a speech synthesis server; and the voice recognition server is used for receiving the user voice sent by the terminal equipment, carrying out voice recognition on the user voice to obtain a voice recognition result, sending the voice recognition result to the semantic server, sending a reply text returned by the semantic server to the voice synthesis server, receiving the reply voice of the reply text sent by the voice synthesis server, and sending the reply voice to the terminal equipment.
In some embodiments, the voice recognition server is disposed in the same local area network as the semantic server and the voice synthesis server.
In some embodiments, the voice recognition server is further configured to send the voice recognition result to the terminal device in response to obtaining the voice recognition result; and the voice recognition server is also used for responding to the received reply text and sending the reply text to the terminal equipment.
In some embodiments, the semantic server is further configured to receive a text generation request, where the text generation request is sent to the semantic server by the terminal device in response to not receiving the reply text and the reply voice within a first preset time period, the text generation request includes a voice recognition result, and the first preset time period starts when the terminal device receives the voice recognition result.
In some embodiments, the speech synthesis server is further configured to receive a speech synthesis request, where the speech synthesis request is sent to the speech synthesis server by the terminal device in response to receiving the reply text but not receiving the reply voice within a second preset time period; the speech synthesis request includes the reply text, and the second preset time period starts when the terminal device receives the speech recognition result or the reply text.
In some embodiments, the speech recognition server, before sending the speech recognition result to the semantic server, is further configured to determine whether the speech recognition result is valid and related to a recognition result of a previous speech, and generate a first determination result, where the previous speech and the user speech are in a same wake-up interaction process; the voice recognition server is also used for sending a voice recognition result to the semantic server; the semantic server is also used for judging whether the voice recognition result accords with the preset session semantic type and generating a second judgment result; and the voice recognition server is used for receiving a second judgment result fed back by the semantic server before sending the voice recognition result to the terminal equipment, and determining whether the voice of the user is meaningful voice or not based on the first judgment result and the second judgment result.
In some embodiments, the voice recognition server is further configured to send the voice recognition result to the terminal device in response to determining that the user voice is meaningful voice.
In some embodiments, the speech recognition server is further configured to determine the user speech as the meaningful speech in response to determining at least one of the first determination result and the second determination result is yes.
In some embodiments, the first determination result and the second determination result are represented in numerical form, where the numerical value of the first determination result represents a probability that the speech recognition result is valid and related to the recognition result of the previous speech, and the numerical value of the second determination result represents a probability that the speech recognition result conforms to the preset session semantic type; and the voice recognition server is further configured to determine the sum of the numerical value of the first determination result and the numerical value of the second determination result and, in response to determining that the sum is greater than or equal to a preset threshold, to determine that the user voice is a meaningful voice.
In some embodiments, the semantic server is further configured to determine a plurality of candidate values using a plurality of predefined session semantic type models; and determining the maximum value in the plurality of candidate values as the value of the second judgment result.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement a method as in any embodiment of the speech processing method.
In a fifth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method as in any embodiment of the speech processing method.
According to the voice processing scheme provided by the embodiments of the present application, the user voice sent by the terminal device is first received and voice recognition is performed on it to obtain a voice recognition result. The voice recognition result is then sent to the semantic server, and a reply text for the voice recognition result returned by the semantic server is received. Finally, the reply text is sent to the voice synthesis server, and the received reply voice sent by the voice synthesis server is forwarded to the terminal device. Because the terminal device no longer needs to parse the results returned by each server or generate the follow-up requests, processing time is saved and the terminal device's reaction time when interacting with the user is shortened.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram for one embodiment of a speech processing method according to the present application;
FIG. 3 is a schematic block diagram of one embodiment of a speech processing system according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a speech processing apparatus according to the present application;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the speech processing method or speech processing apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and servers 103, 104, 105. The network 102 is the medium used to provide communication links between the terminal device 101 and the servers 103, 104, 105. The network 102 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal device 101 to interact with the servers 103, 104, 105 over the network 102 to receive or send messages or the like. Various communication client applications, such as a voice processing application, a video application, a live application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101.
Here, the terminal apparatus 101 may be hardware or software. When the terminal device 101 is hardware, it may be various electronic devices with a display screen, including but not limited to a smart phone, a tablet computer, an e-book reader, a laptop portable computer, a desktop computer, and the like. When the terminal apparatus 101 is software, it can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The servers 103, 104, 105 may be servers providing various services, and may include a speech recognition server, a semantic server, and a speech synthesis server. In practice, the servers 103, 104, 105 may be located within the same local area network, for example as background servers that provide support for the terminal device 101. Such a background server may analyze and otherwise process received data such as the user voice, and feed a processing result (e.g., a reply voice) back to the terminal device.
It should be noted that the voice processing method provided in the embodiment of the present application may be executed by the servers 103, 104, and 105 or the terminal device 101, and accordingly, the voice processing apparatus may be disposed in the servers 103, 104, and 105 or the terminal device 101.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a speech processing method according to the present application is shown. The voice processing method comprises the following steps:
step 201, receiving a user voice sent by a terminal device, and performing voice recognition on the user voice to obtain a voice recognition result.
In the present embodiment, the execution body of the voice processing method (e.g., the server shown in fig. 1) may receive the user voice transmitted by the terminal device and perform voice recognition on it to obtain a voice recognition result. Specifically, speech recognition is the process of converting speech into corresponding text; the speech recognition result here refers to the converted text.
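As a minimal illustration of this step (the engine, names, and message shape below are assumptions for the sketch, not the patent's implementation):

```python
# Illustrative sketch of step 201 on the speech recognition server.
# AsrEngine and RecognitionResult are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class RecognitionResult:
    text: str          # the converted text, i.e. the voice recognition result
    confidence: float  # engine confidence in [0, 1]

class AsrEngine:
    """Stand-in for any speech-to-text engine."""
    def transcribe(self, audio: bytes) -> RecognitionResult:
        # A real engine would decode the audio; this stub returns fixed text.
        return RecognitionResult(text="how is the weather today", confidence=0.9)

def handle_user_voice(audio: bytes, engine: AsrEngine) -> RecognitionResult:
    # Step 201: receive the user voice sent by the terminal device and
    # perform voice recognition on it to obtain a voice recognition result.
    return engine.transcribe(audio)
```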
Step 202, sending the voice recognition result to the semantic server, and receiving a reply text aiming at the voice recognition result returned by the semantic server.
In this embodiment, the execution body may send the obtained speech recognition result to the semantic server and receive a reply text returned by the semantic server, i.e., a reply text for the above speech recognition result. Specifically, the semantic server may analyze and process the speech recognition result to obtain a reply text for replying to the user during the interaction. In general, only one reply text is obtained.
In some optional implementations of this embodiment, the method further includes:
responding to the obtained voice recognition result, and sending the voice recognition result to the terminal equipment; and responding to the received reply text, and sending the reply text to the terminal equipment.
In these alternative implementations, the execution body may, in response to obtaining the voice recognition result, promptly send it to the terminal device, so that the terminal device can display the voice recognition result to the user in time and delayed text output is avoided.
Likewise, in response to receiving the reply text, the execution body may promptly send it to the terminal device, so that the terminal device can display the reply text while broadcasting the reply voice to the user.
In some optional application scenarios of these implementations, before sending the speech recognition result to the semantic server, the method further comprises: judging whether the voice recognition result is valid and related to the recognition result of the previous voice to generate a first judgment result, wherein the previous voice and the user voice are in the same wake-up interaction process; and sending the speech recognition result to the semantic server includes: sending the voice recognition result to the semantic server so that the semantic server judges whether the voice recognition result accords with a preset session semantic type and generates a second judgment result; and before sending the voice recognition result to the terminal device, the method further comprises: receiving the second judgment result fed back by the semantic server, and determining whether the user voice is a meaningful voice based on the first judgment result and the second judgment result.
In these optional application scenarios, the execution body may evaluate the speech recognition result and thereby generate a first judgment result, and send the voice recognition result and the first judgment result to the semantic server so that the semantic server judges whether the voice recognition result conforms to a preset session semantic type. The execution body then further determines whether the user voice is a meaningful voice. Specifically, the execution body needs to determine both whether the speech recognition result is valid and whether it is related to the recognition result of the previous voice; when the speech recognition result is valid and related to the recognition result of the previous voice, the first judgment result can be determined to be yes. The user voice is the voice uttered immediately after the previous voice, in the same wake-up interaction process as the previous voice.
The validity of the voice recognition result may mean that the voice recognition result has a definite meaning and communication is possible by the voice recognition result. For example, the speech recognition result "how is it today" is valid, whereas "o" is invalid. The correlation with the recognition result of the previous speech means that the semantics of the speech uttered before and after are correlated, and the logic of the semantics is continuous. For example, if the recognition result of the previous speech is "how the weather is today" and the recognition result of the speech of the user is "weather in tomorrow", the recognition results of the two speeches are correlated. For another example, if the recognition result of the previous speech is "how the weather is today" and the recognition result of the user speech is "hiccup", the recognition results of the two speeches are not related.
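The patent does not specify how validity or relatedness is computed, so the following is only a toy sketch under assumed heuristics: a short filler-word list for validity, and word overlap with the previous recognition result for relatedness.

```python
# Toy sketch of the first judgment (validity + relatedness).
# The filler list and the word-overlap test are illustrative
# assumptions, not the patented method.

FILLER_UTTERANCES = {"o", "uh", "um", "hiccup"}  # assumed filler words

def is_valid(text: str) -> bool:
    # "Valid" here: the text carries a definite meaning one can
    # communicate with (e.g. "how is the weather today", but not "o").
    cleaned = text.strip().lower()
    return len(cleaned) > 1 and cleaned not in FILLER_UTTERANCES

def is_related(text: str, previous_text: str) -> bool:
    # Crude relatedness proxy: shared words with the previous result,
    # e.g. "weather" links "how is the weather today" and
    # "what about the weather tomorrow".
    return bool(set(text.lower().split()) & set(previous_text.lower().split()))

def first_judgment(text: str, previous_text: str) -> bool:
    # Yes only if the recognition result is valid AND related to the
    # previous voice's recognition result.
    return is_valid(text) and is_related(text, previous_text)
```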
The preset session semantic type is a type of a preset session semantic, and may also be called a vertical type. For example, the preset session semantic types may include a date type, a food type, a navigation type, and the like.
The semantic server can judge whether the voice recognition result conforms to a preset session semantic type in various ways. For example, it may take the keywords of the speech recognition result as target keywords and look up whether the preset keywords corresponding to any preset session semantic type include the target keywords. If so, the second judgment result is that the voice recognition result conforms to the preset session semantic type.
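A minimal sketch of this keyword lookup follows; the keyword table and function names are invented for illustration.

```python
# Sketch of the semantic server's keyword-based second judgment.
# The table below is an invented example of preset session semantic
# types (vertical types) and their preset keywords.

PRESET_TYPE_KEYWORDS = {
    "date":       {"today", "tomorrow", "monday", "weekend"},
    "food":       {"restaurant", "eat", "hungry"},
    "navigation": {"navigate", "route", "destination"},
}

def second_judgment(target_keywords: set) -> bool:
    # Yes if any preset type's keyword set contains one of the
    # target keywords extracted from the recognition result.
    return any(target_keywords & keywords
               for keywords in PRESET_TYPE_KEYWORDS.values())

# Usage: second_judgment({"weather", "tomorrow"}) -> True (date type)
```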
In practice, the execution body may receive the second judgment result fed back by the semantic server and finally determine, based on the first judgment result and the second judgment result, whether the user voice is a meaningful voice. A meaningful voice is one whose recognition result is valid and correlated with the recognition result of the previous voice; whether this holds is judged comprehensively from the first judgment result and the second judgment result.
Specifically, the execution body may combine the first judgment result and the second judgment result in various ways. For example, it may determine that the user voice is a meaningful voice only if both the first judgment result and the second judgment result are yes.
Optionally, sending the speech recognition result to the semantic server may include sending the speech recognition result and the first determination result to the semantic server. Accordingly, the semantic server may determine whether the voice recognition result conforms to the preset session semantic type based on the first determination result and generate a second determination result.
For example, a correspondence table representing a correspondence between the first determination result, the voice recognition result, and the second determination result may be preset, and the semantic server may query the correspondence table and find the second determination result corresponding to the first determination result and the voice recognition result.
Here, the semantic server may feed back not only the second judgment result but also the first judgment result to the execution body, so that the execution body can promptly determine whether the user voice is a meaningful voice based on the two fed-back results.
By generating the first judgment result and the second judgment result to decide whether the user voice is meaningful, these implementations achieve a better analysis of the user voice.
In some optional cases of these application scenarios, sending the speech recognition result to the terminal device may include: and responding to the determination that the user voice is meaningful voice, and sending a voice recognition result to the terminal equipment.
In these cases, the execution body transmits the voice recognition result to the terminal device only if it determines that the user voice is a meaningful voice; if not, it may discard the voice recognition result. In this way, sentences corresponding to meaningless utterances need not be displayed to the user, which reduces invalid processing and makes the device behave more intelligently.
Alternatively, the determining whether the user voice is the meaningful voice based on the first determination result and the second determination result may include: in response to determining that at least one of the first determination result and the second determination result is yes, determining that the user speech is a meaningful speech.
These implementations flexibly determine whether the user voice is meaningful by combining the judgment of the speech recognition server with that of the semantic server, avoiding the erroneous or missed filtering that may occur when either server decides alone. For example, suppose the speech recognition result is "tomorrow" and the previous voice was "how is the weather today". When determining whether the recognition result is related to the previous voice, the speech recognition server may misjudge and produce a negative (unrelated) first judgment result, while the semantic server can still determine that the recognition result conforms to the weather type among the preset session semantic types.
Optionally, the first judgment result and the second judgment result are expressed in numerical form: the numerical value of the first judgment result represents the probability that the voice recognition result is valid and related to the recognition result of the previous voice, and the numerical value of the second judgment result represents the probability that the voice recognition result conforms to the preset session semantic type. Determining whether the user voice is a meaningful voice based on the first judgment result and the second judgment result then includes: determining the sum of the numerical value of the first judgment result and the numerical value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user voice is a meaningful voice.
Specifically, both the first judgment result and the second judgment result may be presented as numerical values: the larger the value, the higher the corresponding probability. For example, suppose the preset threshold is 15. For the recognition result of one user voice, the value of the first judgment result is 5 (out of a full score of 10) and the value of the second judgment result is 10 (out of 10); the sum of the two values is 15, which equals the preset threshold, so the user voice can be determined to be a meaningful voice.
Optionally, determining a weighted sum of the numerical value of the first judgment result and the numerical value of the second judgment result; in response to determining that the weighted sum is greater than or equal to a preset weighted threshold, determining the user speech to be meaningful speech.
That is, the execution body may decide whether the user voice is meaningful not only from the plain sum of the two judgment values: it may also weight the first judgment result and the second judgment result by their preset weights and compare the resulting weighted sum against a preset weighting threshold.
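Both decision rules fit in a few lines; in the sketch below the plain-sum threshold comes from the worked example above, while the weights and weighted threshold are invented placeholders.

```python
# Sketch of the meaningful-speech decision from the two numeric
# judgment results. Weight and threshold values are illustrative.

def is_meaningful_sum(first: float, second: float,
                      threshold: float = 15.0) -> bool:
    # Plain-sum rule: e.g. first = 5, second = 10 (each out of 10)
    # gives 5 + 10 = 15 >= 15, so the user voice is meaningful.
    return first + second >= threshold

def is_meaningful_weighted(first: float, second: float,
                           w1: float = 0.4, w2: float = 0.6,
                           weighted_threshold: float = 8.0) -> bool:
    # Weighted variant: w1, w2 and the threshold are assumptions.
    return w1 * first + w2 * second >= weighted_threshold
```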
In practice, in the process of generating the second judgment result by the semantic server, a plurality of candidate values can be determined by using a plurality of preset session semantic type models, and the maximum value is selected from the candidate values as the value of the second judgment result. Each preset session semantic type model can determine a candidate value for the speech recognition result.
Specifically, the preset session semantic type model may be a vertical type model or a correspondence table, etc. For example, the vertical model may be a date vertical model, a navigation vertical model, and so on. The vertical model here may be a neural network model. For example, if the vertical model is a neural network, the semantic server may input the first determination result and the speech recognition result into the vertical model, and obtain a second determination result output from the vertical model.
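The max-over-models selection can be sketched as follows; the scoring interface is a hypothetical stand-in for the vertical models (e.g., a date vertical model or a navigation vertical model).

```python
# Sketch: the semantic server scores the recognition result with each
# preset session semantic type ("vertical") model and keeps the largest
# candidate value as the second judgment result. The score() interface
# is a hypothetical stand-in for the neural or table-based models.

from typing import Protocol

class VerticalModel(Protocol):
    def score(self, recognition_result: str) -> float:
        """Candidate value: probability that the result fits this type."""

def second_judgment_value(recognition_result: str,
                          models: list) -> float:
    candidates = [m.score(recognition_result) for m in models]
    return max(candidates)  # the largest candidate value wins
```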
Step 203, sending the reply text to the voice synthesis server, and forwarding the received reply voice sent by the voice synthesis server to the terminal device.
In this embodiment, the execution body may send the received reply text to the speech synthesis server so that the speech synthesis server performs speech synthesis to obtain the reply voice. The execution body may then receive the reply voice sent by the speech synthesis server and forward it to the terminal device. Specifically, the speech synthesis performed by the speech synthesis server may be Text-To-Speech (TTS) processing of the received reply text, producing speech that can be broadcast to the user.
In some optional implementation manners of this embodiment, the speech recognition server is disposed in the same local area network as the semantic server and the speech synthesis server.
In these alternative implementations, the speech recognition server and the semantic server and the speech synthesis server may be located within the same local area network. Thus, the communication speed between the voice recognition server and the semantic server can be increased, and the communication speed between the voice recognition server and the voice synthesis server can be increased.
In the prior art, a terminal device needs to generate a request after acquiring information and sequentially send the request to a voice recognition server, a semantic server and a voice synthesis server. In addition, the terminal device must wait for each server to feed back information to the terminal device to obtain the information, and the whole process consumes a lot of time. In contrast, the embodiment omits the above process, and performs information transmission between the servers, thereby effectively saving the processing time, and further shortening the reaction time of the terminal device when the terminal device interacts with the user.
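Putting steps 201-203 together, the server-side chaining that replaces the terminal's three round trips can be sketched as follows; the HTTP transport, endpoint URLs, and the engine object are all assumptions for illustration.

```python
# End-to-end sketch of the speech recognition server chaining steps
# 201-203 itself, so the terminal device only sends the user voice and
# receives the reply voice. Endpoints and transport are hypothetical.

import requests  # assumed HTTP transport between the co-located servers

SEMANTIC_URL = "http://semantic.local/reply"  # same LAN as this server
TTS_URL = "http://synthesis.local/tts"

def process_user_voice(audio: bytes, engine) -> bytes:
    # Step 201: recognize the user voice (engine is any ASR engine,
    # e.g. the AsrEngine stub sketched earlier).
    text = engine.transcribe(audio).text
    # Step 202: ask the semantic server for a reply text.
    resp = requests.post(SEMANTIC_URL, json={"query": text})
    reply_text = resp.json()["reply"]
    # Step 203: have the speech synthesis server render the reply voice,
    # which is then forwarded to the terminal device.
    return requests.post(TTS_URL, json={"text": reply_text}).content
```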
As shown in fig. 3, the present application further provides a speech processing system comprising a speech recognition server 310, a semantic server 320 and a speech synthesis server 330.
The voice recognition server 310 is configured to receive the user voice sent by the terminal device, perform voice recognition on the user voice to obtain a voice recognition result, send the voice recognition result to the semantic server 320, send a reply text returned by the semantic server 320 to the voice synthesis server 330, receive a reply voice of the reply text sent by the voice synthesis server 330, and send the reply voice to the terminal device.
In some optional implementations of this embodiment, the speech recognition server 310 is disposed in the same lan as the semantic server 320 and the speech synthesis server 330.
In some optional implementations of the embodiment, the speech recognition server 310 is further configured to send the speech recognition result to the terminal device in response to obtaining the speech recognition result.
And the voice recognition server 310 is further configured to send the reply text to the terminal device in response to receiving the reply text.
In some optional implementations of this embodiment, the terminal device is further configured to display and broadcast a preset reply sentence in response to receiving the reply voice while at least one of the voice recognition result and the reply text has not been received.
Specifically, if the terminal device receives the reply voice but not the voice recognition result and/or the reply text, it may display the text of a preset reply sentence and broadcast its voice. For example, the preset reply sentence may be "the network is not good, please try again later". This avoids displaying incomplete information and ensures the user can still hear an accurate reply sentence.
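A terminal-side sketch of this fallback follows; the callback names and the bundled preset audio are assumptions.

```python
# Sketch of the terminal device's fallback when the reply voice arrives
# but the recognition result and/or reply text did not. All names and
# the locally bundled preset audio are illustrative assumptions.

from typing import Callable, Optional

PRESET_REPLY_TEXT = "the network is not good, please try again later"
PRESET_REPLY_AUDIO = b"..."  # placeholder: audio bundled with the device

def on_reply_voice(recognition_result: Optional[str],
                   reply_text: Optional[str],
                   reply_voice: bytes,
                   show: Callable[[str], None],
                   play: Callable[[bytes], None]) -> None:
    if recognition_result is None or reply_text is None:
        # Incomplete exchange: fall back to the preset reply sentence.
        show(PRESET_REPLY_TEXT)
        play(PRESET_REPLY_AUDIO)
    else:
        show(reply_text)
        play(reply_voice)
```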
In some optional implementations of this embodiment, the semantic server is further configured to receive a text generation request, where the text generation request is sent to the semantic server by the terminal device in response to not receiving the reply text and the reply voice within a first preset time period; the text generation request includes the voice recognition result, and the first preset time period starts when the terminal device receives the voice recognition result.
Specifically, if the terminal device does not receive the reply text and the reply voice after receiving the voice recognition result, a text generation request including the voice recognition result may be transmitted to the semantic server 320. In this way, the semantic server 320 may receive the text generation request and process the speech recognition result to generate a reply text. The request here is a message requesting the semantic server 320 to generate a reply text. Thereafter, the semantic server 320 may feed back the reply text to the terminal device, and then the terminal device may send a speech synthesis request including the reply text to the speech synthesis server 330 and receive the reply speech fed back by the speech synthesis server 330.
In these implementations, the semantic server may receive the request sent by the terminal device when the reply text and the reply voice are not received, so as to ensure smooth voice interaction.
In some optional implementations of this embodiment, the speech synthesis server is further configured to receive a speech synthesis request, where the speech synthesis request is sent to the speech synthesis server by the terminal device in response to receiving the reply text but not receiving the reply voice within a second preset time period; the speech synthesis request includes the reply text, and the second preset time period starts when the terminal device receives the speech recognition result or the reply text.
Specifically, if the terminal device receives the voice recognition result and the reply text but does not receive the reply voice, it may send a voice synthesis request to the voice synthesis server 330. Thus, the speech synthesis server 330 can process the reply text, generate reply speech, and feed the reply speech back to the terminal device.
In these implementations, the terminal device can send a request to the speech synthesis server 330 when no reply voice is received, ensuring that the voice interaction proceeds smoothly.
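The two timeout fallbacks on the terminal device can be sketched together; the durations and the request helpers are illustrative assumptions.

```python
# Sketch of the terminal device's two timeout fallbacks. The timeout
# durations and the helper callbacks are illustrative assumptions.

FIRST_TIMEOUT_S = 2.0   # timed from receipt of the recognition result
SECOND_TIMEOUT_S = 2.0  # timed from receipt of the recognition result
                        # or of the reply text

def await_reply(recv_reply_text, recv_reply_voice, recognition_result,
                request_text_generation, request_speech_synthesis):
    # First preset period: if neither reply text nor reply voice arrives,
    # send a text generation request to the semantic server directly.
    reply_text = recv_reply_text(timeout=FIRST_TIMEOUT_S)
    if reply_text is None:
        reply_text = request_text_generation(recognition_result)
    # Second preset period: reply text in hand but no reply voice yet,
    # so send a speech synthesis request to the speech synthesis server.
    reply_voice = recv_reply_voice(timeout=SECOND_TIMEOUT_S)
    if reply_voice is None:
        reply_voice = request_speech_synthesis(reply_text)
    return reply_text, reply_voice
```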
In some optional implementations of this embodiment, before sending the voice recognition result to the semantic server, the voice recognition server is further configured to judge whether the voice recognition result is valid and related to the recognition result of the previous voice, and to generate a first judgment result, where the previous voice and the user voice are in the same wake-up interaction process; the voice recognition server is further configured to send the voice recognition result to the semantic server; the semantic server is further configured to judge whether the voice recognition result conforms to the preset session semantic type and generate a second judgment result; and the voice recognition server is configured to receive the second judgment result fed back by the semantic server before sending the voice recognition result to the terminal device, and to determine whether the user voice is a meaningful voice based on the first judgment result and the second judgment result.
In some optional implementations of the embodiment, the voice recognition server is further configured to send the voice recognition result to the terminal device in response to determining that the user voice is the meaningful voice.
In some optional implementations of the embodiment, the voice recognition server is further configured to determine the user voice as the meaningful voice in response to determining at least one of the first determination result and the second determination result is yes.
In some optional implementations of this embodiment, the first judgment result and the second judgment result are represented in numerical form, where the numerical value of the first judgment result represents a probability that the speech recognition result is valid and related to the recognition result of the previous voice, and the numerical value of the second judgment result represents a probability that the speech recognition result conforms to the preset session semantic type; the voice recognition server is further configured to determine the sum of the numerical value of the first judgment result and the numerical value of the second judgment result and, in response to determining that the sum is greater than or equal to a preset threshold, to determine that the user voice is a meaningful voice.
In some optional implementation manners of this embodiment, the semantic server is further configured to determine a plurality of candidate values by using a plurality of preset session semantic type models; and determining the maximum value in the plurality of candidate values as the value of the second judgment result.
In the prior art, a terminal device needs to generate a request after acquiring information and sequentially send the request to a voice recognition server, a semantic server and a voice synthesis server. In addition, the terminal device must wait for each server to feed back information to the terminal device to obtain the information, and the whole process consumes a lot of time. In contrast, the embodiment omits the above process, and performs information transmission between the servers, thereby effectively saving the processing time, and further shortening the reaction time of the terminal device when the terminal device interacts with the user.
With further reference to fig. 4, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a speech processing apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 4, the speech processing apparatus 400 of the present embodiment includes: a speech recognition unit 401, a text generation unit 402 and a feedback unit 403. The voice recognition unit 401 is configured to receive a user voice sent by the terminal device, perform voice recognition on the user voice, and obtain a voice recognition result; a text generating unit 402 configured to send the speech recognition result to the semantic server, and receive at least one reply text for the speech recognition result returned by the semantic server; a feedback unit 403 configured to send a reply text of the at least one reply text to the speech synthesis server, and forward the received reply speech sent by the speech synthesis server to the terminal device, where the reply speech is generated based on the reply text sent by the speech synthesis server.
In some embodiments, the speech recognition unit 401 of the speech processing apparatus 400 may receive the user speech transmitted by the terminal device and perform voice recognition on it to obtain a voice recognition result. Specifically, speech recognition is the process of converting speech into corresponding text; the speech recognition result here refers to the converted text.
In some embodiments, the text generation unit 402 may send the obtained speech recognition result to the semantic server and receive a reply text returned by the semantic server. The reply text here is the reply text for the above-described speech recognition result. Specifically, the semantic server may analyze and process the speech recognition result to obtain a reply text for replying to the user in the process of interacting with the user.
In some embodiments, the feedback unit 403 may send the received reply text to the speech synthesis server so that the speech synthesis server performs speech synthesis to obtain the reply voice. The feedback unit 403 may then receive the reply voice sent by the speech synthesis server and forward it to the terminal device.
In some optional implementation manners of this embodiment, the speech recognition server is disposed in the same local area network as the semantic server and the speech synthesis server.
In some optional implementations of this embodiment, the apparatus further includes: a first transmitting unit configured to transmit a voice recognition result to the terminal device in response to obtaining the voice recognition result; and the method further comprises: and a second sending unit configured to send the reply text to the terminal device in response to receiving the reply text.
In some optional implementations of this embodiment, the apparatus further includes: a judging unit configured to judge, before the voice recognition result is sent to the semantic server, whether the voice recognition result is valid and related to the recognition result of the previous voice, and to generate a first judgment result, wherein the previous voice and the user voice are in the same wake-up interaction process; and the text generation unit includes: a first sending module configured to send the voice recognition result to the semantic server so that the semantic server judges whether the voice recognition result accords with a preset session semantic type and generates a second judgment result; and the apparatus further includes: a receiving unit configured to receive the second judgment result fed back by the semantic server before the voice recognition result is sent to the terminal device, and to determine whether the user voice is a meaningful voice based on the first judgment result and the second judgment result.
In some optional implementations of this embodiment, the first sending unit includes: a second sending module configured to send the voice recognition result to the terminal device in response to determining that the user voice is a meaningful voice.
In some optional implementations of this embodiment, the receiving unit includes: a determination module configured to determine the user speech as meaningful speech in response to determining at least one of the first determination result and the second determination result is yes.
In some optional implementations of this embodiment, the first judgment result and the second judgment result are represented in numerical form, where the numerical value of the first judgment result represents a probability that the speech recognition result is valid and related to the recognition result of the previous voice, and the numerical value of the second judgment result represents a probability that the speech recognition result conforms to the preset session semantic type; and determining whether the user voice is a meaningful voice based on the first judgment result and the second judgment result includes: determining the sum of the numerical value of the first judgment result and the numerical value of the second judgment result; and, in response to determining that the sum is greater than or equal to a preset threshold, determining that the user voice is a meaningful voice.
In some optional implementation manners of this embodiment, the numerical value of the second determination result is a maximum numerical value of a plurality of candidate numerical values determined by the semantic server using the plurality of preset session semantic type models.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 5 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising a speech recognition unit, a text generation unit, and a feedback unit. The names of these units do not, in some cases, constitute a limitation on the units themselves; for example, the speech recognition unit may also be described as "a unit that receives user speech sent by the terminal device, performs speech recognition on the user speech, and obtains a speech recognition result".
As another aspect, the present application further provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into that apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: receive user speech sent by a terminal device, and perform speech recognition on the user speech to obtain a speech recognition result; send the speech recognition result to a semantic server, and receive a reply text for the speech recognition result returned by the semantic server; and send the reply text to a speech synthesis server, and forward the received reply speech sent by the speech synthesis server to the terminal device.
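For illustration only, the flow that such a program causes the apparatus to perform can be sketched in Python using the three unit names from the paragraph above; the endpoints, payload formats, and function names below are assumptions made for this sketch, not part of the disclosure:

    import json
    from urllib import request

    SEMANTIC_SERVER = "http://semantic.local/reply"   # hypothetical endpoint
    SYNTHESIS_SERVER = "http://synthesis.local/tts"   # hypothetical endpoint

    def _post_json(url, payload):
        # POST a JSON payload to a server and decode the JSON response.
        req = request.Request(url, data=json.dumps(payload).encode("utf-8"),
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def speech_recognition_unit(user_speech: bytes) -> str:
        # Placeholder for the recognition engine running on the speech
        # recognition server; returns the speech recognition result as text.
        raise NotImplementedError

    def text_generation_unit(recognition_result: str) -> str:
        # Send the speech recognition result to the semantic server and
        # receive a reply text for it.
        return _post_json(SEMANTIC_SERVER, {"text": recognition_result})["reply"]

    def feedback_unit(reply_text: str) -> bytes:
        # Send the reply text to the speech synthesis server and return the
        # reply speech it generates (hex-encoded audio is an assumption).
        return bytes.fromhex(_post_json(SYNTHESIS_SERVER, {"text": reply_text})["audio"])

    def handle_user_speech(user_speech: bytes) -> bytes:
        # End-to-end flow of the carried program: recognize the user speech,
        # obtain a reply text, synthesize it, and hand the reply speech back
        # to the caller, which forwards it to the terminal device.
        return feedback_unit(text_generation_unit(speech_recognition_unit(user_speech)))

As claim 2 below suggests, the semantic server and the speech synthesis server may sit in the same local area network as the speech recognition server, which keeps the two intermediate round trips short.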
The above description is merely a preferred embodiment of the present application and an illustration of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of the features described above, and also covers other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example, arrangements in which the above features are interchanged with (but not limited to) features having similar functions disclosed in the present application.

Claims (19)

1. A speech processing method for a speech recognition server, the method comprising:
receiving user speech sent by a terminal device, and performing speech recognition on the user speech to obtain a speech recognition result;
sending the speech recognition result to a semantic server, and receiving a reply text for the speech recognition result returned by the semantic server;
sending the reply text to a speech synthesis server, and forwarding the received reply speech sent by the speech synthesis server to the terminal device;
the method further comprises the following steps: responding to the obtained voice recognition result, and sending the voice recognition result to the terminal equipment;
wherein, before the sending of the speech recognition result to the semantic server, the method further comprises: determining whether the speech recognition result is valid and related to a recognition result of a previous speech, to generate a first determination result, wherein the previous speech and the user speech are in a same wake-up interaction process; and
the sending the voice recognition result to the semantic server includes: sending the voice recognition result to the semantic server so that the semantic server judges whether the voice recognition result accords with a preset session semantic type and generates a second judgment result; and
before the sending of the speech recognition result to the terminal device, the method further comprises: receiving the second determination result fed back by the semantic server, and determining whether the user speech is meaningful speech based on the first determination result and the second determination result.
2. The method of claim 1, wherein the speech recognition server is located within the same local area network as the semantic server and the speech synthesis server.
3. The method of claim 1, wherein the method further comprises:
in response to receiving the reply text, sending the reply text to the terminal device.
4. The method of claim 1, wherein the sending of the speech recognition result to the terminal device comprises:
in response to determining that the user speech is meaningful speech, sending the speech recognition result to the terminal device.
5. The method of claim 1, wherein the determining whether the user speech is meaningful speech based on the first determination result and the second determination result comprises:
determining that the user speech is meaningful speech in response to determining that at least one of the first determination result and the second determination result is affirmative.
6. The method according to claim 1, wherein the first determination result and the second determination result are each represented as a numerical value, the numerical value of the first determination result representing a probability that the speech recognition result is valid and related to a recognition result of a previous speech, and the numerical value of the second determination result representing a probability that the speech recognition result conforms to a preset session semantic type; and
the determining whether the user speech is meaningful speech based on the first determination result and the second determination result comprises:
determining a sum of the numerical value of the first determination result and the numerical value of the second determination result; and in response to determining that the sum is greater than or equal to a preset threshold, determining that the user speech is meaningful speech.
7. The method of claim 6, wherein the numerical value of the second determination result is a maximum value among a plurality of candidate values determined by the semantic server using a plurality of preset session semantic type models.
8. A speech processing system, comprising a speech recognition server, a semantic server, and a speech synthesis server, wherein:
the speech recognition server is configured to receive user speech sent by a terminal device, perform speech recognition on the user speech to obtain a speech recognition result, send the speech recognition result to the semantic server, send a reply text returned by the semantic server to the speech synthesis server, receive a reply speech for the reply text from the speech synthesis server, and send the reply speech to the terminal device;
the speech recognition server is further configured to, in response to obtaining the speech recognition result, send the speech recognition result to the terminal device;
the speech recognition server is further configured to, before the speech recognition result is sent to the semantic server, determine whether the speech recognition result is valid and related to a recognition result of a previous speech, and generate a first determination result, wherein the previous speech and the user speech are in a same wake-up interaction process;
the speech recognition server is further configured to send the speech recognition result to the semantic server;
the semantic server is further configured to determine whether the speech recognition result conforms to a preset session semantic type and generate a second determination result; and
the speech recognition server is further configured to, before the speech recognition result is sent to the terminal device, receive the second determination result fed back by the semantic server, and determine whether the user speech is meaningful speech based on the first determination result and the second determination result.
9. The system of claim 8, wherein the speech recognition server is located within the same local area network as the semantic server and the speech synthesis server.
10. The system of claim 8, wherein,
the speech recognition server is further configured to, in response to receiving the reply text, send the reply text to the terminal device.
11. The system according to one of claims 8-10, wherein
the semantic server is further configured to receive a text generation request, the text generation request being sent to the semantic server by the terminal device in response to not receiving the reply text and the reply speech within a first preset time period, the text generation request comprising the speech recognition result, and the first preset time period taking receipt of the speech recognition result by the terminal device as its timing starting point.
12. The system according to one of claims 8-10, wherein
the speech synthesis server is further configured to receive a speech synthesis request, the speech synthesis request being sent to the speech synthesis server by the terminal device in response to receiving the reply text but not receiving the reply speech within a second preset time period, the speech synthesis request comprising the reply text, and the second preset time period taking receipt of the speech recognition result or receipt of the reply text by the terminal device as its timing starting point.
13. The system of claim 8, wherein the speech recognition server is further configured to send the speech recognition result to the terminal device in response to determining that the user speech is meaningful speech.
14. The system of claim 8, wherein,
the speech recognition server is further configured to determine that the user speech is meaningful speech in response to determining that at least one of the first determination result and the second determination result is affirmative.
15. The system according to claim 8, wherein the first determination result and the second determination result are each represented as a numerical value, the numerical value of the first determination result representing a probability that the speech recognition result is valid and related to a recognition result of a previous speech, and the numerical value of the second determination result representing a probability that the speech recognition result conforms to a preset session semantic type; and
the speech recognition server is further configured to determine a sum of the numerical value of the first determination result and the numerical value of the second determination result, and, in response to determining that the sum is greater than or equal to a preset threshold, determine that the user speech is meaningful speech.
16. The system of claim 15, wherein,
the semantic server is further configured to determine a plurality of candidate values using a plurality of preset session semantic type models, and determine a maximum value among the plurality of candidate values as the numerical value of the second determination result.
17. A speech processing apparatus for a speech recognition server, the apparatus comprising:
a speech recognition unit configured to receive user speech sent by a terminal device, and perform speech recognition on the user speech to obtain a speech recognition result;
a text generation unit configured to send the speech recognition result to a semantic server, and receive at least one reply text for the speech recognition result returned by the semantic server;
a feedback unit configured to send a reply text of the at least one reply text to a speech synthesis server, and forward the received reply speech sent by the speech synthesis server to the terminal device, wherein the reply speech is generated by the speech synthesis server based on the reply text sent to it;
wherein the apparatus further comprises: a first sending unit configured to, in response to obtaining the speech recognition result, send the speech recognition result to the terminal device;
the apparatus further comprises: a determination unit configured to, before the speech recognition result is sent to the semantic server, determine whether the speech recognition result is valid and related to a recognition result of a previous speech, and generate a first determination result, wherein the previous speech and the user speech are in a same wake-up interaction process; the text generation unit comprises: a first sending module configured to send the speech recognition result to the semantic server, so that the semantic server determines whether the speech recognition result conforms to a preset session semantic type and generates a second determination result; and the apparatus further comprises: a receiving unit configured to, before the speech recognition result is sent to the terminal device, receive the second determination result fed back by the semantic server, and determine whether the user speech is meaningful speech based on the first determination result and the second determination result.
18. An electronic device, comprising:
one or more processors;
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
19. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
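As an illustration only (not part of the claims), the meaningful-speech decision recited in claims 5-7 and 14-16 reduces to a short computation; the threshold value and all identifiers below are assumptions:

    def second_determination_value(candidate_values):
        # Claims 7 and 16: the semantic server scores the speech recognition
        # result with a plurality of preset session semantic type models and
        # keeps the maximum candidate value as the second determination result.
        return max(candidate_values)

    def is_meaningful_speech(first_value, second_value, threshold=1.0):
        # Claims 6 and 15: first_value is the probability that the recognition
        # result is valid and related to the previous speech; second_value is
        # the probability that it conforms to a preset session semantic type.
        # The user speech is meaningful when the sum reaches a preset threshold
        # (1.0 here is an assumed value).
        return first_value + second_value >= threshold

    # Worked example: a weakly related result (0.4) that strongly matches one
    # session semantic type model (max of 0.35 and 0.7 is 0.7) is meaningful,
    # since 0.4 + 0.7 = 1.1 >= 1.0.
    assert is_meaningful_speech(0.4, second_determination_value([0.35, 0.7]))

Summing the two scores, rather than requiring both to be high, matches the boolean variant in claims 5 and 14, where a single affirmative determination result already marks the user speech as meaningful.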
CN201910563423.7A 2019-06-26 2019-06-26 Voice processing method, system and device Active CN110223694B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910563423.7A CN110223694B (en) 2019-06-26 2019-06-26 Voice processing method, system and device
CN202111108547.XA CN113823282A (en) 2019-06-26 2019-06-26 Voice processing method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910563423.7A CN110223694B (en) 2019-06-26 2019-06-26 Voice processing method, system and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202111108547.XA Division CN113823282A (en) 2019-06-26 2019-06-26 Voice processing method, system and device

Publications (2)

Publication Number Publication Date
CN110223694A (en) 2019-09-10
CN110223694B (en) 2021-10-15

Family

ID=67814866

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111108547.XA Pending CN113823282A (en) 2019-06-26 2019-06-26 Voice processing method, system and device
CN201910563423.7A Active CN110223694B (en) 2019-06-26 2019-06-26 Voice processing method, system and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202111108547.XA Pending CN113823282A (en) 2019-06-26 2019-06-26 Voice processing method, system and device

Country Status (1)

Country Link
CN (2) CN113823282A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129861A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Text-to-speech processing method, terminal and server
CN111477224A (en) * 2020-03-23 2020-07-31 一汽奔腾轿车有限公司 Human-vehicle virtual interaction system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9269354B2 (en) * 2013-03-11 2016-02-23 Nuance Communications, Inc. Semantic re-ranking of NLU results in conversational dialogue applications
CN107316643A (en) * 2017-07-04 2017-11-03 科大讯飞股份有限公司 Voice interactive method and device
CN107943834A (en) * 2017-10-25 2018-04-20 百度在线网络技术(北京)有限公司 Interactive implementation method, device, equipment and storage medium
CN108877792A (en) * 2018-05-30 2018-11-23 北京百度网讯科技有限公司 For handling method, apparatus, electronic equipment and the computer readable storage medium of voice dialogue
CN109545185A (en) * 2018-11-12 2019-03-29 百度在线网络技术(北京)有限公司 Interactive system evaluation method, evaluation system, server and computer-readable medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105529028B (en) * 2015-12-09 2019-07-30 百度在线网络技术(北京)有限公司 Speech analysis method and apparatus
CN106373569B (en) * 2016-09-06 2019-12-20 北京地平线机器人技术研发有限公司 Voice interaction device and method
CN110235087B (en) * 2017-01-20 2021-06-08 华为技术有限公司 Method and terminal for realizing voice control
CN107146618A (en) * 2017-06-16 2017-09-08 北京云知声信息技术有限公司 Method of speech processing and device

Also Published As

Publication number Publication date
CN113823282A (en) 2021-12-21
CN110223694A (en) 2019-09-10

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20211014
Address after: 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing
Patentee after: Apollo Zhilian (Beijing) Technology Co.,Ltd.
Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing
Patentee before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.