CN107943834B - Method, device, equipment and storage medium for implementing man-machine conversation


Info

Publication number
CN107943834B
Authority
CN
China
Prior art keywords
voice recognition, server, semantic understanding, voice, result
Legal status
Active
Application number
CN201711008491.4A
Other languages
Chinese (zh)
Other versions
CN107943834A
Inventor
常先堂
远超
陈怀亮
米雪
范中吉
唐海员
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Shanghai Xiaodu Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd and Shanghai Xiaodu Technology Co Ltd
Priority to CN201711008491.4A
Publication of CN107943834A
Application granted
Publication of CN107943834B

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G10L 15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • H04L 67/01: Protocols (network arrangements or protocols for supporting network services or applications)

Abstract

The invention discloses a method, an apparatus, a device and a storage medium for implementing man-machine conversation. In the method, a client acquires a user's voice data and sends it to a voice recognition server, which performs voice recognition on the data and sends the recognition result to a semantic understanding server for semantic understanding. The semantic understanding server generates reply content from the semantic understanding result; the client then acquires the voice information that a voice synthesis server generates from that reply content and broadcasts it to the user. Applying this scheme improves the response speed of voice interaction.

Description

Method, device, equipment and storage medium for implementing man-machine conversation
[ technical field ]
The present invention relates to computer application technologies, and in particular, to a method, an apparatus, a device, and a storage medium for implementing a man-machine conversation.
[ background of the invention ]
In a man-machine dialogue system, a person and a machine converse in natural language, that is, in human language. This mainly involves three processes: a speech recognition (Automatic Speech Recognition, ASR) process, a semantic understanding (Natural Language Understanding, NLU) process, and a speech synthesis (Text-To-Speech, TTS) process.
The voice recognition process is how the machine identifies what the user said; the semantic understanding process is how the machine grasps what the user actually meant. Once the machine understands the user's intention, it must produce appropriate reply content, synthesize that content into voice information, and deliver it by voice broadcast; this last step is the voice synthesis process.
In a conventional human-computer conversation system, the above three processes are usually performed in series, and each step involves one or more rounds of HyperText Transfer Protocol (HTTP) or HTTPS network communication between a client and a server.
As shown in figs. 1 to 3: fig. 1 is a schematic diagram of the conventional network communication between a client and a speech recognition server (ASR Server), fig. 2 shows the conventional network communication between a client and a semantic understanding server (NLU Server), and fig. 3 shows the conventional network communication between a client and a speech synthesis server (TTS Server).
Since multiple rounds of network communication are needed, the man-machine conversation process (from the moment the user finishes speaking to the moment the machine begins broadcasting the voice information) is slow; that is, the response speed of the voice interaction is reduced.
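To make the conventional serial flow concrete, the following Python sketch shows the three sequential client-side round trips; it is an illustration under assumptions, not code from the patent, and the URLs, JSON field names and the requests-based client are all hypothetical:

```python
import requests  # hypothetical HTTP client; the patent does not prescribe one

def conventional_dialogue(audio_bytes: bytes) -> bytes:
    # Round trip 1: client -> ASR server (speech recognition).
    text = requests.post("https://asr.example.com/recognize",
                         data=audio_bytes).json()["text"]
    # Round trip 2: client -> NLU server (semantic understanding + reply).
    reply = requests.post("https://nlu.example.com/understand",
                          json={"query": text}).json()["reply"]
    # Round trip 3: client -> TTS server (speech synthesis).
    audio = requests.post("https://tts.example.com/synthesize",
                          json={"text": reply}).content
    return audio  # voice information the client broadcasts to the user
```

Each stage blocks on a full client-to-server round trip over the public network, which is the latency the invention targets.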
[ summary of the invention ]
In view of this, the present invention provides a method, an apparatus, a device and a storage medium for implementing man-machine conversation, which can improve the response speed of voice interaction.
The specific technical scheme is as follows:
a method for implementing man-machine conversation comprises the following steps:
the method comprises the steps that a client side obtains voice data of a user, and sends the voice data to a voice recognition server, so that the voice recognition server can carry out voice recognition on the voice data, and sends a voice recognition result to a semantic understanding server for semantic understanding;
and the client acquires voice information generated by the voice synthesis server according to the acquired reply content, and broadcasts the voice information to the user, wherein the reply content is generated by the semantic understanding server according to a semantic understanding result.
According to a preferred embodiment of the present invention, the acquiring, by the client, the voice information generated by the voice synthesis server according to the acquired reply content includes:
the client acquires the voice information which is generated by the voice synthesis server according to the reply content and is sent to the client, wherein the reply content is sent to the client by the semantic understanding server through the voice recognition server and is sent to the voice synthesis server by the client;
alternatively,
and the client acquires the voice information which is generated by the voice synthesis server according to the reply content and is sent to the client through the voice recognition server, wherein the reply content is sent to the voice synthesis server by the semantic understanding server through the voice recognition server.
A method for implementing man-machine conversation comprises the following steps:
the voice recognition server acquires voice data of a user from a client;
the voice recognition server carries out voice recognition on the voice data and sends a voice recognition result to a semantic understanding server for semantic understanding;
and the voice recognition server acquires reply content generated by the semantic understanding server according to a semantic understanding result, and sends the reply content or voice information obtained according to the reply content to the client.
According to a preferred embodiment of the present invention, the voice recognition server performs voice recognition on the voice data, and sends the voice recognition result to the semantic understanding server for semantic understanding, including:
before the voice recognition server obtains a final voice recognition result, when a sending condition is met every time, sending a part of the currently obtained voice recognition result to the semantic understanding server so that the semantic understanding server can carry out semantic understanding according to the obtained part of the voice recognition result to obtain a semantic understanding result;
when the voice recognition server obtains a final voice recognition result, the final voice recognition result is sent to the semantic understanding server, so that the semantic understanding server determines whether the part of the voice recognition result obtained each time before contains the final voice recognition result, if so, the semantic understanding result corresponding to the final voice recognition result obtained before is used as a final required semantic understanding result, and if not, the semantic understanding is carried out on the final voice recognition result to obtain the final required semantic understanding result.
According to a preferred embodiment of the present invention, the sending the reply content or the voice message obtained according to the reply content to the client includes:
the voice recognition server sends the reply content to the client, so that the client can obtain voice information which is returned by the voice synthesis server and is generated according to the reply content after sending the reply content to the voice synthesis server, and the voice information is broadcasted to the user;
alternatively,
the voice recognition server sends the reply content to the voice synthesis server and acquires voice information which is returned by the voice synthesis server and is generated according to the reply content;
and the voice recognition server sends the voice information to the client so that the client can broadcast the voice information to the user.
A method for implementing man-machine conversation comprises the following steps:
the semantic understanding server acquires a voice recognition result from a voice recognition server and performs semantic understanding according to the voice recognition result, wherein the voice recognition result is obtained by performing voice recognition on voice data of a user acquired from a client by the voice recognition server;
and the semantic understanding server generates reply content according to a semantic understanding result and sends the reply content to the voice recognition server, so that the voice recognition server sends the reply content or voice information obtained according to the reply content to the client.
According to a preferred embodiment of the present invention, the semantic understanding server obtains a speech recognition result from a speech recognition server, and performing semantic understanding according to the speech recognition result includes:
the semantic understanding server carries out semantic understanding on part of the obtained voice recognition results each time to obtain semantic understanding results, wherein the part of the voice recognition results are the currently obtained part of the voice recognition results sent by the voice recognition server before the final voice recognition results are obtained and when the sending conditions are met each time;
the semantic understanding server acquires a final voice recognition result from the voice recognition server, determines whether a part of voice recognition results acquired each time before contains the final voice recognition result, if so, takes a semantic understanding result corresponding to the final voice recognition result acquired before as a final required semantic understanding result, and if not, carries out semantic understanding on the final voice recognition result to obtain the final required semantic understanding result.
A method for implementing man-machine conversation comprises the following steps:
the voice synthesis server acquires reply content which is sent by the voice recognition server and generated by the semantic understanding server according to the semantic understanding result; the semantic understanding result is obtained by the semantic understanding server by performing semantic understanding on a voice recognition result obtained from the voice recognition server, and the voice recognition result is obtained by the voice recognition server by performing voice recognition on voice data of a user obtained from a client;
and the voice synthesis server generates voice information according to the reply content and sends the voice information to the client through the voice recognition server so that the client can broadcast the voice information to the user.
An apparatus for implementing a human-computer conversation, comprising: a first processing unit and a second processing unit;
the first processing unit is used for acquiring voice data of a user, sending the voice data to a voice recognition server so that the voice recognition server can perform voice recognition on the voice data and send a voice recognition result to a semantic understanding server for semantic understanding;
and the second processing unit is used for acquiring voice information generated by the voice synthesis server according to the acquired reply content, and broadcasting the voice information to the user, wherein the reply content is generated by the semantic understanding server according to a semantic understanding result.
In accordance with a preferred embodiment of the present invention,
the second processing unit acquires the voice information generated and sent by the voice synthesis server according to the reply content, wherein the reply content is sent to the second processing unit by the semantic understanding server through the voice recognition server and sent to the voice synthesis server by the second processing unit;
alternatively,
the second processing unit acquires the voice information which is generated by the voice synthesis server according to the reply content and is sent by the voice recognition server, wherein the reply content is sent to the voice synthesis server by the semantic understanding server through the voice recognition server.
An apparatus for implementing a human-computer conversation, comprising: a third processing unit, a fourth processing unit and a fifth processing unit;
the third processing unit is used for acquiring voice data of a user from a client;
the fourth processing unit is used for carrying out voice recognition on the voice data and sending a voice recognition result to the semantic understanding server for semantic understanding;
and the fifth processing unit is used for acquiring reply content generated by the semantic understanding server according to a semantic understanding result and sending the reply content or voice information obtained according to the reply content to the client.
In accordance with a preferred embodiment of the present invention,
before the final voice recognition result is obtained, the fourth processing unit sends a currently obtained partial voice recognition result to the semantic understanding server when the sending condition is met every time, so that the semantic understanding server performs semantic understanding according to the obtained partial voice recognition result to obtain a semantic understanding result;
the fourth processing unit sends the final voice recognition result to the semantic understanding server when the final voice recognition result is obtained, so that the semantic understanding server determines whether the part of the voice recognition results obtained before each time contains the final voice recognition result, if so, the semantic understanding result corresponding to the final voice recognition result obtained before is used as a final required semantic understanding result, and if not, the semantic understanding is performed on the final voice recognition result to obtain the final required semantic understanding result.
In accordance with a preferred embodiment of the present invention,
the fifth processing unit sends the reply content to the client, so that the client can acquire the voice information which is returned by the voice synthesis server and is generated according to the reply content after sending the reply content to the voice synthesis server, and can broadcast the voice information to the user;
alternatively,
the fifth processing unit sends the reply content to the voice synthesis server and acquires voice information which is returned by the voice synthesis server and is generated according to the reply content;
and the fifth processing unit sends the voice information to the client so that the client can broadcast the voice information to the user.
An apparatus for implementing a human-computer conversation, comprising: a sixth processing unit and a seventh processing unit;
the sixth processing unit is configured to acquire a voice recognition result from a voice recognition server, and perform semantic understanding according to the voice recognition result, where the voice recognition result is obtained by the voice recognition server performing voice recognition on a user's voice data acquired from a client;
the seventh processing unit is configured to generate reply content according to a semantic understanding result, and send the reply content to the speech recognition server, so that the speech recognition server sends the reply content or speech information obtained according to the reply content to the client.
In accordance with a preferred embodiment of the present invention,
the sixth processing unit performs semantic understanding on a part of the obtained voice recognition result each time to obtain a semantic understanding result, wherein the part of the voice recognition result is a currently obtained part of the voice recognition result sent by the voice recognition server before a final voice recognition result is obtained and when a sending condition is met each time;
the sixth processing unit acquires a final voice recognition result from the voice recognition server, determines whether the part of the voice recognition result acquired each time before contains the final voice recognition result, if so, takes a semantic understanding result corresponding to the final voice recognition result acquired before as a final required semantic understanding result, and if not, performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result.
An apparatus for implementing a human-computer conversation, comprising: an eighth processing unit and a ninth processing unit;
the eighth processing unit is configured to acquire reply content sent by the voice recognition server and generated by the semantic understanding server according to the semantic understanding result; the semantic understanding result is obtained by the semantic understanding server by performing semantic understanding on a voice recognition result obtained from the voice recognition server, and the voice recognition result is obtained by the voice recognition server by performing voice recognition on voice data of a user obtained from a client;
and the ninth processing unit is used for generating voice information according to the reply content and sending the voice information to the client through the voice recognition server so that the client can broadcast the voice information to the user.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
Based on the above description, it can be seen that with the solution of the present invention, the voice recognition server performs voice recognition on the user's voice data obtained from the client and, after obtaining the voice recognition result, sends it to the semantic understanding server for semantic understanding instead of returning it to the client. Accordingly, the semantic understanding server can return the reply content generated from the semantic understanding result directly to the voice recognition server, and the voice recognition server can send the reply content directly to the voice synthesis server and forward the voice information obtained from the voice synthesis server to the client for broadcast. Compared with the prior art, network communication between the voice recognition server and the semantic understanding server, and between the voice recognition server and the voice synthesis server, thus replaces network communication between the client and the semantic understanding server and between the client and the voice synthesis server. Since server-to-server network communication is faster than client-to-server communication, the response speed of voice interaction in the man-machine conversation process is improved.
[ description of the drawings ]
Fig. 1 is a schematic diagram of a network communication method between a conventional client and a speech recognition server.
Fig. 2 is a schematic diagram of a network communication manner between a conventional client and a semantic understanding server.
Fig. 3 is a schematic diagram illustrating a network communication method between a conventional client and a speech synthesis server.
Fig. 4 is a schematic diagram of a network communication mode among the client, the speech recognition server and the semantic understanding server according to the present invention.
Fig. 5 is a schematic diagram of a network communication mode among the client, the speech recognition server, the semantic understanding server and the speech synthesis server according to the present invention.
Fig. 6 is a flowchart of a first embodiment of a method for implementing a man-machine conversation according to the present invention.
Fig. 7 is a flowchart of a method for implementing a man-machine conversation according to a second embodiment of the present invention.
Fig. 8 is a flowchart of a method for implementing a man-machine conversation according to a third embodiment of the present invention.
Fig. 9 is a flowchart of a method for implementing a man-machine conversation according to a fourth embodiment of the present invention.
Fig. 10 is a schematic structural diagram of a first embodiment of an apparatus for implementing a man-machine conversation according to the present invention.
Fig. 11 is a schematic structural diagram of a device for implementing a man-machine conversation according to a second embodiment of the present invention.
Fig. 12 is a schematic structural diagram of a device for implementing a man-machine conversation according to a third embodiment of the present invention.
Fig. 13 is a schematic structural diagram of a device for implementing a man-machine conversation according to a fourth embodiment of the present invention.
FIG. 14 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description ]
To address the problems in the prior art, the present invention provides a man-machine conversation scheme that can improve the response speed of voice interaction during man-machine conversation.
In the prior art, the voice recognition and semantic understanding processes are handled separately, with the client communicating with the voice recognition server and the semantic understanding server over separate network connections. In the scheme of the invention, the voice recognition server and the semantic understanding server are instead integrated on the server side; that is, fig. 1 and fig. 2 are combined into fig. 4, which is a schematic diagram of the network communication among the client, the voice recognition server and the semantic understanding server.
As shown in fig. 4, after acquiring the user's voice data, the client sends it to the voice recognition server. The voice recognition server performs voice recognition to obtain a voice recognition result and sends that result to the semantic understanding server instead of returning it to the client. The semantic understanding server performs semantic understanding to obtain a semantic understanding result, generates reply content from it, and sends the reply content to the voice recognition server, which then sends it to the client.
The client can then send the reply content to the voice synthesis server; the voice synthesis server generates the corresponding voice information and returns it to the client, which broadcasts it to the user.
In this processing mode, network communication between the voice recognition server and the semantic understanding server replaces network communication between the client and the semantic understanding server. Since server-to-server communication is faster than client-to-server communication, the response speed of voice interaction in the man-machine conversation process is improved. Actual test results show that this processing mode saves about 100-120 ms.
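A minimal sketch of the Fig. 4 arrangement follows, assuming a Flask-style ASR server; the framework, internal URL and JSON fields are illustrative assumptions, not the patent's implementation:

```python
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
NLU_URL = "http://nlu.internal:8080/understand"  # assumed server-to-server link

@app.route("/dialogue", methods=["POST"])
def dialogue():
    audio = request.get_data()  # user's voice data from the client
    text = recognize(audio)     # voice recognition result
    # Forward the recognition result straight to the semantic understanding
    # server instead of returning it to the client.
    reply = requests.post(NLU_URL, json={"query": text}).json()["reply"]
    return jsonify({"reply": reply})  # reply content back to the client

def recognize(audio: bytes) -> str:
    raise NotImplementedError  # placeholder for a real ASR engine
```

In this Fig. 4 mode the client still contacts the voice synthesis server itself; the further merge with the TTS server is sketched after the Fig. 5 discussion below.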
In addition, in practical applications, when the voice recognition server performs voice recognition on voice data, a streaming mode is adopted: the voice stream is recognized piece by piece.
That is, before the voice recognition server obtains the final voice recognition result, each time the sending condition is met it sends the currently obtained partial voice recognition result (partial result) to the semantic understanding server, and the semantic understanding server performs semantic understanding on the obtained partial voice recognition result to obtain a semantic understanding result.
Then, when the final voice recognition result is obtained, the voice recognition server sends it to the semantic understanding server. The semantic understanding server determines whether any previously received partial voice recognition result already matches the final voice recognition result; if so, it takes the previously obtained corresponding semantic understanding result as the final required semantic understanding result, and if not, it performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result.
That is, before the final speech recognition result is obtained, the speech recognition server may send the partial speech recognition results it has obtained to the semantic understanding server for semantic understanding. When the final voice recognition result is obtained, the voice recognition server informs the semantic understanding server, in some agreed way, that this is the final result. If the semantic understanding server determines that a previously received partial speech recognition result is identical to the final speech recognition result, the semantic understanding result corresponding to that partial result can be used directly as the final required semantic understanding result, so no extra time is spent on semantic understanding. This process may be called "predictive prefetching": the semantic understanding result is requested ahead of time, while the user is still speaking.
For example, when the user's voice data is collected, the user may finish speaking at time A while collection stops at a later time B; the partial voice recognition result obtained from the voice data up to time A is then likely to be identical to the final voice recognition result obtained from the voice data up to time B.
Actual test results show that the success rate of predictive prefetching can reach 65%-75%, and that this approach saves about 300-350 ms.
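One way predictive prefetching could be realized on the semantic understanding server is sketched below, under the assumption that results are cached keyed by the exact recognition text; the class and method names are invented for illustration and are not from the patent:

```python
class PrefetchingNLUServer:
    """Caches semantic understanding results for partial recognition texts."""

    def __init__(self):
        self._cache: dict[str, str] = {}  # partial recognition text -> result

    def on_partial_result(self, partial_text: str) -> None:
        # Understand each partial result speculatively, before the user finishes.
        if partial_text not in self._cache:
            self._cache[partial_text] = self.understand(partial_text)

    def on_final_result(self, final_text: str) -> str:
        # A cache hit means some earlier partial result equalled the final one,
        # so its semantic understanding result can be reused directly.
        cached = self._cache.pop(final_text, None)
        self._cache.clear()  # drop speculative work for this utterance
        return cached if cached is not None else self.understand(final_text)

    def understand(self, text: str) -> str:
        raise NotImplementedError  # placeholder for a real NLU engine
```

On a cache hit (the 65%-75% case above), on_final_result returns without invoking the NLU engine again, which is where the saved 300-350 ms would come from.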
As described above, after acquiring the reply content from the voice recognition server, the client may send it to the voice synthesis server; the voice synthesis server generates the corresponding voice information and returns it to the client, which broadcasts it to the user.
According to the scheme provided by the invention, after the reply content is obtained, the voice recognition server can instead send the reply content directly to the voice synthesis server rather than to the client, then obtain the voice information returned by the voice synthesis server and send that voice information to the client.
In this way, network communication between the client and the voice synthesis server is replaced by network communication between the voice recognition server and the voice synthesis server; since server-to-server communication is faster than client-to-server communication, the response speed of voice interaction in the man-machine conversation process is further improved.
In this way, the voice recognition server, the semantic understanding server and the voice synthesis server operate three-in-one. Fig. 5 is a schematic diagram of the network communication among the client, the voice recognition server, the semantic understanding server and the voice synthesis server according to the present invention. It can be seen that, apart from the network communication between the client and the voice recognition server, all remaining network communication is server-to-server.
Actual test results show that a further 100-120 ms can be saved by having the voice recognition server and the voice synthesis server communicate directly.
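A sketch of the fully merged Fig. 5 flow follows, under the assumption that the ASR server orchestrates both downstream hops; the endpoints, payloads and Flask framework are again hypothetical rather than the patent's implementation:

```python
import requests
from flask import Flask, Response, request

app = Flask(__name__)
NLU_URL = "http://nlu.internal:8080/understand"   # assumed internal endpoints
TTS_URL = "http://tts.internal:8080/synthesize"

@app.route("/dialogue", methods=["POST"])
def dialogue():
    text = recognize(request.get_data())          # speech recognition
    # Server-to-server hop 1: semantic understanding and reply generation.
    reply = requests.post(NLU_URL, json={"query": text}).json()["reply"]
    # Server-to-server hop 2: speech synthesis; the client never contacts
    # the NLU or TTS server directly.
    audio = requests.post(TTS_URL, json={"text": reply}).content
    return Response(audio, mimetype="audio/wav")  # client broadcasts this

def recognize(audio: bytes) -> str:
    raise NotImplementedError  # placeholder for a real ASR engine
```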
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 6 is a flowchart of a first embodiment of a method for implementing a man-machine conversation according to the present invention. As shown in fig. 6, the following detailed implementation is included.
In 601, the client acquires voice data of the user, sends the voice data to the voice recognition server so that the voice recognition server performs voice recognition on the voice data, and sends a voice recognition result to the semantic understanding server for semantic understanding.
The client sends the voice data of the user to the voice recognition server for voice recognition, and the voice recognition server directly sends the voice recognition result to the semantic understanding server so that the semantic understanding server can carry out semantic understanding on the voice recognition result.
In 602, the client acquires the voice information generated by the voice synthesis server according to the acquired reply content, and broadcasts the voice information to the user, where the reply content is generated by the semantic understanding server according to the semantic understanding result.
After the semantic understanding server obtains the semantic understanding result, reply content aiming at the voice data of the user can be generated according to the prior art, and the reply content is sent to the voice recognition server.
The voice recognition server can adopt either of two processing modes: one is to send the reply content to the client; the other is to send it to the voice synthesis server.
If the voice recognition server sends the reply content to the client, the client can further send the reply content to the voice synthesis server and acquire the voice information which is generated by the voice synthesis server according to the reply content and sent to the client.
If the voice recognition server sends the reply content to the voice synthesis server, the voice synthesis server can return the voice information to the voice recognition server after generating the voice information, and the voice recognition server sends the voice information to the client.
Fig. 7 is a flowchart of a method for implementing a man-machine conversation according to a second embodiment of the present invention. As shown in fig. 7, the following detailed implementation is included.
In 701, the voice recognition server acquires a user's voice data from the client.
At 702, the voice recognition server performs voice recognition on the acquired voice data and sends the voice recognition result to the semantic understanding server for semantic understanding.
Before the voice recognition server obtains the final voice recognition result, when the voice recognition server meets the sending condition every time, the voice recognition server can send the currently obtained partial voice recognition result to the semantic understanding server, so that the semantic understanding server can carry out semantic understanding according to the obtained partial voice recognition result to obtain a semantic understanding result.
What exactly constitutes meeting the sending condition can be determined according to actual needs. For example, as described above, the voice recognition server recognizes voice data in a streaming mode, recognizing the voice stream piece by piece; it may then send the currently obtained partial voice recognition result to the semantic understanding server each time another piece has been recognized. Here the currently obtained partial voice recognition result means all of the recognition text obtained so far, which is usually incomplete relative to the final voice recognition result and is therefore called a partial voice recognition result.
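Under one possible reading of the sending condition (each newly recognized piece triggers a send; this is an assumption, since the patent leaves the condition to actual needs), the streaming loop could look like the sketch below, reusing the hypothetical PrefetchingNLUServer interface sketched earlier:

```python
def stream_recognition(audio_chunks, recognizer, nlu_server):
    accumulated = ""  # all recognition text obtained so far
    for chunk in audio_chunks:
        accumulated += recognizer.recognize_chunk(chunk)  # hypothetical engine API
        # Sending condition met: another piece has been recognized, so push the
        # currently obtained partial voice recognition result (full text so far).
        nlu_server.on_partial_result(accumulated)
    # Explicitly mark the final voice recognition result for the NLU server.
    return nlu_server.on_final_result(accumulated)
```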
The semantic understanding server can carry out semantic understanding on part of the voice recognition results acquired each time, so that semantic understanding results are obtained.
Then, when the voice recognition server obtains the final voice recognition result, it sends the final result to the semantic understanding server and informs it, in some agreed way, that this is the final result. Accordingly, the semantic understanding server can determine whether any previously received partial voice recognition result already matches the final voice recognition result; if so, it takes the previously obtained corresponding semantic understanding result as the final required semantic understanding result, and if not, it performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result.
For example:
Suppose the speech recognition server has sent two partial speech recognition results, a and b, to the semantic understanding server. If the final speech recognition result sent to the semantic understanding server is identical to partial result b, the semantic understanding server can take the semantic understanding result corresponding to b as the final required semantic understanding result. If the final speech recognition result differs from both a and b, the semantic understanding server must perform semantic understanding on the final result to obtain the final required semantic understanding result.
In 703, the speech recognition server obtains reply content generated by the semantic understanding server according to the semantic understanding result, and sends the reply content or the speech information obtained according to the reply content to the client.
The semantic understanding server can generate reply content according to the finally needed semantic understanding result according to the prior art, and send the reply content to the voice recognition server.
The voice recognition server can send the reply content to the client, so that the client, after sending the reply content to the voice synthesis server, obtains the voice information generated from it and broadcasts that voice information to the user. Alternatively, the voice recognition server may send the reply content to the voice synthesis server itself, obtain the returned voice information, and forward it to the client.
Fig. 8 is a flowchart of a method for implementing a man-machine conversation according to a third embodiment of the present invention. As shown in fig. 8, the following detailed implementation is included.
In 801, the semantic understanding server obtains a voice recognition result from the voice recognition server and performs semantic understanding according to it, wherein the voice recognition result is obtained by the voice recognition server performing voice recognition on a user's voice data acquired from the client.
The voice recognition server acquires voice data of a user from the client, and performs voice recognition on the voice data to obtain a voice recognition result.
Before the voice recognition server obtains the final voice recognition result, when the voice recognition server meets the sending condition each time, the voice recognition server can send a part of the currently obtained voice recognition result to the semantic understanding server, and correspondingly, the semantic understanding server can carry out semantic understanding on the part of the voice recognition result obtained each time to obtain a semantic understanding result.
Then, when the speech recognition server obtains the final speech recognition result, it can send the final result to the semantic understanding server and inform it, in some agreed way, that this is the final result. Accordingly, the semantic understanding server can determine whether any previously received partial speech recognition result already matches the final speech recognition result; if so, it takes the previously obtained corresponding semantic understanding result as the final required semantic understanding result, and if not, it performs semantic understanding on the final speech recognition result to obtain the final required semantic understanding result.
In 802, the semantic understanding server generates reply content according to the semantic understanding result and sends the reply content to the voice recognition server, so that the voice recognition server sends the reply content or the voice information obtained according to the reply content to the client.
The semantic understanding server can generate reply content according to the finally needed semantic understanding result according to the prior art, and send the reply content to the voice recognition server.
The voice recognition server can then send the reply content to the client, so that the client, after sending the reply content to the voice synthesis server, obtains the voice information generated from it and broadcasts that voice information to the user. Alternatively, the voice recognition server may send the reply content to the voice synthesis server itself, obtain the returned voice information, and forward it to the client.
Fig. 9 is a flowchart of a method for implementing a man-machine conversation according to a fourth embodiment of the present invention. As shown in fig. 9, the following detailed implementation is included.
In 901, the speech synthesis server obtains the reply content sent by the speech recognition server and generated by the semantic understanding server according to the semantic understanding result; the semantic understanding result is obtained by the semantic understanding server performing semantic understanding on the voice recognition result obtained from the voice recognition server, and the voice recognition result is obtained by the voice recognition server performing voice recognition on a user's voice data acquired from the client.
At 902, the voice synthesis server generates voice information according to the reply content and sends the voice information to the client through the voice recognition server, so that the client broadcasts the voice information to the user.
After the voice recognition server acquires the reply content from the semantic understanding server, it can send the reply content directly to the voice synthesis server; the voice synthesis server generates the corresponding voice information and sends it to the voice recognition server, which then sends it to the client.
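For completeness, a minimal sketch of the speech synthesis server's side of this embodiment follows; the framework, endpoint and synthesize() engine call are assumptions for illustration:

```python
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/synthesize", methods=["POST"])
def synthesize_reply():
    reply_text = request.get_json()["text"]  # reply content relayed by the ASR server
    audio = synthesize(reply_text)
    # Returned to the voice recognition server, which forwards it to the client.
    return Response(audio, mimetype="audio/wav")

def synthesize(text: str) -> bytes:
    raise NotImplementedError  # placeholder for a real TTS engine
```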
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In short, with the schemes of the above method embodiments, network communication between the speech recognition server and the semantic understanding server, and between the speech recognition server and the speech synthesis server, replaces network communication between the client and the semantic understanding server and between the client and the speech synthesis server. Since server-to-server network communication is faster than client-to-server communication, the response speed of voice interaction in the man-machine conversation process is improved; in addition, predictive prefetching saves the time otherwise required for semantic understanding, further improving the response speed.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 10 is a schematic structural diagram of a first embodiment of an apparatus for implementing a man-machine conversation according to the present invention. As shown in fig. 10, includes: a first processing unit 1001 and a second processing unit 1002.
The first processing unit 1001 is configured to acquire voice data of a user, send the voice data to a voice recognition server, so that the voice recognition server performs voice recognition on the voice data, and send a voice recognition result to a semantic understanding server for semantic understanding.
The second processing unit 1002 is configured to acquire voice information generated by the voice synthesis server according to the acquired reply content, and broadcast the voice information to the user, where the reply content is generated by the semantic understanding server according to the semantic understanding result.
The second processing unit 1002 may obtain the voice information generated and sent by the voice synthesis server according to the reply content, where the reply content is sent to the second processing unit 1002 by the semantic understanding server through the voice recognition server, and sent to the voice synthesis server by the second processing unit 1002.
Alternatively, the second processing unit 1002 may acquire the voice information that the voice synthesis server generates and transmits through the voice recognition server according to the reply content that the semantic understanding server transmits to the voice synthesis server through the voice recognition server.
Fig. 11 is a schematic structural diagram of a device for implementing a man-machine conversation according to a second embodiment of the present invention. As shown in fig. 11, includes: a third processing unit 1101, a fourth processing unit 1102 and a fifth processing unit 1103.
A third processing unit 1101, configured to acquire a user's voice data from the client.
The fourth processing unit 1102 is configured to perform speech recognition on the speech data, and send a speech recognition result to the semantic understanding server for semantic understanding.
A fifth processing unit 1103, configured to acquire reply content generated by the semantic understanding server according to the semantic understanding result, and send the reply content or voice information obtained according to the reply content to the client.
Before the final voice recognition result is obtained, when the sending condition is met each time, the fourth processing unit 1102 may send a part of the currently obtained voice recognition result to the semantic understanding server, so that the semantic understanding server performs semantic understanding according to the part of the obtained voice recognition result to obtain a semantic understanding result.
When the final voice recognition result is obtained, the fourth processing unit 1102 may send the final voice recognition result to the semantic understanding server, so that the semantic understanding server determines whether the part of the voice recognition results obtained each time before already includes the final voice recognition result, if so, the semantic understanding result corresponding to the final voice recognition result obtained before is used as the final required semantic understanding result, and if not, the semantic understanding is performed on the final voice recognition result to obtain the final required semantic understanding result.
The fifth processing unit 1103 may send the reply content to the client, so that after the client sends the reply content to the voice synthesis server, the client obtains the voice information returned by the voice synthesis server and generated according to the reply content, and broadcasts the voice information to the user.
Alternatively, the fifth processing unit 1103 may also send the reply content to the speech synthesis server, and obtain the speech information generated according to the reply content and returned by the speech synthesis server, and then the fifth processing unit 1103 may send the speech information to the client, so that the client broadcasts the speech information to the user.
Fig. 12 is a schematic structural diagram of a device for implementing a man-machine conversation according to a third embodiment of the present invention. As shown in fig. 12, includes: a sixth processing unit 1201 and a seventh processing unit 1202.
A sixth processing unit 1201, configured to acquire a speech recognition result from the speech recognition server, and perform semantic understanding according to the speech recognition result, where the speech recognition result is obtained by the speech recognition server performing speech recognition on a user's voice data acquired from the client.
And a seventh processing unit 1202, configured to generate a reply content according to the semantic understanding result, and send the reply content to the speech recognition server, so that the speech recognition server sends the reply content or the speech information obtained according to the reply content to the client.
The sixth processing unit 1201 may perform semantic understanding on the partial voice recognition result obtained each time to obtain a semantic understanding result, where the partial voice recognition result is the currently obtained partial voice recognition result sent by the voice recognition server before the final voice recognition result is obtained and when the sending condition is met each time.
The sixth processing unit 1201 may further obtain a final speech recognition result from the speech recognition server and determine whether any previously obtained partial speech recognition result already matches the final speech recognition result; if so, it takes the previously obtained corresponding semantic understanding result as the final required semantic understanding result, and if not, it performs semantic understanding on the final speech recognition result to obtain the final required semantic understanding result.
Fig. 13 is a schematic structural diagram of a device for implementing a man-machine conversation according to a fourth embodiment of the present invention. As shown in fig. 13, includes: an eighth processing unit 1301 and a ninth processing unit 1302.
An eighth processing unit 1301, configured to acquire the reply content sent by the voice recognition server and generated by the semantic understanding server according to the semantic understanding result; the semantic understanding result is obtained by the semantic understanding server performing semantic understanding on the voice recognition result obtained from the voice recognition server, and the voice recognition result is obtained by the voice recognition server performing voice recognition on a user's voice data acquired from the client.
The ninth processing unit 1302 is configured to generate voice information according to the reply content, and send the voice information to the client through the voice recognition server, so that the client broadcasts the voice information to the user.
For the specific work flow of the above device embodiments, please refer to the related description of the above method embodiments, and further description is omitted.
In summary, with the solutions described in the above apparatus embodiments, network communication between the speech recognition server and the semantic understanding server, and between the speech recognition server and the speech synthesis server, replaces network communication between the client and the semantic understanding server and between the client and the speech synthesis server. Since server-to-server network communication is faster than client-to-server communication, the response speed of voice interaction in the man-machine conversation process is improved.
FIG. 14 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 14 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 14, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 14, and commonly referred to as a "hard drive"). Although not shown in FIG. 14, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 14, the network adapter 20 communicates with the other modules of the computer system/server 12 via the bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
By running the programs stored in the memory 28, the processor 16 performs various functional applications and data processing, for example implementing the methods in the embodiments shown in FIG. 6, 7, 8 or 9.
The invention also discloses a computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the method in the embodiments shown in FIG. 6, 7, 8 or 9.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, methods and the like can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only one kind of logical functional division, and other divisions may be used in actual implementation.
The units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware, or in the form of hardware plus a software functional unit.
An integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (20)

1. A method for implementing a man-machine conversation, comprising:
the method comprises the steps that a client acquires voice data of a user and sends the voice data to a voice recognition server, so that the voice recognition server performs voice recognition on the voice data and sends a voice recognition result to a semantic understanding server for semantic understanding; wherein the voice recognition server is further configured to: before a final voice recognition result is obtained, send the currently obtained partial voice recognition result to the semantic understanding server each time a sending condition is met, the currently obtained partial voice recognition result being all recognition results obtained so far, so that the semantic understanding server performs semantic understanding according to the obtained partial voice recognition result to obtain a semantic understanding result; and, when the final voice recognition result is obtained, send the final voice recognition result to the semantic understanding server, so that the semantic understanding server determines whether the final voice recognition result is contained in the previously obtained partial voice recognition results, and, if so, takes the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the client acquires voice information generated by the voice synthesis server according to the acquired reply content, and broadcasts the voice information to the user, wherein the reply content is generated by the semantic understanding server according to a semantic understanding result.
2. The method of claim 1,
the client acquires the voice information generated by the voice synthesis server according to the acquired reply content, and the method comprises the following steps:
the client acquires the voice information which is generated by the voice synthesis server according to the reply content and is sent to the client, wherein the reply content is sent to the client by the semantic understanding server through the voice recognition server and is sent to the voice synthesis server by the client;
or, alternatively,
the client acquires the voice information which is generated by the voice synthesis server according to the reply content and sent to the client through the voice recognition server, wherein the reply content is sent to the voice synthesis server by the semantic understanding server through the voice recognition server.
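For illustration only, the two reply-routing options recited in claim 2 may be sketched as follows; the names Client, TtsServer, receive_reply and receive_audio are assumptions made for this sketch, not elements of the claim.

    # Hypothetical sketch of the two reply-routing options; names are illustrative.

    class TtsServer:
        def synthesize(self, reply: str) -> bytes:
            return reply.encode("utf-8")          # stand-in for real synthesis

    class Client:
        def __init__(self, tts: TtsServer) -> None:
            self.tts = tts
            self.audio = b""

        def receive_reply(self, reply: str) -> None:
            # Option 1: the reply content reaches the client via the voice
            # recognition server, and the client queries the TTS server itself.
            self.audio = self.tts.synthesize(reply)

        def receive_audio(self, audio: bytes) -> None:
            # Option 2: the voice recognition server has already sent the reply
            # to the TTS server and relays the synthesized audio to the client.
            self.audio = audio

    tts = TtsServer()
    client = Client(tts)
    client.receive_reply("It is sunny today.")      # routing option 1
    client.receive_audio(tts.synthesize("Hello."))  # routing option 2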
3. A method for implementing a man-machine conversation, comprising:
the voice recognition server acquires voice data of a user from a client;
the voice recognition server performs voice recognition on the voice data and sends a voice recognition result to a semantic understanding server for semantic understanding, which comprises: before a final voice recognition result is obtained, sending the currently obtained partial voice recognition result to the semantic understanding server each time a sending condition is met, the currently obtained partial voice recognition result being all recognition results obtained so far, so that the semantic understanding server performs semantic understanding according to the obtained partial voice recognition result to obtain a semantic understanding result; and, when the final voice recognition result is obtained, sending the final voice recognition result to the semantic understanding server, so that the semantic understanding server determines whether the final voice recognition result is contained in the previously obtained partial voice recognition results, and, if so, takes the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the voice recognition server acquires reply content generated by the semantic understanding server according to a semantic understanding result, and sends the reply content or voice information obtained according to the reply content to the client.
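For illustration only, the incremental sending recited in claim 3 may be sketched as follows. The toy decoder, the "every N chunks" sending condition and the NLU interface are assumptions made for this sketch; the claim itself leaves the sending condition unspecified.

    # Hedged sketch of the incremental sending; every name here is illustrative.

    from typing import List

    class NluStub:
        def understand_partial(self, text: str) -> None:
            pass                                  # placeholder, see claim 5

        def finalize(self, text: str) -> None:
            pass                                  # placeholder for the reuse check

    def decode(chunk: bytes) -> str:
        return chunk.decode("utf-8")              # stand-in for real decoding

    def run_recognition(chunks: List[bytes], nlu: NluStub, send_every: int = 2) -> str:
        # Each partial result is cumulative: all recognition obtained so far.
        words: List[str] = []
        for i, chunk in enumerate(chunks, start=1):
            words.append(decode(chunk))
            if i % send_every == 0:               # assumed sending condition
                nlu.understand_partial(" ".join(words))
        final_text = " ".join(words)
        nlu.finalize(final_text)                  # lets the NLU server check for reuse
        return final_text

    text = run_recognition([b"turn", b"on", b"the", b"light"], NluStub())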
4. The method of claim 3,
the sending the reply content or the voice message obtained according to the reply content to the client comprises:
the voice recognition server sends the reply content to the client, so that the client, after sending the reply content to the voice synthesis server, acquires the voice information generated according to the reply content and returned by the voice synthesis server, and broadcasts the voice information to the user;
or, alternatively,
the voice recognition server sends the reply content to the voice synthesis server and acquires voice information which is returned by the voice synthesis server and is generated according to the reply content;
and the voice recognition server sends the voice information to the client so that the client can broadcast the voice information to the user.
5. A method for implementing a man-machine conversation, comprising:
the semantic understanding server acquires a voice recognition result from a voice recognition server, wherein the voice recognition result is obtained by the voice recognition server performing voice recognition on voice data of a user acquired from a client, and performs semantic understanding according to the voice recognition result, which comprises: before a final voice recognition result is obtained, performing semantic understanding on the partial voice recognition result acquired from the voice recognition server each time, the partial voice recognition result being the currently obtained partial voice recognition result sent by the voice recognition server each time a sending condition is met and consisting of all recognition results obtained so far; and, upon acquiring the final voice recognition result from the voice recognition server, determining whether the final voice recognition result is contained in the previously acquired partial voice recognition results, and, if so, taking the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performing semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the semantic understanding server generates reply content according to a semantic understanding result and sends the reply content to the voice recognition server, so that the voice recognition server sends the reply content or voice information obtained according to the reply content to the client.
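For illustration only, the reuse logic recited in claim 5 may be sketched as follows: each cumulative partial result is understood as it arrives and cached, and the cached result is reused when the final recognition result equals an earlier partial one. The cache layout and all names are assumptions made for this sketch.

    # Hedged sketch of the reuse check; names and cache layout are illustrative.

    from typing import Dict

    class SemanticServer:
        def __init__(self) -> None:
            self._cache: Dict[str, str] = {}      # partial text -> semantic result

        def understand_partial(self, partial_text: str) -> None:
            # Understand each cumulative partial result as it arrives.
            self._cache[partial_text] = self._understand(partial_text)

        def finalize(self, final_text: str) -> str:
            # Reuse a cached result if an earlier partial result already equals
            # the final recognition result; otherwise understand it afresh.
            if final_text in self._cache:
                return self._cache[final_text]
            return self._understand(final_text)

        def _understand(self, text: str) -> str:
            return "intent(" + text + ")"         # stand-in for a real NLU model

    nlu = SemanticServer()
    nlu.understand_partial("turn on")
    nlu.understand_partial("turn on the light")
    result = nlu.finalize("turn on the light")    # hits the cache, no recomputation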
6. A method for implementing a man-machine conversation, comprising:
the voice synthesis server acquires reply content which is sent by the voice recognition server and generated by the semantic understanding server according to a semantic understanding result; the semantic understanding result is obtained by the semantic understanding server performing semantic understanding on a voice recognition result acquired from the voice recognition server, and the voice recognition result is obtained by the voice recognition server performing voice recognition on voice data of a user acquired from a client; wherein the semantic understanding result is obtained by the semantic understanding server in the following way: before a final voice recognition result is obtained, performing semantic understanding on the partial voice recognition result acquired from the voice recognition server each time, the partial voice recognition result being the currently obtained partial voice recognition result sent by the voice recognition server each time a sending condition is met and consisting of all recognition results obtained so far; upon acquiring the final voice recognition result from the voice recognition server, determining whether the final voice recognition result is contained in the previously acquired partial voice recognition results, and, if so, taking the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performing semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the voice synthesis server generates voice information according to the reply content and sends the voice information to the client through the voice recognition server so that the client can broadcast the voice information to the user.
7. A device for implementing a man-machine conversation, comprising: a first processing unit and a second processing unit;
the first processing unit is used for acquiring voice data of a user and sending the voice data to a voice recognition server, so that the voice recognition server performs voice recognition on the voice data and sends a voice recognition result to a semantic understanding server for semantic understanding; wherein the voice recognition server is further configured to: before a final voice recognition result is obtained, send the currently obtained partial voice recognition result to the semantic understanding server each time a sending condition is met, the currently obtained partial voice recognition result being all recognition results obtained so far, so that the semantic understanding server performs semantic understanding according to the obtained partial voice recognition result to obtain a semantic understanding result; and, when the final voice recognition result is obtained, send the final voice recognition result to the semantic understanding server, so that the semantic understanding server determines whether the final voice recognition result is contained in the previously obtained partial voice recognition results, and, if so, takes the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the second processing unit is used for acquiring voice information generated by the voice synthesis server according to the acquired reply content, and broadcasting the voice information to the user, wherein the reply content is generated by the semantic understanding server according to a semantic understanding result.
8. The device for implementing a man-machine conversation according to claim 7,
the second processing unit acquires the voice information which is generated by the voice synthesis server according to the reply content and sent to the second processing unit, wherein the reply content is sent to the second processing unit by the semantic understanding server through the voice recognition server and is sent to the voice synthesis server by the second processing unit;
or, alternatively,
the second processing unit acquires the voice information which is generated by the voice synthesis server according to the reply content and is sent by the voice recognition server, wherein the reply content is sent to the voice synthesis server by the semantic understanding server through the voice recognition server.
9. A device for implementing a man-machine conversation, comprising: a third processing unit, a fourth processing unit and a fifth processing unit;
the third processing unit is used for acquiring voice data of a user from a client;
the fourth processing unit is configured to perform voice recognition on the voice data and send a voice recognition result to a semantic understanding server for semantic understanding, which comprises: before a final voice recognition result is obtained, sending the currently obtained partial voice recognition result to the semantic understanding server each time a sending condition is met, the currently obtained partial voice recognition result being all recognition results obtained so far, so that the semantic understanding server performs semantic understanding according to the obtained partial voice recognition result to obtain a semantic understanding result; and, when the final voice recognition result is obtained, sending the final voice recognition result to the semantic understanding server, so that the semantic understanding server determines whether the final voice recognition result is contained in the previously obtained partial voice recognition results, and, if so, takes the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performs semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the fifth processing unit is used for acquiring reply content generated by the semantic understanding server according to a semantic understanding result and sending the reply content or voice information obtained according to the reply content to the client.
10. The device for implementing a man-machine conversation according to claim 9,
the fifth processing unit sends the reply content to the client, so that the client, after sending the reply content to the voice synthesis server, acquires the voice information generated according to the reply content and returned by the voice synthesis server, and broadcasts the voice information to the user;
or, alternatively,
the fifth processing unit sends the reply content to the voice synthesis server and acquires voice information which is returned by the voice synthesis server and is generated according to the reply content;
and the fifth processing unit sends the voice information to the client so that the client can broadcast the voice information to the user.
11. A device for implementing a man-machine conversation, comprising: a sixth processing unit and a seventh processing unit;
the sixth processing unit is configured to acquire a voice recognition result from a voice recognition server, the voice recognition result being obtained by the voice recognition server performing voice recognition on voice data of a user acquired from a client, and to perform semantic understanding according to the voice recognition result, which comprises: before a final voice recognition result is obtained, performing semantic understanding on the partial voice recognition result acquired from the voice recognition server each time, the partial voice recognition result being the currently obtained partial voice recognition result sent by the voice recognition server each time a sending condition is met and consisting of all recognition results obtained so far; and, upon acquiring the final voice recognition result from the voice recognition server, determining whether the final voice recognition result is contained in the previously acquired partial voice recognition results, and, if so, taking the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performing semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
the seventh processing unit is configured to generate reply content according to a semantic understanding result, and send the reply content to the speech recognition server, so that the speech recognition server sends the reply content or speech information obtained according to the reply content to the client.
12. A device for implementing a man-machine conversation, comprising: an eighth processing unit and a ninth processing unit;
the eighth processing unit is configured to acquire reply content which is sent by the voice recognition server and generated by the semantic understanding server according to a semantic understanding result; the semantic understanding result is obtained by the semantic understanding server performing semantic understanding on a voice recognition result acquired from the voice recognition server, and the voice recognition result is obtained by the voice recognition server performing voice recognition on voice data of a user acquired from a client; wherein the semantic understanding result is obtained by the semantic understanding server in the following way: before a final voice recognition result is obtained, performing semantic understanding on the partial voice recognition result acquired from the voice recognition server each time, the partial voice recognition result being the currently obtained partial voice recognition result sent by the voice recognition server each time a sending condition is met and consisting of all recognition results obtained so far; upon acquiring the final voice recognition result from the voice recognition server, determining whether the final voice recognition result is contained in the previously acquired partial voice recognition results, and, if so, taking the semantic understanding result corresponding to the final voice recognition result as the final required semantic understanding result or, if not, performing semantic understanding on the final voice recognition result to obtain the final required semantic understanding result;
and the ninth processing unit is used for generating voice information according to the reply content and sending the voice information to the client through the voice recognition server so that the client can broadcast the voice information to the user.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1-2.
14. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the method of any one of claims 1-2.
15. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 3-4.
16. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the method of any one of claims 3-4.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method as claimed in claim 5 when executing the program.
18. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method as claimed in claim 5.
19. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method as claimed in claim 6 when executing the program.
20. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method as claimed in claim 6.
CN201711008491.4A 2017-10-25 2017-10-25 Method, device, equipment and storage medium for implementing man-machine conversation Active CN107943834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711008491.4A CN107943834B (en) 2017-10-25 2017-10-25 Method, device, equipment and storage medium for implementing man-machine conversation

Publications (2)

Publication Number Publication Date
CN107943834A CN107943834A (en) 2018-04-20
CN107943834B true CN107943834B (en) 2021-06-11

Family

ID=61936489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711008491.4A Active CN107943834B (en) 2017-10-25 2017-10-25 Method, device, equipment and storage medium for implementing man-machine conversation

Country Status (1)

Country Link
CN (1) CN107943834B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637519B (en) * 2018-11-13 2020-01-21 百度在线网络技术(北京)有限公司 Voice interaction implementation method and device, computer equipment and storage medium
CN111524508A (en) * 2019-02-03 2020-08-11 上海蔚来汽车有限公司 Voice conversation system and voice conversation implementation method
CN113823282A (en) * 2019-06-26 2021-12-21 百度在线网络技术(北京)有限公司 Voice processing method, system and device
CN111883120A (en) * 2020-07-15 2020-11-03 百度在线网络技术(北京)有限公司 Earphone electric quantity prompting method and device, electronic equipment and storage medium
CN112700769A (en) * 2020-12-26 2021-04-23 科大讯飞股份有限公司 Semantic understanding method, device, equipment and computer readable storage medium
CN114822540A (en) * 2022-06-29 2022-07-29 广州小鹏汽车科技有限公司 Vehicle voice interaction method, server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7925507B2 (en) * 2006-07-07 2011-04-12 Robert Bosch Corporation Method and apparatus for recognizing large list of proper names in spoken dialog systems

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002073080A (en) * 2000-09-01 2002-03-12 Fujitsu Ten Ltd Voice interactive system
CN103000052A (en) * 2011-09-16 2013-03-27 上海先先信息科技有限公司 Man-machine interactive spoken dialogue system and realizing method thereof
CN104679472A (en) * 2015-02-13 2015-06-03 百度在线网络技术(北京)有限公司 Man-machine voice interactive method and device
CN107016070A (en) * 2017-03-22 2017-08-04 北京光年无限科技有限公司 A kind of interactive method and device for intelligent robot
CN107170446A (en) * 2017-05-19 2017-09-15 深圳市优必选科技有限公司 Semantic processes server and the method for semantic processes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Support vector machines employing cross-correlation for emotional speech recognition; Suryannarayana Chandaka; Measurement; 2008-11-01; Vol. 42, No. 4; full text *
Research on novel human-computer interaction technologies for mobile intelligent terminals; Pang Shuning; Information and Communications Technology and Policy; 2013-05-31; Vol. 20, No. 5; full text *

Also Published As

Publication number Publication date
CN107943834A (en) 2018-04-20

Similar Documents

Publication Publication Date Title
CN107943834B (en) Method, device, equipment and storage medium for implementing man-machine conversation
CN109637519B (en) Voice interaction implementation method and device, computer equipment and storage medium
EP3451329B1 (en) Interface intelligent interaction control method, apparatus and system, and storage medium
CN110069608B (en) Voice interaction method, device, equipment and computer storage medium
CN107808670B (en) Voice data processing method, device, equipment and storage medium
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
US10748531B2 (en) Management layer for multiple intelligent personal assistant services
CN109002510B (en) Dialogue processing method, device, equipment and medium
CN108683937B (en) Voice interaction feedback method and system for smart television and computer readable medium
US20190235833A1 (en) Method and system based on speech and augmented reality environment interaction
WO2020177734A1 (en) App triggering method, computer device, and storage medium
US10665225B2 (en) Speaker adaption method and apparatus, and storage medium
CN108564944B (en) Intelligent control method, system, equipment and storage medium
CN109785829B (en) Customer service assisting method and system based on voice control
CN103514882A (en) Voice identification method and system
US20140142945A1 (en) Application Services Interface to ASR
US8868419B2 (en) Generalizing text content summary from speech content
WO2019000881A1 (en) Method, apparatus and device for navigation, and computer-readable storage medium
CN111241043A (en) Multimedia file sharing method, terminal and storage medium
US20230036600A1 (en) Method and apparatus for training acoustic network model, and electronic device
EP3882909B1 (en) Speech output method and apparatus, device and medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN110944015B (en) Audio remote transmission method, device, server and storage medium
EP4139920B1 (en) Text-based echo cancellation
CN109036379B (en) Speech recognition method, apparatus and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210510

Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Applicant after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

GR01 Patent grant