CN111524508A - Voice conversation system and voice conversation implementation method - Google Patents
- Publication number
- CN111524508A (application CN201910108497.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- data
- server
- client
- text data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/16—Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
- H04L69/161—Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
- H04L69/162—Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The invention relates to a voice conversation implementation method and a voice conversation system. The method realizes a voice conversation between a client and a server and comprises the following steps: a first transmission step of transmitting voice data from the client to the server; a conversion step in which the server performs speech recognition and semantic understanding on the voice data and generates text data; and a second transmission step of transmitting the text data from the server to the client. According to the invention, the client needs only a single communication with the server for the voice data to be recognized and semantically understood, and speech recognition accuracy in specific scenes can be improved.
Description
Technical Field
The present invention relates to human-machine interaction technology, and in particular to a voice dialog system and a voice dialog implementation method.
Background
NLU (natural language understanding) and ASR (automatic speech recognition) are important components of a dialog system: ASR converts the user's speech input into text, and NLU semantically understands that text and recognizes the user's intention, so that the corresponding task can be performed and a spoken response given.
In the prior art, the NLU and ASR functions are independent of each other, each provided as a separate module. Fig. 5 is a block diagram of the architecture of a current voice dialog system.
As shown in fig. 5, the communication process of the current voice dialog system involves two communications. In the first, the voice input is sent from the client to the ASR system, which converts the voice data into text and returns it to the client; in the second, the client sends the obtained text to the NLU system, which performs semantic understanding, obtains a corresponding response, and returns it to the client.
The client therefore needs two communications to obtain a response, which complicates the communication flow.
Disclosure of Invention
In view of the above problems, the present invention is directed to a voice dialog system and a voice dialog implementation method capable of simplifying a communication flow.
The invention discloses a voice dialog implementation method for realizing a voice dialog between a client and a server, comprising the following steps:
a first transmission step of transmitting voice data from the client to the server;
a conversion step, in which the server performs voice recognition and semantic understanding on the voice data and generates text data; and
a second transmission step of transmitting the text data from the server to the client.
Optionally, in the first transmission step, communication between the client and the server is established over a long socket connection.
Optionally, the converting step comprises the sub-steps of:
extracting the characteristics of the voice data and inputting the extracted characteristics into an acoustic model to obtain a score sequence;
searching in a static decoder based on the score sequence to obtain text data corresponding to the voice data, wherein the static decoder is preset with corpus data, and the corpus data comprises scene corpus data based on a scene; and
post-processing the text data output by the decoder to obtain text data in a predetermined format.
Optionally, in the process of searching in the static decoder based on the score sequence to obtain the text data corresponding to the speech data, the static decoder searches in the scene corpus data only when matching with data in the scene corpus data is required.
Optionally, in the first transmission step, decision supplementary information for the scene decision is further sent to the server together with the voice data.
The present invention also provides a voice conversation realization system for realizing a voice conversation between a client and a server, the system comprising a client and a server connected to each other,
wherein the client is used for transmitting voice data to the server and receiving text data from the server,
the server is used for carrying out voice recognition and semantic understanding on the voice data, generating text data and transmitting the text data to the client.
Optionally, the communication between the client and the server is established over a long socket connection.
Optionally, the server includes:
the voice recognizer is used for extracting the characteristics of the voice data and inputting the extracted characteristics into an acoustic model to obtain a score sequence; and
the static decoder is used for searching the score sequence to obtain text data corresponding to the voice data, wherein the static decoder is preset with corpus data, and the corpus data comprises scene corpus data based on scenes; and
an output module for post-processing the text data output by the decoder to obtain text data in a predetermined format.
Optionally, in the process of searching by the static decoder to obtain the text data corresponding to the speech data, the static decoder searches the scene corpus data only when the static decoder needs to match with data in the scene corpus data.
Optionally, the client sends decision supplementary information for the scene decision to the server together with the voice data.
The computer-readable medium of the present invention stores a computer program which, when executed by a processor, implements the above voice dialog implementation method.
The computer device of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and is characterized in that the processor implements the above-mentioned voice conversation implementing method when executing the computer program.
As described above, according to the voice dialog system and voice dialog implementation method of the present invention, speech recognition and semantic understanding are integrated into one service, so the server can reply to the client directly after performing both on the voice data, requiring only a single communication from the client. Moreover, adding the two semantic-understanding processes of scene decision and scene decoding-network search improves speech recognition accuracy in specific scenes. Furthermore, the client and server communicate over a long socket connection whose state is maintained by the dialog state; the link is kept until the dialog ends, avoiding the resource waste of frequently creating new connections.
Other features and advantages of the methods and apparatus of the present invention will be more particularly apparent from or elucidated with reference to the drawings described herein, and the following detailed description of the embodiments used to illustrate certain principles of the invention.
Drawings
Fig. 1 is a flowchart showing a voice conversation realization method according to an embodiment of the present invention.
Fig. 2 is a flowchart showing a specific procedure of the conversion step S200.
Fig. 3 is a diagram showing the data protocol for communication between the client 100 and the server 200.
Fig. 4 is a block diagram showing an architecture of a voice conversation realization system according to an embodiment of the present invention.
Fig. 5 is a block diagram of the architecture of a current voice dialog system.
Detailed Description
The following description is of some of the several embodiments of the invention and is intended to provide a basic understanding of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention.
Fig. 1 is a flowchart showing a voice conversation realization method according to an embodiment of the present invention.
As shown in fig. 1, the voice dialog implementation method according to an embodiment of the present invention is used for implementing a voice dialog between a client 100 and a server 200, and the method includes the following steps:
first transmission step S100: transmitting voice data from the client 100 to the server 200;
a conversion step S200: the server 200 performs voice recognition and semantic understanding on the voice data and generates text data; and
second transmission step S300: the text data is transmitted from the server 200 to the client 100.
According to the present application, both speech recognition and semantic understanding are accomplished in the conversion step S200. Thus, unlike the prior art, where the client must make two round trips to remote services, the client 100 obtains the text data from the server 200 by sending the voice data just once.
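The single round trip described above can be sketched as follows. `recognize` and `understand` are trivial stand-ins for the server's ASR and NLU components, not the patent's actual models; the point is that both run inside one server call.

```python
# Sketch of steps S100-S300: ASR and NLU are fused into one service, so a
# single request to the server returns the final text data.

calls_to_server = 0

def recognize(audio: bytes) -> str:
    """Stand-in ASR: audio -> raw text."""
    return "call chen yi"

def understand(text: str) -> dict:
    """Stand-in NLU: raw text -> intent structure."""
    return {"intent": "call", "contact": "chen yi"}

def server_convert(audio: bytes) -> dict:
    """Conversion step S200: recognition and understanding in one service."""
    global calls_to_server
    calls_to_server += 1          # one network round trip per utterance
    return understand(recognize(audio))

# First transmission (S100) and second transmission (S300) form a single
# request/response pair from the client's point of view.
result = server_convert(b"\x00\x01")
print(calls_to_server)  # 1
```

In the prior-art architecture of fig. 5 the same turn would increment a round-trip counter twice, once for the ASR call and once for the NLU call.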
The specific contents of the conversion step S200 will be explained here.
Fig. 2 is a flowchart showing a specific procedure of the conversion step S200.
As shown in fig. 2, the converting step S200 includes the following sub-steps: step S201, step S202, and step S203.
Next, these steps will be specifically described.
Step S201: extract the features of the voice data and input the extracted features into an acoustic model to obtain a score sequence of each state at each moment. Feature extraction, input to the acoustic model, and score-sequence generation can all use conventional processing steps; they are not the focus of the present invention, so details are omitted here.
Step S202: based on the resulting score sequence, a search is made, for example in a static decoder (WFST, weighted finite-state transducer), to obtain the corresponding result, referred to herein as the search result. The static decoder comprises a state probability model and a language model; the language model is generated by training on collected corpus data and a dictionary. The search finds the maximum-score path in the probability model that satisfies the language-model constraint, i.e. the optimal solution that best matches the score sequence.
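A toy version of this search can illustrate the idea: the decoder extends only token paths the language model allows and returns the maximum-score path. A real static decoder is a compiled WFST; the small bigram table below is a stand-in for the language-model constraint, and the scores are invented.

```python
# Minimal sketch of step S202: search the per-frame acoustic score sequence
# for the best path that the language model permits.

import math

# Toy bigram log-probabilities; pairs absent from the table are disallowed.
BIGRAM = {("<s>", "call"): 0.0, ("call", "chen"): -0.1, ("chen", "yi"): -0.1}

def best_path(score_seq):
    """score_seq: list of {token: acoustic log-score} dicts, one per frame."""
    beams = {("<s>",): 0.0}            # partial path -> accumulated score
    for frame in score_seq:
        next_beams = {}
        for path, acc in beams.items():
            for tok, s in frame.items():
                lm = BIGRAM.get((path[-1], tok))
                if lm is None:          # violates the LM constraint: prune
                    continue
                cand = path + (tok,)
                score = acc + s + lm
                if score > next_beams.get(cand, -math.inf):
                    next_beams[cand] = score
        beams = next_beams
    best = max(beams, key=beams.get)    # maximum-score surviving path
    return list(best[1:])               # drop the <s> start symbol

frames = [
    {"call": -0.2, "fall": -0.3},
    {"chen": -0.1, "shen": -0.2},
    {"yi": -0.1, "li": -0.4},
]
print(best_path(frames))  # ['call', 'chen', 'yi']
```

Even though "fall" scores close to "call" acoustically, it is pruned because the language model assigns it no continuation, which is exactly how the constraint steers the search.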
It should be noted that the corpus in the language model according to the present application includes scene corpus information, which may relate to various factors such as the user's address book, the user's particular speech habits, place names, etc.; it may cover any information helpful for understanding the semantics of a specific user.
Based on this, step S202 is further explained as follows: during the search in the static decoder based on the score sequence, not only is automatic speech recognition performed, but the corpus information — such as the scene corpus information — also yields more accurate results during recognition, giving the best-matching search result, i.e. the text that best matches the speech input. It should be understood that not every voice input needs to be matched against the scene corpus information; if a determinate result can be obtained without it, the scene corpus need not be searched. In the example below, however, the scene corpus is consulted because the semantics are otherwise uncertain.
Therefore, compared with the prior art shown in fig. 5, where ASR is independent of NLU, fusing NLU and ASR lets the semantic-understanding part of NLU be applied already in the ASR stage, giving a more accurate search result corresponding to the speech input, i.e. the original text data.
By way of example, the most basic speech recognition may be performed first, followed by a further search in the scene corpus information as described above to obtain the optimal solution; this is particularly beneficial when a speech input admits multiple understandings.
Step S203: the search results (raw text data) of step S202 are post-processed to obtain a text result in a predetermined format.
Here, the voice dialog implementation method of the present invention is explained with an example.
For example, suppose user A's address book contains a contact named Chen Yi written with one pair of characters, while user B's address book contains a homophonous contact whose name is written with different characters, and both address books serve as scene corpora. When user A and user B each say "call Chen Yi", the score sequence of each speech input is obtained first (step S201); then the static decoder performs the search, and because of the scene corpus the search result for user A is accurately given as a call to user A's contact, while the result for user B names user B's contact (step S202). Finally, the search results are post-processed to obtain text results in the predetermined format (step S203), which the server sends to the client.
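The address-book example above can be sketched as a lookup in a per-user scene corpus. The contact characters below (陈一 / 陈屹) are hypothetical stand-ins for the two homophonous names; the fallback branch mirrors the rule that the scene corpus is consulted only when it can resolve the input.

```python
# Sketch of homophone disambiguation via per-user scene corpora: the same
# pronunciation resolves to a different written form for each user.

SCENE_CORPUS = {
    "user_a": {"chen yi": "陈一"},   # user A's contact (hypothetical)
    "user_b": {"chen yi": "陈屹"},   # user B's homophonous contact (hypothetical)
}

def resolve_name(user: str, pinyin: str) -> str:
    """Pick the written form of a recognized name from the user's scene corpus."""
    corpus = SCENE_CORPUS.get(user, {})
    # Fall back to the raw transcription if the scene corpus has no match.
    return corpus.get(pinyin, pinyin)

print(resolve_name("user_a", "chen yi"))  # 陈一
print(resolve_name("user_b", "chen yi"))  # 陈屹
```

The same sounds thus yield the correct text for each user, which a scene-agnostic decoder could not guarantee.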
In particular, in the present application, the client 100 and the server 200 communicate by establishing a long socket connection. A socket is one endpoint of a bidirectional communication connection through which two programs on a network exchange data. A long socket connection means the client and server use only one socket object for the whole communication process, keeping the connection open for an extended time. The data protocol for communication between the client 100 and the server 200 is shown in fig. 3.
As shown in fig. 3, the communication data comprises a header portion, a voice data portion, and an end flag.
The header part includes the header length and the supplementary information needed for the scene decision in semantic understanding, such as vehicle ID, current location, Bluetooth connection status, current navigation status, etc. (not mentioned in the example of fig. 2, but this information can also serve as scene corpus information). For example, when the user wants to search for a nearby restaurant, the client 100 queries the server 200: following the data protocol of fig. 3, the client 100 transmits the current location in the header and then the audio data corresponding to "search for a nearby restaurant", and after recognizing the intent the server 200 can perform the search using the location carried in the header.
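A minimal framing sketch of this protocol follows, assuming a 4-byte big-endian header-length field and a JSON-encoded header; the patent names the parts (header length, supplementary information, voice data, end flag) but not their exact encoding, so the field layout here is an assumption.

```python
# Sketch of the fig. 3 data protocol: header length + header (decision
# supplementary information) + voice data + end flag. Encoding is assumed.

import json
import struct

END_FLAG = b"\xff\xff"  # hypothetical end-flag bytes

def pack_frame(supplementary: dict, voice: bytes) -> bytes:
    header = json.dumps(supplementary, sort_keys=True).encode("utf-8")
    # 4-byte big-endian header length, then header, voice data, end flag.
    return struct.pack(">I", len(header)) + header + voice + END_FLAG

def unpack_frame(frame: bytes):
    (hlen,) = struct.unpack(">I", frame[:4])
    header = json.loads(frame[4 : 4 + hlen].decode("utf-8"))
    assert frame.endswith(END_FLAG), "malformed frame: missing end flag"
    voice = frame[4 + hlen : -len(END_FLAG)]
    return header, voice

info = {"vehicle_id": "V123", "location": "39.9,116.4"}
frame = pack_frame(info, b"\x01\x02\x03")
print(unpack_frame(frame) == (info, b"\x01\x02\x03"))  # True
```

The length-prefixed header lets the server read the supplementary information before the audio arrives, matching the restaurant-search example above.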
Moreover, the socket connection state is maintained through the dialog state, and the link is kept until the dialog ends, avoiding the resource waste caused by frequently creating new connections.
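The long-connection behaviour can be sketched in-process with `socketpair`, which stands in for a real client-server TCP link: every turn of the dialog reuses the same socket objects, and the link is released only when the session ends.

```python
# Sketch of a long socket connection: one socket pair carries several
# dialog turns and is closed only when the session finishes.

import socket

client, server = socket.socketpair()

def send_turn(sock: socket.socket, payload: bytes) -> None:
    # Length-prefix each turn so the receiver knows where it ends.
    sock.sendall(len(payload).to_bytes(4, "big") + payload)

def recv_turn(sock: socket.socket) -> bytes:
    n = int.from_bytes(sock.recv(4), "big")
    return sock.recv(n)

received = []
# Several dialog turns travel over the same connection - no reconnects.
for utterance in (b"turn-1-audio", b"turn-2-audio"):
    send_turn(client, utterance)
    received.append(recv_turn(server))

# The link is released only when the dialog session finishes.
client.close()
server.close()
print(received)  # [b'turn-1-audio', b'turn-2-audio']
```

Opening a fresh connection per utterance would instead pay the connection-setup cost on every turn, which is the waste the dialog-state-keyed connection avoids.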
The voice dialogue implementing method according to the present invention is explained above, and the voice dialogue implementing system according to the present invention is explained next.
Fig. 4 is a block diagram showing an architecture of a voice conversation realization system according to an embodiment of the present invention.
As shown in fig. 4, the voice conversation realization system according to an embodiment of the present invention is used to realize a voice conversation between a client 100 and a server 200.
Wherein the client 100 is used to transmit voice data to the server 200 and to receive text data from the server 200. The server 200 is configured to perform speech recognition and semantic understanding on the speech data, generate text data, and transmit the text data to the client 100.
The client 100 includes:
a sending module 110, configured to send voice data; and
a receiving module 120, configured to receive text data.
The server 200 includes:
the voice recognizer 210 is configured to perform feature extraction on the voice data and input the voice data into an acoustic model to obtain a score sequence of each state at each time;
a static decoder 220 that performs a search based on the score sequence to obtain text data corresponding to the speech data, wherein corpus data including scene corpus data based on a scene is preset in the static decoder; and
an output module 230 for post-processing the search result to obtain a text result in a predetermined format; the output module 230 is, for example, a communication component.
In the process of searching by the static decoder 220 to obtain the text data corresponding to the speech data, the static decoder 220 searches the scene corpus data only when it needs to match with the data in the scene corpus data.
Preferably, a long socket connection is established between the client 100 and the server 200, and communication uses the data protocol shown in fig. 3, with the client 100 transmitting decision supplementary information for the scene decision to the server 200 together with the voice data.
The present invention also provides a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements the voice conversation implementing method described above.
The present invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above-mentioned voice conversation implementing method when executing the computer program.
The above examples mainly illustrate the voice dialogue system and the voice dialogue implementing method of the present invention. Although only a few embodiments of the present invention have been described in detail, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.
Claims (12)
1. A voice conversation realization method for realizing a voice conversation between a client and a server, comprising the steps of:
a first transmission step of transmitting voice data from the client to the server;
a conversion step, in which the server performs voice recognition and semantic understanding on the voice data and generates text data; and
a second transmission step of transmitting the text data from the server to the client.
2. The voice conversation realization method of claim 1,
wherein, in the first transmission step, communication between the client and the server is established over a long socket connection.
3. The voice conversation realization method according to claim 1, wherein the conversion step comprises the following sub-steps:
extracting the characteristics of the voice data and inputting the extracted characteristics into an acoustic model to obtain a score sequence;
searching in a static decoder based on the score sequence to obtain text data corresponding to the voice data, wherein the static decoder is preset with corpus data, and the corpus data comprises scene corpus data based on a scene; and
post-processing the text data output by the decoder to obtain text data in a predetermined format.
4. The voice conversation realization method according to claim 3, wherein, in the search in the static decoder based on the score sequence to obtain the text data corresponding to the voice data, the static decoder searches the scene corpus data only when a match with data in the scene corpus data is required.
5. The voice conversation realization method of claim 1,
wherein, in the first transmission step, decision supplementary information for the scene decision is further sent to the server together with the voice data.
6. A voice conversation realization system for realizing a voice conversation between a client and a server, comprising a client and a server connected to each other,
wherein the client is used for transmitting voice data to the server and receiving text data from the server,
the server is used for carrying out voice recognition and semantic understanding on the voice data, generating text data and transmitting the text data to the client.
7. The voice conversation realization system of claim 6,
wherein communication between the client and the server is established over a long socket connection.
8. The voice conversation realization system according to claim 6, wherein said server comprises:
the voice recognizer is used for extracting the characteristics of the voice data and inputting the extracted characteristics into an acoustic model to obtain a score sequence;
the static decoder is used for searching the score sequence to obtain text data corresponding to the voice data, wherein the static decoder is preset with corpus data, and the corpus data comprises scene corpus data based on scenes; and
an output module for post-processing the text data output by the decoder to obtain text data in a predetermined format.
9. The voice conversation realization system of claim 8,
wherein, in the process in which the static decoder searches to obtain the text data corresponding to the voice data, the static decoder searches the scene corpus data only when a match with data in the scene corpus data is required.
10. The voice conversation realization system of claim 8,
wherein the client sends decision supplementary information for the scene decision to the server together with the voice data.
11. A computer-readable medium having stored thereon a computer program,
wherein the computer program, when executed by a processor, implements the voice conversation realization method of any of claims 1-5.
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910108497.1A CN111524508A (en) | 2019-02-03 | 2019-02-03 | Voice conversation system and voice conversation implementation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910108497.1A CN111524508A (en) | 2019-02-03 | 2019-02-03 | Voice conversation system and voice conversation implementation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111524508A true CN111524508A (en) | 2020-08-11 |
Family
ID=71900456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910108497.1A Pending CN111524508A (en) | 2019-02-03 | 2019-02-03 | Voice conversation system and voice conversation implementation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111524508A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130346078A1 (en) * | 2012-06-26 | 2013-12-26 | Google Inc. | Mixed model speech recognition |
CN103794211A (en) * | 2012-11-02 | 2014-05-14 | 北京百度网讯科技有限公司 | Voice recognition method and system |
CN105551493A (en) * | 2015-11-30 | 2016-05-04 | 北京光年无限科技有限公司 | Method and device of data processing of children voice robot and children voice robot |
CN107943834A (en) * | 2017-10-25 | 2018-04-20 | 百度在线网络技术(北京)有限公司 | Interactive implementation method, device, equipment and storage medium |
CN108428446A (en) * | 2018-03-06 | 2018-08-21 | 北京百度网讯科技有限公司 | Audio recognition method and device |
CN108899013A (en) * | 2018-06-27 | 2018-11-27 | 广州视源电子科技股份有限公司 | Voice search method, device and speech recognition system |
- 2019-02-03: Application CN201910108497.1A filed; published as CN111524508A; status: pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113470631A (en) * | 2021-06-28 | 2021-10-01 | 北京小米移动软件有限公司 | Voice signal processing method and device, electronic equipment and storage medium |
CN113593568A (en) * | 2021-06-30 | 2021-11-02 | 北京新氧科技有限公司 | Method, system, apparatus, device and storage medium for converting speech into text |
CN113593568B (en) * | 2021-06-30 | 2024-06-07 | 北京新氧科技有限公司 | Method, system, device, equipment and storage medium for converting voice into text |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101683944B1 (en) | Speech translation system, control apparatus and control method | |
US11049493B2 (en) | Spoken dialog device, spoken dialog method, and recording medium | |
EP0954856B1 (en) | Context dependent phoneme networks for encoding speech information | |
EP1125279B1 (en) | System and method for providing network coordinated conversational services | |
US7003463B1 (en) | System and method for providing network coordinated conversational services | |
CN113327609B (en) | Method and apparatus for speech recognition | |
US20060149551A1 (en) | Mobile dictation correction user interface | |
JP2017107078A (en) | Voice interactive method, voice interactive device, and voice interactive program | |
KR20170033722A (en) | Apparatus and method for processing user's locution, and dialog management apparatus | |
JP2005530279A (en) | System and method for accessing Internet content | |
JP5471106B2 (en) | Speech translation system, dictionary server device, and program | |
KR101640024B1 (en) | Portable interpretation apparatus and method based on uer's situation | |
CN101681365A (en) | Method and apparatus for distributed voice searching | |
CN102439661A (en) | Service oriented speech recognition for in-vehicle automated interaction | |
US8509396B2 (en) | Automatic creation of complex conversational natural language call routing system for call centers | |
CN110910903B (en) | Speech emotion recognition method, device, equipment and computer readable storage medium | |
CN105206272A (en) | Voice transmission control method and system | |
JP2014106523A (en) | Voice input corresponding device and voice input corresponding program | |
JP2011232619A (en) | Voice recognition device and voice recognition method | |
CN111094924A (en) | Data processing apparatus and method for performing voice-based human-machine interaction | |
CN111524508A (en) | Voice conversation system and voice conversation implementation method | |
JP3795350B2 (en) | Voice dialogue apparatus, voice dialogue method, and voice dialogue processing program | |
JP4962416B2 (en) | Speech recognition system | |
KR101326262B1 (en) | Speech recognition device and method thereof | |
JP2004515859A (en) | Decentralized speech recognition for Internet access |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200811 |