CN111816190A - Voice interaction method and device for upper computer and lower computer - Google Patents

Voice interaction method and device for upper computer and lower computer

Info

Publication number
CN111816190A
Authority
CN
China
Prior art keywords
result
awakening
computer
input audio
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010654204.2A
Other languages
Chinese (zh)
Inventor
宋泽
甘津瑞
邓建凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd filed Critical AI Speech Ltd
Priority to CN202010654204.2A priority Critical patent/CN111816190A/en
Publication of CN111816190A publication Critical patent/CN111816190A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 - Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16 - Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161 - Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H04L69/162 - Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice interaction method and device for an upper computer and a lower computer. The method includes: in response to a user's input audio, determining whether the device is in the wake-up state; if it is not in the wake-up state, sending the input audio to a wake-up kernel, where the wake-up kernel outputs a wake-up result based on the input audio; receiving the wake-up result and storing it in a data cache queue; and sending the wake-up result to the websocket client in the upper computer via the websocket server. Voice interaction runs on the lower computer and the results are transmitted to the upper computer for display, so the stability of the upper computer's existing programs is preserved, and communication with the lower computer is realized through the websocket service, which is fast, convenient, and highly flexible.

Description

Voice interaction method and device for upper computer and lower computer
Technical Field
The invention belongs to the technical field of voice interaction, and particularly relates to a voice interaction method and device for an upper computer and a lower computer.
Background
At present, many companies have successively introduced single-point technologies such as voice wake-up, speech recognition, natural language understanding, dialogue management, and speech synthesis, providing users with basic voice interaction capabilities for developing voice products. Because these voice technologies offer only simple interaction capabilities and require users to implement the voice interaction logic themselves, full-link voice dialogue systems have been proposed on this basis to reduce developers' workload.
Voice wake-up, technically known as keyword spotting (KWS), is the real-time detection of a speaker-specific segment in a continuous speech stream. Real-time detection is the key point: the purpose of voice wake-up is to activate a device from a dormant state to a running state, so the wake-up word should be detected immediately after it is spoken to give a good user experience.
Speech recognition converts spoken content into text information that a computer can read in. It has two working modes, a recognition mode and a command mode, and a speech recognition program can be implemented differently depending on the mode. In the recognition mode, the engine directly provides a word library and a recognition template library in the background; no system needs to further change the recognition grammar, and developers only need to rewrite according to the main program source code provided by the recognition engine. The command mode is relatively difficult to implement: the dictionary must be written by the programmer, compiled into the program, and finally processed and corrected against the phonetic dictionary. The biggest difference between the two modes is therefore whether the programmer checks and modifies the code according to the dictionary content.
Natural language processing is an important means of realizing natural-language communication between humans and machines. It includes two parts, natural language understanding (NLU) and natural language generation (NLG), which enable a computer to understand the meaning of natural language text and to express a given intention or thought in natural language text. Natural language understanding builds a computer model that is grounded in linguistics and fuses disciplines such as logic, psychology, and computer science. It attempts to solve the following problems: how is language organized to transmit information, and how does a person in turn obtain information from a series of language symbols? In other words, it obtains a semantic representation of natural language through syntactic, semantic, and pragmatic analysis, and understands the intention expressed by the text. Natural language generation is a branch of artificial intelligence and computational linguistics; the corresponding language generation system is a computer model based on language information processing whose working process is the reverse of natural language analysis, namely generating text from an abstract concept level by selecting and executing certain semantic and grammatical rules.
Speech synthesis, also called text-to-speech (TTS), can convert arbitrary text information into standard, fluent speech in real time for playback, which is equivalent to fitting a machine with an artificial mouth. It involves multiple disciplines such as acoustics, linguistics, digital signal processing, and computer science, and is a leading-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text information into audible sound information, that is, how to make a machine speak like a person.
In the process of implementing the present application, the inventors found that the above technology has at least the following defects:
Because voice wake-up, speech recognition, natural language understanding, and speech synthesis are single-point technologies that each provide only one functional capability, developers must embed several of them into a project to realize a complete human-computer interaction function and develop an application program.
However, implementing voice interaction this way burdens the developer with heavy development tasks: from inputting audio data to producing a recognition result, then performing natural language processing to produce a semantic result, and finally synthesizing the dialogue result to complete one round of man-machine interaction. Developers therefore undertake a large amount of work, which lowers efficiency and consumes effort. For this reason, the company provides a full-link dialogue management system based on the DUI platform that integrates speech recognition, semantic understanding, and speech synthesis; developers only need to input audio data to obtain synthesized audio data, which reduces their workload and markedly improves development efficiency.
However, because this usage mode must be bound to the client program, it has a certain influence on the existing programs of the client device. This design has poor flexibility, so it clearly cannot meet the needs of customers who demand high device stability, and it further limits the customer's choice of development language.
Disclosure of Invention
The embodiments of the invention provide a voice interaction method and device for an upper computer and a lower computer, to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice interaction method for an upper computer and a lower computer, including: the lower computer, in response to a user's input audio, determining whether it is in the wake-up state; if it is not in the wake-up state, sending the input audio to a wake-up kernel, where the wake-up kernel outputs a wake-up result based on the input audio; receiving the wake-up result and storing it in a data cache queue; and sending the wake-up result to the websocket client in the upper computer via the websocket server.
In a second aspect, an embodiment of the present invention provides a voice interaction device for an upper computer and a lower computer, including: a judging module configured to, in response to a user's input audio, determine whether the lower computer is in the wake-up state; a wake-up module configured to send the input audio to a wake-up kernel if it is not in the wake-up state, where the wake-up kernel outputs a wake-up result based on the input audio; a receiving cache module configured to receive the wake-up result and store it in a data cache queue; and a sending module configured to send the wake-up result to the websocket client in the upper computer via the websocket server.
In a third aspect, a computer program product is provided, the computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method for voice interaction between an upper computer and a lower computer according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
The method provided by the embodiments of the application adopts an offline full-link voice interaction method between an upper computer and a lower computer that integrates multiple voice interaction technologies: voice interaction runs on the lower computer, and the results are transmitted to the upper computer for display. This guarantees the stability of the upper computer's existing programs, while communication with the lower computer is realized through the websocket service, which is fast, convenient, and highly flexible.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a voice interaction method for an upper computer and a lower computer according to an embodiment of the present invention;
fig. 2 is a flowchart of another voice interaction method for an upper computer and a lower computer according to an embodiment of the present invention;
fig. 3 is a flowchart of another voice interaction method for an upper computer and a lower computer according to an embodiment of the present invention;
fig. 4 is a flowchart of another voice interaction method for an upper computer and a lower computer according to an embodiment of the present invention;
fig. 5 is a schematic view of a voice interaction process of an upper computer and a lower computer according to a specific embodiment of a voice interaction method scheme for the upper computer and the lower computer according to the embodiment of the present invention;
fig. 6 is a block diagram of a voice interaction device for an upper computer and a lower computer according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. The described embodiments are obviously some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a voice interaction method for an upper computer and a lower computer according to the present invention is shown, where the upper computer and the lower computer establish a connection through a websocket, the lower computer includes a websocket server, and the upper computer includes a websocket client.
WebSocket is a protocol for full-duplex communication over a single TCP connection. The IETF standardized the WebSocket communication protocol as RFC 6455 in 2011, supplemented by RFC 7936, and the WebSocket API is standardized by the W3C.
WebSocket makes data exchange between client and server simpler and allows the server to actively push data to the client. In the WebSocket API, the browser and the server only need to complete one handshake to establish a persistent connection directly between them and transmit data in both directions.
Many websites use polling to implement push. Polling means the browser issues an HTTP request to the server at a fixed interval (for example, every second), and the server returns the latest data to the client's browser. This traditional model has an obvious drawback: the browser must send requests to the server continuously, yet an HTTP request may contain a long header in which the really useful data is only a small part, which wastes a great deal of bandwidth and other resources.
A newer polling technique is Comet. Although it achieves two-way communication, it still requires repeated requests, and the long-held connections commonly used in Comet also consume server resources.
Against this background, HTML5 defines the WebSocket protocol, which saves server resources and bandwidth and enables communication that is closer to real time.
Less control overhead. After the connection is created, the packet headers used for protocol control when data is exchanged between server and client are relatively small. Without extensions, this header is only 2 to 10 bytes for server-to-client content (depending on the packet length); for client-to-server content, an additional 4-byte mask is added. This overhead is significantly lower than that of HTTP requests, which carry a complete header every time.
Stronger real-time performance. Because the protocol is full duplex, the server can actively send data to the client at any time. Compared with HTTP, where a response is only possible after the client initiates a request, the delay is significantly smaller; even compared with long-polling techniques such as Comet, it can deliver data more often in a short time.
A maintained connection state. Unlike HTTP, WebSocket first creates a connection, which makes it a stateful protocol; some state information can then be omitted during communication, whereas HTTP requests may need to carry state information (such as authentication) with every request.
Better binary support. WebSocket defines binary frames, which handle binary content more easily than HTTP.
Extensibility. WebSocket defines extensions, so users can extend the protocol and implement partially customized subprotocols, such as the compression supported by some browsers.
Better compression. Compared with HTTP compression, WebSocket with suitable extension support can reuse the context of previously transmitted content, significantly improving the compression rate when transferring similar data.
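As a concrete illustration of this persistent, server-push link between the lower computer (websocket server) and the upper computer (websocket client), the following is a minimal sketch assuming the third-party Python websockets package; the port, host name, and message format are illustrative assumptions, since the patent does not specify an implementation.

```python
# Minimal sketch, assuming the third-party "websockets" package
# (pip install websockets, version 11+); the port, host name, and
# JSON message shape are illustrative, not taken from the patent.
import asyncio
import websockets

async def push_results(websocket):
    # Server push: the lower computer sends a result without
    # waiting for a request from the upper computer.
    await websocket.send('{"type": "wakeup", "result": "awake"}')

async def lower_computer_server():
    # Lower computer side: websocket server.
    async with websockets.serve(push_results, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever

async def upper_computer_client():
    # Upper computer side: one handshake, then a persistent
    # full-duplex connection on which pushed results arrive.
    async with websockets.connect("ws://lower-computer.local:8765") as ws:
        print(await ws.recv())
```

Either side can also send on the same connection at any time, which is the full-duplex property the following embodiments rely on.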
Because the upper computer is only used to display the voice interaction result, its hardware requirements are low: any device with a screen can serve as the upper computer. The lower computer, by contrast, must run all of the algorithms, so its requirements are higher (memory above 4 GB and a multi-core processor). Since a PLC or single-chip microcomputer has too little performance to meet these requirements, a high-performance android rk3399 box or ubuntu box available on the market is recommended as the lower computer, which controls the related device elements and drivers.
For example, the upper computer may be a user terminal that needs to provide a voice service, and the lower computer may be an android rk3399 box or an ubuntu box.
As shown in fig. 1, in step 101, in response to a user's input audio, it is determined whether the device is in the wake-up state;
in step 102, if it is not in the wake-up state, the input audio is sent to a wake-up kernel, where the wake-up kernel outputs a wake-up result based on the input audio;
in step 103, the wake-up result is received and stored in a data cache queue;
in step 104, the wake-up result is sent to the websocket client in the upper computer via the websocket server.
In this embodiment, for step 101, after receiving a user's input audio, the voice interaction device determines whether the upper computer is in the wake-up state, that is, whether it is still dormant and not yet awakened. The wake-up state means the device has been awakened and is already in voice interaction with the user. Although all algorithms run on the lower computer, once wake-word audio is fed to the lower computer's wake-up algorithm and the device is awakened, the wake-up state is immediately sent to the upper computer and also stored on the lower computer; the wake-up states of the upper and lower computers are therefore consistent, so whether the upper computer is in the wake-up state can be determined.
For step 102, if the voice interaction device determines that the upper computer is not in the wake-up state, meaning it cannot yet interact with the user by voice, a wake-up operation must first be performed: the user's input audio is sent to a wake-up kernel for wake-up judgment, and the wake-up kernel outputs a wake-up result based on the input audio.
For step 103, when the voice interaction device receives the wake-up result from the wake-up kernel, it stores the result in a data cache queue, where it waits to be sent to the upper computer in order.
For step 104, the wake-up results in the data cache queue are sent in order to the websocket client in the upper computer via the websocket server in the lower computer. Further, if the wake-up result indicates the device can be awakened, the upper computer is awakened; otherwise it is not, which is not repeated here.
For example, when the upper computer is a microcomputer that needs only (or at least) a voice wake-up result, this scheme can be adopted: the wake-up state judgment and the wake-up operation are realized on the lower computer, and the final wake-up result is then sent to the upper computer, which needs no wake-up-related components and only has to receive the result.
In the scheme of this embodiment, after receiving and collecting the user's input audio, the lower computer determines whether it is in the wake-up state; if not, it performs the wake-up operation through its wake-up kernel to generate a wake-up result, stores the result in the data cache queue, and then sends it to the websocket client in the upper computer via the websocket server in the lower computer. The complicated operations are thus handled by the lower computer, and the upper computer only needs to receive the wake-up result, which guarantees the stability of the upper computer's existing programs, while communication with the lower computer through the websocket service is fast, convenient, and highly flexible.
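The following is a minimal sketch of steps 101 to 104 on the lower computer, assuming a hypothetical WakeupKernel.detect call, message format, and polling interval; none of these names come from the patent.

```python
# Sketch of steps 101-104 on the lower computer; WakeupKernel.detect,
# the message shape, and the polling interval are hypothetical.
import asyncio
import json
import queue

result_queue: "queue.Queue[dict]" = queue.Queue()  # the data cache queue
awake = False                                      # wake-up state

def on_input_audio(audio: bytes, wakeup_kernel) -> None:
    # Steps 101-103: if not awake, let the wake-up kernel judge the
    # audio and cache its result for ordered delivery.
    if not awake:
        result = wakeup_kernel.detect(audio)       # hypothetical call
        result_queue.put({"type": "wakeup", "result": result})

async def send_cached_results(websocket) -> None:
    # Step 104: drain the cache queue in order to the upper
    # computer's websocket client.
    while True:
        while not result_queue.empty():
            await websocket.send(json.dumps(result_queue.get()))
        await asyncio.sleep(0.01)
```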
Please refer to fig. 2, which shows a flowchart of another voice interaction method for an upper computer and a lower computer according to an embodiment of the present application. This flowchart covers the embodiment for the case where the upper computer receives only the ASR result.
As shown in fig. 2, in step 201, in response to a user's input audio, it is determined whether the device is in the wake-up state;
in step 202, if it is in the wake-up state, the input audio is input to a speech recognition service, where the speech recognition service outputs a speech recognition result based on the input audio;
in step 203, the speech recognition result is received and stored in a data cache queue;
in step 204, the speech recognition result is sent to the websocket client in the upper computer via the websocket server.
In this embodiment, step 201 is the same as step 101 of the previous embodiment and is not repeated here. For step 202, when the voice interaction device determines that the upper computer is in the wake-up state, it can proceed directly to speech recognition and send the user's input audio to the recognition service; for step 203, the recognition result is stored in the data cache queue; and for step 204, the results in the data cache queue are sent in order to the websocket client in the upper computer via the websocket server in the lower computer.
For example, when the upper computer is a microcomputer that needs only (or at least) a speech recognition result, this scheme can be adopted: the lower computer performs the wake-up state judgment and the recognition, and then sends the final speech recognition result to the upper computer, which needs no speech-recognition components and only has to receive the result.
In the scheme of this embodiment, after receiving and collecting the user's input audio, the lower computer determines whether it is in the wake-up state; if so, it sends the audio to the speech recognition service to generate a speech recognition result, stores the result in the data cache queue, and sends it to the websocket client in the upper computer via the websocket server in the lower computer. The complicated operations are handled by the lower computer, and the upper computer only needs to receive the speech recognition result, which guarantees the stability of the upper computer's existing programs, while communication with the lower computer through the websocket service is fast, convenient, and highly flexible.
Please refer to fig. 3, which shows a flowchart of another voice interaction method for an upper computer and a lower computer according to an embodiment of the present application. This flowchart covers the embodiment for the case where the upper computer receives only the DM result.
As shown in fig. 3, in step 301, in response to a user's input audio, it is determined whether the device is in the wake-up state;
in step 302, if it is in the wake-up state, the input audio is input to a speech recognition service, where the speech recognition service outputs a speech recognition result based on the input audio;
in step 303, the speech recognition result is received and input to a semantic understanding service, where the semantic understanding service outputs a semantic understanding result based on the speech recognition result;
in step 304, the semantic understanding result is received and input to a dialogue management service, where the dialogue management service outputs a dialogue management result based on the semantic understanding result;
in step 305, the dialogue management result is received and stored in a data cache queue;
in step 306, the dialogue management result is sent to the websocket client in the upper computer via the websocket server.
In this embodiment, step 301 is the same as step 101 of the previous embodiment and is not repeated here. For step 302, when the voice interaction device determines that the upper computer is in the wake-up state, it sends the user's input audio to the recognition service. For step 303, the speech recognition result is received and input to the semantic understanding service. For step 304, the semantic understanding result, generated from the recognition result by natural language processing, is received and input to the dialogue management service. For step 305, the dialogue management result is received, transmitted to the lower computer, and stored in the data cache queue by the lower computer. Finally, for step 306, the dialogue management results in the data cache queue are sent in order to the websocket client in the upper computer via the websocket server in the lower computer.
For example, when the upper computer is a microcomputer that needs only (or at least) a dialogue management result, this scheme can be adopted: the audio is sent to the server side for the speech recognition service, natural language processing is applied to obtain the dialogue management result, the final result is cached by the lower computer and then sent to the upper computer, and the upper computer needs no dialogue-management components and only has to receive the result.
In the scheme of this embodiment, after receiving and collecting the user's input audio, the lower computer determines whether it is in the wake-up state; if so, the audio is sent to the server side for the speech recognition service, natural language processing is applied to obtain the dialogue management result, the final result is cached by the lower computer, and it is then sent to the websocket client in the upper computer via the websocket server in the lower computer. The complicated operations are handled by the lower computer, and the upper computer only needs to receive the dialogue management result. This guarantees the stability of the upper computer's existing programs, while communication with the lower computer through the websocket service is fast, convenient, and highly flexible.
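As an illustration of the server-side chain behind steps 302 to 306, the sketch below chains the three services; the asr, nlu, and dm objects and their method names are hypothetical stand-ins, since the patent names the services but not their interfaces.

```python
# Sketch of the scenario-3 chain: ASR result -> semantic understanding
# -> dialogue management; service objects and methods are hypothetical.
def produce_dm_result(audio: bytes, asr, nlu, dm) -> dict:
    text = asr.recognize(audio)        # speech recognition result (step 302)
    semantics = nlu.understand(text)   # semantic understanding result (step 303)
    reply = dm.respond(semantics)      # dialogue management result (step 304)
    return {"type": "dm", "result": reply}

# The returned dict is then stored in the data cache queue and pushed
# to the upper computer exactly as in the wake-up scenario (steps 305-306).
```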
Please refer to fig. 4, which shows a flowchart of another voice interaction method for an upper computer and a lower computer according to an embodiment of the present application. This flowchart covers the embodiment for the case where the upper computer receives no message and the lower computer broadcasts the result.
As shown in fig. 4, in step 401, in response to a user's input audio, it is determined whether the device is in the wake-up state;
in step 402, if it is in the wake-up state, the input audio is input to a speech recognition service, where the speech recognition service outputs a speech recognition result based on the input audio;
in step 403, the speech recognition result is received and input to a semantic understanding service, where the semantic understanding service outputs a semantic understanding result based on the speech recognition result;
in step 404, the semantic understanding result is received and input to a dialogue management service, where the dialogue management service outputs a dialogue management result based on the semantic understanding result;
in step 405, the dialogue management result is received and input to a speech synthesis service, where the speech synthesis service outputs a speech synthesis result based on the dialogue management result;
in step 406, the speech synthesis result is received and stored in a data cache queue;
in step 407, the speech synthesis result is broadcast as audio.
In this embodiment, steps 401, 402, 403, and 404 are the same as steps 301, 302, 303, and 304 of the previous embodiment and are not repeated here. For step 405, the dialogue management result, generated from the speech recognition result by natural language processing, is received and input to the speech synthesis service. For step 406, the speech synthesis result is received, transmitted to the lower computer, and stored in the data cache queue by the lower computer. Finally, for step 407, the speech synthesis result is broadcast as audio to the user.
For example, when the upper computer is a microcomputer, the case where the upper computer receives no message but the lower computer broadcasts can adopt this scheme: the audio is sent to the server side for the speech recognition service, natural language processing is applied to obtain the dialogue management result, the final dialogue management result is synthesized into speech, and the speech synthesis result is cached by the lower computer and broadcast directly to the user on the lower computer.
In the scheme of this embodiment, after receiving and collecting the user's input audio, the lower computer determines whether it is in the wake-up state; if so, the audio is sent to the server side for the speech recognition service, natural language processing is applied to obtain the dialogue management result, the final dialogue management result is synthesized into speech, and the speech synthesis result is cached by the lower computer and broadcast directly to the user. The complicated operations are handled by the lower computer, and the upper computer does not need to receive any message. This guarantees the stability of the upper computer's existing programs, while communication with the lower computer through the websocket service is fast, convenient, and highly flexible.
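Relative to the previous sketch, scenario 4 only appends a synthesis step and replaces the websocket push with local playback. The tts, player, and cache_queue objects below are again hypothetical:

```python
# Scenario-4 tail: synthesize the DM result and broadcast it on the
# lower computer; tts.synthesize and player.play are hypothetical.
def broadcast_reply(reply: str, tts, player, cache_queue) -> None:
    pcm = tts.synthesize(reply)     # speech synthesis result (step 405)
    cache_queue.put(pcm)            # data cache queue (step 406)
    player.play(cache_queue.get())  # audio broadcast (step 407)
```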
The method of the above embodiments further includes, regarding the wake-up word: in response to a wake-up word customization instruction from the user, customizing the user's personalized wake-up word.
With the method provided by this embodiment of the application, wake-up words can be customized for different scenarios while remaining consistent, and a user can further have an exclusive, personalized wake-up word.
The method of the above embodiments further includes: the upper computer has a display function, and the lower computer performs voice interaction with the user and transmits the results to the upper computer for display.
With this method, voice interaction and the display function are completely independent: voice interaction runs on the lower computer, and the results are transmitted to the upper computer for display. This guarantees the stability of the upper computer's existing programs, while the upper computer communicates with the lower computer through the websocket service, which is faster, more convenient, and more flexible.
The following description presents some of the problems the inventors encountered in implementing the present disclosure, and one specific embodiment of the finally determined solution, so that those skilled in the art can better understand the disclosure.
DDS (DUI Dialogue Service) is a full-link dialogue management system proposed by AISpeech based on the DUI platform. It integrates technologies such as voice wake-up, speech recognition, semantic understanding, voice dialogue, and speech synthesis, and provides the comprehensive services a developer needs when customizing a dialogue management system, such as GUI customization, version management, and private cloud deployment. Its advantage is that it not only carries the dialogue capability of intelligent voice technology but also allows the various services to be fully customized according to developers' needs. As for the DUI platform: DUI (Dialogue User Interface) is a development and configuration platform that enables voice interaction scenarios for devices. Through this customization platform, voice interaction can be added to products such as hardware, devices, and mobile phone apps; in the process it provides developers with highly available, customized man-machine dialogue services, provides core interaction capabilities to developers of intelligent terminals, and assists the intelligent upgrading of traditional equipment.
In the course of realizing the present application, the inventors found that the drawbacks of the prior solutions arise as follows: a single-point technology provides only basic voice capability and must be developed on the client device, with multiple single-point voice technologies integrated to complete a voice interaction program; the full-link voice dialogue technology integrates voice wake-up, speech recognition, semantic understanding, voice dialogue, speech synthesis, and other functions, but only reduces the customer's development workload. Because these technologies output results only to the local machine, no websocket service is started for third-party devices to access. In addition, owing to the limitations of the voice technology's development language, the client's application development language is also restricted in order to achieve better compatibility.
Why the solution is not easily conceived:
To leave the existing code of the client device's programs essentially untouched, the provider must improve the stability of the voice technology so that the client's voice program is robust and downtime is avoided as much as possible; these problems would otherwise fall to the client.
The online full-link voice interaction method with an upper computer and a lower computer has complex logic: the lower computer must communicate with the server side, start the websocket service for third-party devices to access, and push the voice results to those devices, which involves the fusion of several single-point technologies (KWS, ASR, NLP, DM, and TTS) together with the websocket communication service.
The technical problem existing in the prior art is solved through the following scheme:
First, the lower computer establishes communication with the server through a network protocol and brings the service's voice results down to the lower computer, which removes the customer's complex voice interaction development logic.
Second, the websocket service is built on the lower computer, so the customer connects the upper computer to the lower computer through the websocket protocol. This meets the requirement of preserving the stability of the customer device's existing programs, and because websocket is a standard, the customer is free to choose a familiar development language.
The scheme of the embodiment of the application is realized by the following steps:
scenario 1: the upper computer only receives Wakeup result
Inputting audio;
the audio acquisition module acquires audio;
judging whether the audio is in an awakening state or not, if not, immediately sending the audio into a Wakeup kernel, and storing Wakeup result into a data cache queue;
and the lower computer sends the Wakeup Result in the cache queue to the websocket client of the upper computer through the websocket server.
Scenario 2: the upper computer only receives ASR result (speech recognition result)
Inputting audio;
the audio acquisition module acquires audio;
judging whether the voice is in the awakening state or not, if so, immediately sending the voice to an identification service, and storing the returned ASR result into a data cache queue;
and the lower computer sends the ASR result in the cache queue to the websocket client of the upper computer through the websocket server.
Scenario 3: the upper computer only receives DM result (dialogue management result)
Inputting audio;
the audio acquisition module acquires audio;
judging whether the audio is in the awakening state or not, and if the audio is in the awakening state, immediately sending the audio to an identification service;
the server sends the ASR result into semantic service to carry out NLP;
the server sends the NLP result to the DM, and returns the DM result to the lower computer, and the lower computer immediately stores the DMresult in a data cache queue;
and the lower computer sends the DM result in the cache queue to the websocket client through the websocket server.
Scenario 4: the upper computer does not receive the message and the lower computer broadcasts the message
Inputting audio;
the audio acquisition module acquires audio;
judging whether it is in wake-up state, if so, sending audio to identification service
The server sends the ASR result into semantic service to carry out NLP;
the server sends the NLP result to the DM;
the service end sends the DM result to TTS, and the TTS result is returned to the lower computer, and the lower computer stores the TTSresult in a data cache queue;
and the lower computer sends the TTS result to the audio playing module.
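The four scenarios differ only in which result the lower computer caches and where it delivers it. As a consolidating illustration (the mode values and service objects below are assumptions, not the patent's API), the dispatch might look like this:

```python
# Consolidated sketch of the four scenarios; "mode" selects which
# result the upper computer subscribed to. All service objects and
# names are hypothetical.
def handle_audio(audio, mode, services, cache_queue, player):
    if mode == "wakeup":                      # scenario 1
        cache_queue.put(services.kws.detect(audio))
        return
    text = services.asr.recognize(audio)      # scenarios 2-4 start with ASR
    if mode == "asr":                         # scenario 2
        cache_queue.put(text)
        return
    reply = services.dm.respond(services.nlu.understand(text))
    if mode == "dm":                          # scenario 3
        cache_queue.put(reply)
    else:                                     # scenario 4: broadcast locally
        cache_queue.put(services.tts.synthesize(reply))
        player.play(cache_queue.get())
```

For scenarios 1 to 3, the cached result is then pushed to the upper computer's websocket client as in the earlier sketches; in scenario 4 nothing is pushed and the audio is played locally.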
Referring to fig. 6, a block diagram of a voice interaction apparatus for an upper computer and a lower computer according to an embodiment of the present invention is shown.
As shown in fig. 6, a voice interaction device 600 for an upper computer and a lower computer includes: a judging module 610, a waking module 620, a receiving and buffering module 630 and a sending module 640.
The judging module 610 is configured to, in response to a user's input audio, determine whether the device is in the wake-up state; the wake-up module 620 is configured to send the input audio to a wake-up kernel if it is not in the wake-up state, where the wake-up kernel outputs a wake-up result based on the input audio; the receiving cache module 630 is configured to receive the wake-up result and store it in a data cache queue; and the sending module 640 is configured to send the wake-up result to the websocket client in the upper computer via the websocket server.
In some optional embodiments, the judging module 610 is further configured to, in response to a user's input audio, determine whether the device is in the wake-up state; the wake-up module 620 is further configured to input the input audio to a speech recognition service if it is in the wake-up state, where the speech recognition service outputs a speech recognition result based on the input audio; the receiving cache module 630 is further configured to receive the speech recognition result and store it in a data cache queue; and the sending module 640 is further configured to send the speech recognition result to the websocket client in the upper computer via the websocket server.
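Purely as a structural illustration of how the four modules of device 600 could be wired together (every class and method name below is hypothetical, not from the patent):

```python
# Structural sketch of device 600; all names are illustrative
# stand-ins for the patent's functional modules.
class VoiceInteractionDevice:
    def __init__(self, wakeup_kernel, asr_service, cache_queue, ws_push):
        self.kernel = wakeup_kernel
        self.asr = asr_service
        self.queue = cache_queue    # receiving cache module 630's queue
        self.push = ws_push         # sending module 640's websocket push
        self.awake = False

    def on_audio(self, audio: bytes) -> None:
        if not self.awake:                      # judging module 610
            result = self.kernel.detect(audio)  # wake-up module 620
        else:                                   # optional embodiment
            result = self.asr.recognize(audio)
        self.queue.put(result)                  # receiving cache module 630
        self.push(self.queue.get())             # sending module 640
```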
It should be understood that the modules recited in fig. 6 correspond to various steps in the methods described with reference to fig. 1, 2, 3, and 4. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 6, and are not described again here.
It should be noted that the modules in the embodiments of the present application do not limit the scheme of the application. For example, the judging module may be described as a module that, in response to a user's input audio, determines whether the device is in the wake-up state. In addition, the related functional modules may be implemented by a hardware processor; for example, the judging module may also be realized by a processor, which is not described in detail here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the voice interaction method for an upper computer and a lower computer of any of the above method embodiments.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in response to a user's input audio, determining whether the device is in the wake-up state;
if it is not in the wake-up state, sending the input audio to a wake-up kernel, where the wake-up kernel outputs a wake-up result based on the input audio;
receiving the wake-up result and storing it in a data cache queue;
and sending the wake-up result to the websocket client in the upper computer via the websocket server.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a voice interaction device for an upper computer and a lower computer, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory remotely located from the processor, which may be connected over a network to a voice interaction device for the upper and lower computers. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, and the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above voice interaction methods for an upper computer and a lower computer.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 7, the electronic device includes one or more processors 710 and a memory 720; one processor 710 is taken as an example in fig. 7. The device for the voice interaction method of the upper computer and the lower computer may further include an input device 730 and an output device 740. The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or in another manner; connection by a bus is taken as an example in fig. 7. The memory 720 is a non-volatile computer-readable storage medium as described above. The processor 710 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 720, thereby implementing the method of the above method embodiments for the voice interaction device of the upper and lower computers. The input device 730 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice interaction device; the output device 740 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to a voice interaction device between an upper computer and a lower computer, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
in response to a user's input audio, determine whether the device is in the wake-up state;
if it is not in the wake-up state, send the input audio to a wake-up kernel, where the wake-up kernel outputs a wake-up result based on the input audio;
receive the wake-up result and store it in a data cache queue;
and send the wake-up result to the websocket client in the upper computer via the websocket server.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) a mobile communication device: such devices are characterized by mobile communications capabilities and are primarily targeted at providing voice, data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) Ultra mobile personal computer device: the equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include: PDA, MID, and UMPC devices, etc., such as ipads.
(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) Server: similar in architecture to a general-purpose computer, but with higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like, because it must provide highly reliable services.
(5) And other electronic devices with data interaction functions.
The above device embodiments are merely illustrative. Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without creative effort.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice interaction method for an upper computer and a lower computer, wherein the upper computer and the lower computer establish a connection via websocket, the lower computer includes a websocket server, and the upper computer includes a websocket client; when the upper computer requires only the wake-up result, the method comprises:
in response to a user's input audio, determining whether the device is in the wake-up state;
if it is not in the wake-up state, sending the input audio to a wake-up kernel, wherein the wake-up kernel outputs a wake-up result based on the input audio;
receiving the wake-up result and storing it in a data cache queue;
and sending the wake-up result to the websocket client in the upper computer via the websocket server.
2. The method of claim 1, wherein, when the upper computer requires only a speech recognition result, the method further comprises:
in response to a user's input audio, determining whether the device is in the wake-up state;
if it is in the wake-up state, inputting the input audio to a speech recognition service, wherein the speech recognition service outputs a speech recognition result based on the input audio;
receiving the speech recognition result and storing it in a data cache queue;
and sending the speech recognition result to the websocket client in the upper computer via the websocket server.
3. The method of claim 1, wherein, when the upper computer requires only the dialogue management result, the method further comprises:
in response to a user's input audio, determining whether the device is in the wake-up state;
if it is in the wake-up state, inputting the input audio to a speech recognition service, wherein the speech recognition service outputs a speech recognition result based on the input audio;
receiving the speech recognition result and inputting it to a semantic understanding service, wherein the semantic understanding service outputs a semantic understanding result based on the speech recognition result;
receiving the semantic understanding result and inputting it to a dialogue management service, wherein the dialogue management service outputs a dialogue management result based on the semantic understanding result;
receiving the dialogue management result and storing it in a data cache queue;
and sending the dialogue management result to the websocket client in the upper computer via the websocket server.
4. The method of claim 1, wherein, when the upper computer does not require any results, the method further comprises:
in response to input audio from a user, judging whether the wake-up state is active;
if in the wake-up state, inputting the input audio to a voice recognition service, wherein the voice recognition service outputs a voice recognition result based on the input audio;
receiving the voice recognition result and inputting the voice recognition result to a semantic understanding service, wherein the semantic understanding service outputs a semantic understanding result based on the voice recognition result;
receiving the semantic understanding result and inputting the semantic understanding result to a dialogue management service, wherein the dialogue management service outputs a dialogue management result based on the semantic understanding result;
receiving the dialogue management result and inputting the dialogue management result to a voice synthesis service, wherein the voice synthesis service outputs a voice synthesis result based on the dialogue management result;
receiving the voice synthesis result and storing the voice synthesis result in the data cache queue; and
broadcasting the voice synthesis result as audio.
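In the same hypothetical style, claim 4 extends the cascade with synthesis and local playback; tts_service and play_audio are placeholders, since the patent does not specify an audio backend:

    def tts_service(text: str) -> bytes:
        # Placeholder synthesis; a real service would return PCM audio.
        return text.encode("utf-8")

    def play_audio(pcm: bytes) -> None:
        # Placeholder playback; a real device would write to its audio output.
        pass

    async def on_input_audio_tts(audio: bytes) -> None:
        # Claim 4: the full chain ends in synthesis, and the lower computer
        # broadcasts the audio itself rather than sending a result upward.
        if awake:
            text = asr_service(audio)
            semantics = nlu_service(text)
            dm_result = dm_service(semantics)
            pcm = tts_service(dm_result["reply"])            # voice synthesis result
            await result_queue.put({"tts_bytes": len(pcm)})  # cache, per the claim
            play_audio(pcm)                                  # audio broadcast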
5. The method according to any one of claims 1-4, further comprising, with respect to the wake-up word:
in response to a wake-up word customization instruction from the user, customizing a personalized wake-up word for the user.
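A trivial sketch of such customization, under the same assumptions, might simply swap the keyword the hypothetical wake-up kernel spots; set_wake_word and the default phrase are illustrative only:

    wake_word = "hello device"  # assumed default wake-up word

    def set_wake_word(word: str) -> None:
        # Respond to a wake-up word customization instruction from the user.
        global wake_word
        wake_word = word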
6. The method according to any one of claims 1-4, wherein the upper computer has a display function, and the lower computer performs voice interaction with the user and transmits the result to the upper computer for display.
7. A voice interaction device for an upper computer and a lower computer, comprising:
a judging module configured to, in response to input audio from a user, judge whether a wake-up state is active;
a wake-up module configured to send the input audio to a wake-up kernel if not in the wake-up state, wherein the wake-up kernel outputs a wake-up result based on the input audio;
a receiving cache module configured to receive the wake-up result and store the wake-up result in a data cache queue; and
a sending module configured to send the wake-up result to a WebSocket client in the upper computer through a WebSocket server.
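The four modules of claim 7 might map onto a class roughly as follows, reusing the hypothetical helpers sketched above; the patent prescribes only the module responsibilities, not this structure:

    class VoiceInteractionDevice:
        def __init__(self) -> None:
            self.awake = False
            self.queue: "asyncio.Queue[dict]" = asyncio.Queue()

        async def judge(self, audio: bytes) -> None:
            # Judging module: check the wake-up state on each input.
            if not self.awake:
                await self.wake(audio)

        async def wake(self, audio: bytes) -> None:
            # Wake-up module: run the wake-up kernel on the audio.
            result = wake_kernel(audio)
            self.awake = bool(result["wakeup"])
            await self.cache(result)

        async def cache(self, result: dict) -> None:
            # Receiving cache module: store results in the data cache queue.
            await self.queue.put(result)

        async def send(self, websocket) -> None:
            # Sending module: push queued results to the WebSocket client.
            while True:
                await websocket.send(json.dumps(await self.queue.get()))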
8. The voice interaction device for an upper computer and a lower computer of claim 7, wherein:
the judging module is further configured to, in response to input audio from a user, judge whether the wake-up state is active;
the wake-up module is further configured to input the input audio to a voice recognition service if in the wake-up state, wherein the voice recognition service outputs a voice recognition result based on the input audio;
the receiving cache module is further configured to receive the voice recognition result and store the voice recognition result in the data cache queue; and
the sending module is further configured to send the voice recognition result to the WebSocket client in the upper computer through the WebSocket server.
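For completeness, a hedged sketch of the upper-computer side: a WebSocket client that receives whichever result the lower computer pushes, again assuming the Python websockets package and a hypothetical host name:

    import asyncio
    import json
    import websockets

    async def upper_computer_client(uri: str = "ws://lower-computer:8765") -> None:
        # Connect to the WebSocket server in the lower computer and consume
        # wake-up / recognition / dialogue results as they arrive.
        async with websockets.connect(uri) as websocket:
            async for message in websocket:
                result = json.loads(message)
                print("result from lower computer:", result)  # e.g. hand to the display

    if __name__ == "__main__":
        asyncio.run(upper_computer_client())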
9. A computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any one of claims 1-6.
10. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.
CN202010654204.2A 2020-07-08 2020-07-08 Voice interaction method and device for upper computer and lower computer Pending CN111816190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010654204.2A CN111816190A (en) 2020-07-08 2020-07-08 Voice interaction method and device for upper computer and lower computer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010654204.2A CN111816190A (en) 2020-07-08 2020-07-08 Voice interaction method and device for upper computer and lower computer

Publications (1)

Publication Number Publication Date
CN111816190A 2020-10-23

Family

ID=72841976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010654204.2A Pending CN111816190A (en) 2020-07-08 2020-07-08 Voice interaction method and device for upper computer and lower computer

Country Status (1)

Country Link
CN (1) CN111816190A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180077025A1 (en) * 2016-09-12 2018-03-15 Edward Linn Helvey Remotely assigned, bandwidth-limiting internet access apparatus and method
CN107808670A (en) * 2017-10-25 2018-03-16 百度在线网络技术(北京)有限公司 Voice data processing method, device, equipment and storage medium
US20190208022A1 (en) * 2018-01-02 2019-07-04 Sap Se Transport channel via web socket for odata
CN110677406A (en) * 2019-09-26 2020-01-10 上海译牛科技有限公司 Simultaneous interpretation method and system based on network
CN110837426A (en) * 2019-11-06 2020-02-25 腾讯科技(深圳)有限公司 Message processing method, device and system and storage medium
CN110943910A (en) * 2019-12-10 2020-03-31 杭州当虹科技股份有限公司 WebSocket-based interphone implementation method
CN111031063A (en) * 2019-12-24 2020-04-17 广东小天才科技有限公司 Data transmission method and device based on family education machine

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113746833A (en) * 2021-09-02 2021-12-03 上海商汤智能科技有限公司 Communication method and apparatus, electronic device, and storage medium
CN113886100A (en) * 2021-09-23 2022-01-04 阿波罗智联(北京)科技有限公司 Voice data processing method, device, equipment and storage medium
CN114172882A (en) * 2021-11-19 2022-03-11 武汉紫阑信息技术有限公司 Webpage end message reminding method, device, equipment and storage medium
CN114172882B (en) * 2021-11-19 2024-04-12 武汉紫阑信息技术有限公司 Webpage end message reminding method, device, equipment and storage medium
CN114267358B (en) * 2021-12-17 2023-12-12 北京百度网讯科技有限公司 Audio processing method, device, equipment and storage medium
CN114844878A (en) * 2022-03-29 2022-08-02 宁德星云检测技术有限公司 WebSocket-based lithium battery test system communication method and device
CN114844878B (en) * 2022-03-29 2023-04-11 宁德星云检测技术有限公司 WebSocket-based lithium battery test system communication method and device

Similar Documents

Publication Publication Date Title
CN111816190A (en) Voice interaction method and device for upper computer and lower computer
JP6956126B2 (en) Third-party application interaction methods and systems
CN107423809B (en) Virtual robot multi-mode interaction method and system applied to video live broadcast platform
WO2014208231A1 (en) Voice recognition client device for local voice recognition
KR20190075800A (en) Intelligent personal assistant interface system
CN113327609B (en) Method and apparatus for speech recognition
JP7365985B2 (en) Methods, devices, electronic devices, computer-readable storage media and computer programs for recognizing speech
JP2023509868A (en) SERVER-SIDE PROCESSING METHOD AND SERVER FOR ACTIVELY PROPOSING START OF DIALOGUE, AND VOICE INTERACTION SYSTEM FOR POSITIVELY PROPOSING START OF DIALOGUE
CN109767763B (en) Method and device for determining user-defined awakening words
CN111145732B (en) Processing method and system after multi-task voice recognition
CN111833880A (en) Voice conversation method and system
CN110675873B (en) Data processing method, device and equipment of intelligent equipment and storage medium
US11763819B1 (en) Audio encryption
CN110992955A (en) Voice operation method, device, equipment and storage medium of intelligent equipment
CN111312233A (en) Voice data identification method, device and system
CN112863508A (en) Wake-up-free interaction method and device
CN111128166B (en) Optimization method and device for continuous awakening recognition function
CN113205809A (en) Voice wake-up method and device
CN113961289A (en) Data processing method, device, equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
WO2021103741A1 (en) Content processing method and apparatus, computer device, and storage medium
CN112447177B (en) Full duplex voice conversation method and system
CN117275476A (en) Digital person interaction method and device, electronic equipment and storage medium
CN111726284A (en) WeChat sending method and device for vehicle-mounted intelligent sound box
CN112002325B (en) Multi-language voice interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant after: Sipic Technology Co.,Ltd.
Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
Applicant before: AI SPEECH Co.,Ltd.

RJ01 Rejection of invention patent application after publication
Application publication date: 20201023