WO2020119542A1 - Voice interaction method, device and system

Info

Publication number
WO2020119542A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice data
interaction
configuration information
voice
Prior art date
Application number
PCT/CN2019/122934
Other languages
French (fr)
Chinese (zh)
Inventor
祝俊
袁英灿
王德淼
孟伟
吴逸超
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2020119542A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 2015/223 Execution procedure of a spoken command
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/4401 Bootstrapping
    • G06F 9/4418 Suspend and resume; Hibernate and awake

Definitions

  • The invention relates to the field of computer technology, and in particular to a voice interaction method, device, and system.
  • The present invention provides a voice interaction method, device, and system, in an effort to solve, or at least alleviate, at least one of the problems described above.
  • According to one aspect, a voice interaction method includes the steps of: sending first voice data input by a user to a server, so that the server confirms an interaction scenario based on the first voice data; acquiring configuration information based on the interaction scenario; and processing second voice data input by the user based on the acquired configuration information and outputting a response.
  • The method according to the present invention further includes the step of presetting configuration information for each interaction scenario, where the configuration information contains at least one piece of target data for use in that interaction scenario.
  • The step of processing the second voice data input by the user based on the configuration information and outputting a response includes: determining whether the second voice data input by the user matches the target data in the configuration information; if it matches, acquiring the state data at the current moment; sending the second voice data and the state data to the server, so that the server recognizes the second voice data according to the state data and returns a response instruction; and outputting a response to the user according to the response instruction.
  • The method according to the present invention further includes steps for receiving third voice data input by the user: detecting whether the third voice data contains a predetermined object; and, if it does, entering the interactive state.
  • The step of sending the first voice data input by the user to the server, so that the server confirms the interaction scenario according to the first voice data, includes: in response to the user inputting the first voice data, acquiring the state data at the current moment; and sending the first voice data and the state data to the server, so that the server performs recognition processing on the first voice data according to the state data and returns a response instruction, where the response instruction also includes the interaction scenario.
  • The step of acquiring configuration information based on the interaction scenario further includes: outputting a response to the user according to the response instruction.
  • The method according to the present invention further includes the steps of: in response to the user's request to switch the interaction scenario, forwarding the request to the server, so that the server confirms the interaction scenario to be switched to; determining whether to close the pre-switch interaction scenario; if the pre-switch interaction scenario is closed, obtaining the configuration information based on the post-switch interaction scenario; and, if the pre-switch interaction scenario is not closed, obtaining the configuration information based on both the pre-switch and post-switch interaction scenarios.
  • According to another aspect, a voice interaction method includes the steps of: determining an interaction scenario based on first voice data input by a user; acquiring configuration information based on the interaction scenario; and processing second voice data input by the user based on the acquired configuration information and outputting a response.
  • According to yet another aspect, a voice interaction device includes: a connection management unit, adapted to receive first voice data input by a user and send it to a server, so that the server confirms the interaction scenario according to the first voice data; an information acquisition unit, adapted to acquire configuration information based on the interaction scenario; and a data processing unit, adapted to process second voice data input by the user based on the acquired configuration information and output a response.
  • The device according to the present invention further includes an information storage unit, adapted to pre-store configuration information for each interaction scenario, where the configuration information contains at least one piece of target data for use in that interaction scenario.
  • The data processing unit further includes a judgment module, adapted to judge whether the second voice data input by the user matches the target data in the configuration information; the information acquisition unit is further adapted to acquire the state data at the current moment when the second voice data matches the target data; the connection management unit is further adapted to send the second voice data and the state data to the server and to receive the response instruction returned after the server performs recognition processing on the second voice data according to the state data; and the connection management unit is further adapted to output a response to the user according to the response instruction.
  • The connection management unit is further adapted to receive third voice data input by the user; the device further includes a detection unit, adapted to detect whether the third voice data input by the user contains a predetermined object and to enter the interactive state when it does.
  • The connection management unit is further adapted to forward the user's request to switch the interaction scenario to the server, so that the server confirms the interaction scenario to be switched to.
  • The information acquisition unit further includes a decision module, adapted to determine whether to close the pre-switch interaction scenario.
  • The information acquisition unit is further adapted to obtain the configuration information based on the post-switch interaction scenario when the pre-switch interaction scenario is closed, and based on both the pre-switch and post-switch interaction scenarios when it is not.
  • According to a further aspect, a voice interaction system includes: a client, including the voice interaction device described above; and a server, adapted to receive voice data and state data from the client and to determine the interaction scenario of the client based on the state data and the voice data.
  • The server is further adapted to perform recognition processing on the voice data according to the state data and to return a response instruction to the client.
  • Optionally, the client is a smart speaker.
  • According to a further aspect, a smart speaker includes: an interface unit, adapted to receive first voice data input by a user; and an interaction control unit, adapted to determine an interaction scenario based on the first voice data, acquire configuration information based on the interaction scenario, and further process second voice data based on the configuration information and output a response.
  • According to a further aspect, a computing device includes: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and include instructions for performing any of the methods described above.
  • According to a further aspect, a readable storage medium stores program instructions that, when read and executed by a computing device, cause the computing device to perform any of the methods described above.
  • According to the solution of the present invention, when the client receives the first voice data input by the user, it forwards the first voice data to the server, and the server confirms the interaction scenario; the client then obtains the configuration information according to the interaction scenario, and within that scenario, as long as the voice data input by the user matches the target data in the configuration information, the client is awakened directly for voice interaction.
  • This solution can reduce the interaction cost and improve the user experience, as illustrated by the sketch below.
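  • As an illustration only, the claimed flow can be sketched in a few lines of Python; every name below (server.recognize, client.load_configuration, and so on) is a hypothetical placeholder, since the patent does not prescribe any concrete API.

        # Minimal sketch of the claimed flow; all names are hypothetical.
        def voice_interaction_round(client, server, first_voice_data):
            # 1. Forward the first voice data so the server can confirm the scenario.
            response = server.recognize(first_voice_data, client.current_state())
            scenario = response["interaction_scenario"]
            # 2. Load the preset configuration information for that scenario.
            config = client.load_configuration(scenario)
            # 3. Later voice data that matches the target data in the configuration
            #    wakes the client directly, without the wake-up word.
            second_voice_data = client.listen()
            if client.matches_target(second_voice_data, config["target_data"]):
                reply = server.recognize(second_voice_data, client.current_state())
                client.output_response(reply)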
  • FIG. 1 shows a schematic diagram of a scene of a voice interaction system 100 according to an embodiment of the present invention.
  • FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention.
  • FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention.
  • FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention.
  • FIG. 5 shows a schematic diagram of a voice interaction device 500 according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a scene of a voice interaction system 100 according to an embodiment of the present invention.
  • The system 100 includes a client 110 and a server 120.
  • The system 100 shown in FIG. 1 is only an example; those skilled in the art can understand that, in practical applications, the system 100 generally includes multiple clients 110 and servers 120.
  • The present invention does not limit the number of clients 110 and servers 120 included in the system 100.
  • The client 110 is a smart device having a voice interaction device (for example, the voice interaction device 500 according to an embodiment of the present invention), which can receive voice instructions from a user and return voice or non-voice information to the user.
  • A typical voice interaction device includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor.
  • The voice interaction device may be built into the client 110, or may cooperate with the client 110 as an independent module (for example, communicating with the client 110 via an API or by other means to call functions or applications on the client 110); the embodiments of the present invention do not limit this.
  • The client 110 may be, for example, a mobile device with a voice interaction device (e.g., a smart speaker), a smart robot, or a smart home appliance (including a smart TV, a smart refrigerator, a smart microwave oven, etc.), but is not limited thereto.
  • One application scenario of the client 110 is the home: the client 110 is placed in the user's home, and the user can issue voice instructions to the client 110 to implement certain functions, such as browsing the Internet, playing songs on demand, shopping, checking the weather forecast, and controlling other smart home devices.
  • The server 120 communicates with the client 110 via a network and may be, for example, a cloud server physically located in one or more locations.
  • The server 120 provides a recognition service for the voice data received by the client 110 to obtain a text representation of the voice data input by the user; the server 120 also derives a representation of the user's intention from the text representation and generates a response instruction to return to the client 110.
  • The client 110 performs corresponding operations according to the response instruction to provide the user with corresponding services, such as setting an alarm clock, making a call, sending an email, broadcasting information, or playing songs and videos.
  • The client 110 may also output a corresponding voice response to the user according to the response instruction; the embodiments of the present invention do not limit this.
  • The microphone of the voice interaction module continuously receives external sounds.
  • When the user wants to use the client 110 for voice interaction, the user first needs to speak the corresponding wake-up word to wake up the client 110 (more precisely, to wake up the voice interaction module in the client 110), so that it enters the interactive state.
  • After the client 110 ends a voice interaction, if the user wants to interact with the client 110 again, the user needs to input the wake-up word again to wake up the client 110.
  • The following exemplarily shows some voice interaction processes, in which the fixed wake-up word is set to "elf".
  • In the system 100, one or more pieces of target data that the user may use in each interaction scenario are preset according to the interaction scenarios in which the user interacts with the client 110 by voice; these pieces of target data constitute the configuration information for each interaction scenario.
  • In other words, the configuration information includes interaction templates corresponding to the various interaction scenarios.
  • In a specific interaction scenario, the user does not need to enter the wake-up word repeatedly to interact with the client 110, as long as the input voice instruction contains the target data of that interaction scenario.
  • In the interaction scenario of listening to songs, the target data may be: "previous song", "next song", "collect this song", "louder", "pause playback", "continue playing", "what is the current song", and so on; these pieces of target data constitute the configuration information corresponding to the song-listening interaction scenario.
  • In a video-playing scenario, the target data may be set to "a little louder", "turn up the volume", and so on, as in the sketch below.
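  • As an illustration, the configuration information might be stored as a simple mapping from interaction scenario to target phrases; only the example phrases come from the text above, while the structure and names below are assumptions.

        # Hypothetical shape of the preset configuration information.
        CONFIGURATION = {
            "music_scene": {
                "target_data": [
                    "previous song", "next song", "collect this song",
                    "louder", "pause playback", "continue playing",
                    "what is the current song",
                ],
            },
            "video_scene": {
                "target_data": ["a little louder", "turn up the volume"],
            },
        }

        def load_configuration(scenario: str) -> dict:
            # Fall back to an empty target list for unknown scenarios.
            return CONFIGURATION.get(scenario, {"target_data": []})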
  • After receiving the voice data input by the user, the client 110 also obtains its state data at the current moment and transmits it to the server 120 together with the voice data.
  • The state data of the client 110 indicates, for example, which application or similar software the user is operating on the client 110.
  • For example, the user may be using an application to play streaming video, or using social software to communicate with a specific contact; the state data is not limited to this.
  • The server 120 can also perform scene analysis based on the state data and the voice data to confirm the interaction scenario that the user expects to enter when inputting the voice data. For example, when the user inputs the voice data "I want to watch a drama" and the server 120 confirms through the state data that music player software is currently in use on the client 110, the server 120 can basically determine that the user expects to enter the video-playing interaction scenario. As another example, if the user inputs the voice data "What is the weather in Hangzhou now", the server 120 can basically confirm that the user expects to enter the weather-forecast interaction scenario.
  • The server 120 returns the confirmed interaction scenario together with the response instruction to the client 110.
  • The client 110 obtains the corresponding configuration information according to the interaction scenario. In this way, within this interaction scenario, the client 110 only needs to determine whether the voice data input by the user matches the target data in the configuration information, and if so, it directly outputs a response.
  • The voice interaction process between the user and the client 110 may thus be optimized: from the user's point of view, the client 110 is always in the interactive state, and the user can directly input voice instructions to make the client 110 execute the corresponding operations.
  • The user does not need to repeat interaction steps that were previously executed (for example, waking up the client 110), which reduces the interaction cost and improves the user experience.
  • The following takes the client 110 implemented as a smart speaker as an example to outline the voice interaction solution according to an embodiment of the present invention.
  • The smart speaker includes an interface unit and an interaction control unit.
  • The interface unit receives the first voice data input by the user; the interaction control unit determines the interaction scenario according to the first voice data and obtains the configuration information based on the interaction scenario; the interaction control unit can also process the second voice data based on the configuration information and output a response.
  • It should be noted that the server 120 may also be implemented as another electronic device (e.g., another computing device in the same IoT environment) connected to the client 110 through the network; when the client 110 has sufficient storage space and computing power, the server 120 can even be implemented as the client 110 itself.
  • FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the invention.
  • In a basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204.
  • A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
  • The processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • The processor 204 may include one or more levels of cache, such as a level-one cache 210 and a level-two cache 212, a processor core 214, and registers 216.
  • The example processor core 214 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
  • The system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof.
  • The system memory 206 may include an operating system 220, one or more applications 222, and program data 224.
  • The application 222 may be arranged to be executed on the operating system by the one or more processors 204 using the program data 224.
  • The computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., an output device 242, a peripheral interface 244, and a communication device 246) to the basic configuration 202 via a bus/interface controller 230.
  • The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 252.
  • The example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication, via one or more I/O ports 258, with input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripheral devices (e.g., printer, scanner, etc.).
  • The example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
  • The network communication link may be an example of a communication medium.
  • Communication media can generally be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and can include any information delivery media.
  • A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a way as to encode information in the signal.
  • As non-limiting examples, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as sound, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • The term "computer-readable media" as used herein may include both storage media and communication media.
  • The computing device 200 may be implemented as a server, such as a file server, a database server, an application server, or a web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 200 may also be implemented as part of a small-sized portable (or mobile) electronic device. In the embodiments according to the present invention, the computing device 200 is configured to perform the voice interaction method according to the present invention.
  • The application 222 of the computing device 200 includes multiple program instructions for executing the voice interaction method 300 according to the present invention.
  • FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention.
  • The interaction method 300 is suitable for execution in the system 100 described above.
  • In the embodiments of the present invention, the voice data (or voice instructions) input by the user is divided into third voice data (voice data used to wake up the client 110, generally containing a predetermined object, i.e., a wake-up word), first voice data (voice data input by the user after the client 110 is awakened, generally containing a general instruction), and second voice data (voice data input by the user after the interaction scenario is confirmed, generally containing target data).
  • The method 300 starts at step S301.
  • In step S301, the client 110 receives the third voice data input by the user, detects whether it contains a predetermined object (for example, a predetermined wake-up word), and enters the interactive state if the predetermined object is present.
  • The third voice data is generally used to wake up the client 110 and place it in the interactive state.
  • The predetermined object may be set in advance when the client 110 leaves the factory, or may be set by the user while using the client 110.
  • The present invention does not limit the length or content of the predetermined object.
  • Optionally, when detecting that the third voice data contains the predetermined object, the client 110 responds to the user by playing a voice, for example, "Hello, please speak", to inform the user that the client 110 is already in the interactive state and voice interaction can begin.
  • In step S302, the client 110 receives the first voice data input by the user and, in response to the user input, acquires the state data of the client 110 at the current moment.
  • The state data of the client 110 may include any information available on the client 110.
  • In some embodiments, the state data of the client 110 includes one or more of the following: the client's process data, the client's application list, application usage history data on the client, the user's personal data associated with the client, data obtained from at least one sensor of the client (such as the client's location information, environmental information, etc.), and text data in the client's display interface, but it is not limited thereto. A sketch of such a snapshot follows.
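  • A sketch of what such a state-data snapshot could look like when serialized for the server is given below; every field name and client method is an assumption, since the patent only names the categories of state data.

        import json
        import time

        def collect_state_data(client) -> str:
            # Illustrative snapshot of the state items listed above.
            state = {
                "timestamp": time.time(),
                "running_processes": client.process_list(),   # process data
                "installed_apps": client.app_list(),          # application list
                "usage_history": client.recent_app_usage(),   # usage history data
                "foreground_text": client.screen_text(),      # display-interface text
                "location": client.sensor("location"),        # sensor data
            }
            return json.dumps(state)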
  • In step S303, the client 110 sends the first voice data from the user, together with the local state data, to the server 120.
  • In step S304, the server 120 performs recognition processing on the first voice data according to the received state data.
  • The recognition processing of the first voice data by the server 120 may be divided into two parts.
  • First, the server 120 recognizes the first voice data through ASR (Automatic Speech Recognition) technology.
  • In some embodiments, the server 120 may first represent the first voice data as text data and then perform word segmentation on the text data to obtain a text representation of the first voice data (it should be noted that other representations of the voice data may also be used; it is not limited to a text representation).
  • A typical ASR method may be, for example, a method based on a vocal tract model and speech knowledge, a template matching method, or a method using a neural network.
  • The embodiments of the present invention do not place too many restrictions on which ASR method is used for the speech recognition processing; any known or future algorithm of this type can be combined with the embodiments of the present invention to implement the method 300.
  • Optionally, before recognition, the server 120 may also perform some preprocessing operations on the first voice data, such as sampling, quantization, removing voice data that does not contain speech content (e.g., silent voice data), framing, windowing, and so on.
  • Second, the server 120 processes the text representation in combination with the state data to understand the user's intention, finally obtaining a representation of that intention.
  • In some embodiments, the server 120 may use NLP (Natural Language Processing) methods to understand the first voice data input by the user and finally identify the user's intention; the user's intention often corresponds to an actual operation, such as playing music or viewing contacts.
  • The server 120 may further determine the parameters of the user's intention, such as which song, or which singer's songs, should be played, and so on.
  • The embodiments of the present invention do not place too many restrictions on which NLP algorithm is used to understand the user's intention; any known or future algorithm of this type can be combined with the embodiments of the present invention to implement the method 300.
  • Meanwhile, the server 120 determines the current interaction scenario by analyzing the user's intention.
  • The interaction scenario characterizes the scenario that the client 110 is currently in, or is about to enter according to the user's input.
  • The interaction scenario may be, for example, a call scenario, a short-message scenario, a music scenario, a video scenario, a browsing-information scenario, and so on.
  • The server 120 generates a response instruction after performing the above recognition on the first voice data.
  • The response instruction includes a response to the user's intention and the specific execution parameters.
  • For example, if the first voice data input by the user is "I want to listen to a song", the response instruction generated by the server 120 includes a playback instruction.
  • Optionally, the response instruction may include corresponding text with which to reply to the voice data input by the user; for example, the response instruction contains the text data "OK, about to play for you".
  • The response instruction may also include execution parameters of the playback instruction, such as a playlist, the cover of the song to be played, a download address, and so on, but is not limited thereto.
  • In the embodiments of the present invention, the response instruction also contains the interaction scenario.
  • Continuing the example above, the server 120 determines through processing and analysis that the interaction scenario corresponding to "I want to listen to a song" is the "music scenario"; then, in addition to the above parameters, the response instruction generated by the server 120 also includes the "music scenario". A sketch of such a response instruction follows.
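  • Putting these pieces together, a response instruction for "I want to listen to a song" might be serialized as sketched below; the patent names the contents of the fields but not any concrete wire format, so the layout is an assumption and elided specifics are left elided.

        # Hypothetical serialization of the response instruction described above.
        response_instruction = {
            "intent": "play_music",                     # response to the user's intention
            "reply_text": "OK, about to play for you",  # text to be rendered via TTS
            "interaction_scenario": "music_scene",      # scenario confirmed by the server
            "execution_params": {                       # execution parameters
                "playlist": ["..."],
                "cover_url": "...",
                "download_url": "...",
            },
        }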
  • In step S305, the server 120 returns the above response instruction to the client 110.
  • In step S306, the client 110 acquires configuration information based on the interaction scenario in the received response instruction.
  • As mentioned earlier, the client 110 is preset with configuration information for each interaction scenario.
  • The configuration information contains at least one piece of target data for use in the interaction scenario.
  • For the configuration information, refer to the related description of FIG. 1 above; details are not repeated here.
  • Meanwhile, the client 110 outputs a response to the user according to the relevant instructions and execution parameters in the response instruction.
  • For example, the client 110 converts the text data contained in the response instruction into voice data through TTS technology and replies to the user by voice, "OK, about to play for you"; at the same time, the client 110 executes the playback instruction to play songs for the user.
  • Of course, the client 110 may also download the corresponding songs, covers, and so on according to the execution parameters, which is not repeated here.
  • In step S307, the client 110 receives the second voice data input by the user and determines whether it matches the target data in the configuration information.
  • If they match, the interactive state is entered directly; that is, the user can wake up the client 110 without inputting the predetermined object again.
  • For example, in the music scenario, the user inputs the second voice data "next song"; after the client 110 determines that the second voice data matches the target data of the music scenario, the client 110 directly enters the interactive state.
  • Regarding how the match is determined, the embodiments of the present invention do not impose excessive restrictions: a person skilled in the art may calculate the matching degree between the second voice data and the target data in any way, and determine that the two match when the matching degree is higher than a preset value, as in the sketch below.
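  • One possible way to compute such a matching degree, using only the Python standard library, is sketched below; the 0.8 threshold is an arbitrary assumption standing in for the preset value.

        from difflib import SequenceMatcher

        def matches_target(second_voice_text: str, target_data: list[str],
                           threshold: float = 0.8) -> bool:
            # True if the recognized text is close enough to any preset target
            # phrase; SequenceMatcher.ratio() is just one usable similarity metric.
            return any(
                SequenceMatcher(None, second_voice_text.lower(), t.lower()).ratio() >= threshold
                for t in target_data
            )

        # e.g. matches_target("next song", CONFIGURATION["music_scene"]["target_data"])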
  • The client 110 then acquires the state data at the current moment, as described in step S302; details are not repeated here.
  • In step S308, the client 110 sends the second voice data and the state data to the server 120.
  • In step S309, the server 120 performs recognition processing on the second voice data according to the received state data.
  • The processing of the second voice data is completely consistent with the processing of the first voice data, so for the specific content of the recognition, refer to the related description of step S304; details are not repeated here.
  • In step S310, the server 120 returns a response instruction to the client 110.
  • Subsequently, the client 110 outputs a response to the user according to the response instruction.
  • Within an interaction scenario, steps S307 to S310 may be repeated.
  • The method 300 may also include a process of switching the interaction scenario.
  • In step S311, in response to the user's request to switch the interaction scenario, the client 110 forwards the request to the server 120.
  • In the embodiments of the present invention, the user's request to switch the interaction scenario can be triggered in the following two ways.
  • In the first way, the user again sends third voice data containing the predetermined object to the client 110.
  • The client 110 detects the predetermined object and determines that the user wants to switch the interaction scenario. For example, in the music scenario, the user inputs "Elf, help me check the current weather", which triggers a request to switch the interaction scenario.
  • In the second way, the user switches the display interface of the client 110 so that the client 110 jumps to another application or service.
  • For example, when the display interface of the client 110 is a video playback interface and the user switches it to a picture-shooting interface, a request to switch the interaction scenario is triggered.
  • In step S312, the server 120 confirms the interaction scenario to be switched to, and in the subsequent step S313 returns a response instruction.
  • In one embodiment, the server 120 may analyze the interaction scenario that the user wants to switch to according to the third voice data input by the user. For example, if the user inputs "Elf, help me check the current weather", the server 120 can determine from this that the interaction scenario to be switched to is the weather-query scenario.
  • Optionally, in response to the request to switch the interaction scenario, the client 110 also collects the state data at the current moment and sends it to the server 120 together with the request. In this way, the server 120 can use the state data to perform scene analysis to confirm the interaction scenario to be switched to. For example, when the display interface of the client 110 is switched from the video playback interface to the picture-shooting interface, the server 120 may determine that the interaction scenario to be switched to is a picture-shooting scenario.
  • Of course, the server 120 may also combine the state data and the third voice data input by the user to perform scene analysis to confirm the interaction scenario to be switched to.
  • After confirming the interaction scenario to be switched to, the server 120 generates a corresponding response instruction and returns it to the client 110, which outputs the response to the user, for example by switching to the application the user desires to open.
  • For a description of the response instruction, refer to the previous description; it is not expanded here.
  • In step S314, the client 110 determines whether to close the pre-switch interaction scenario.
  • In one embodiment, the client 110 determines, through the state data, whether to close the pre-switch interaction scenario.
  • For example, the client 110 obtains the process data currently being executed to make the judgment: if the executing process data does not include the process data corresponding to the pre-switch interaction scenario, the previous process has been closed, so the pre-switch interaction scenario is closed; if the executing process data still includes the process data corresponding to the pre-switch interaction scenario, the previous process is still running, so the pre-switch interaction scenario is not closed.
  • If the pre-switch interaction scenario is closed, the configuration information is obtained based on the post-switch interaction scenario.
  • If the pre-switch interaction scenario is not closed, the configuration information is obtained based on both the pre-switch and post-switch interaction scenarios; that is, while the original configuration information is retained, the configuration information corresponding to the post-switch interaction scenario is also acquired. A sketch of this decision logic follows.
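  • The decision logic above can be sketched as follows; the process-name mapping and the reuse of load_configuration from the earlier sketch are assumptions for illustration.

        def configuration_after_switch(old_scenario: str, new_scenario: str,
                                       running_processes: set,
                                       scenario_process: dict) -> list:
            # The pre-switch scenario counts as closed when its process is no
            # longer present in the executing process data (see the judgment above).
            old_closed = scenario_process.get(old_scenario) not in running_processes
            scenarios = [new_scenario] if old_closed else [old_scenario, new_scenario]
            # Keep (or drop) the original configuration accordingly, then add
            # the configuration of the post-switch scenario.
            return [t for s in scenarios
                    for t in load_configuration(s)["target_data"]]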
  • In summary, in the voice interaction solution of the embodiments of the present invention, the client combines local state, user habits, and other information to preset different configuration information for different interaction scenarios, so as to support rapid wake-up of the client in each interaction scenario, that is, to respond directly to the user's voice instruction without a wake-up word (i.e., a predetermined object).
  • When receiving the first voice data input by the user, the client 110 forwards it to the server 120, and the server 120 confirms the interaction scenario; the client 110 then obtains the configuration information according to the interaction scenario, and within this interaction scenario, as long as the voice data input by the user matches the target data in the configuration information, the client 110 is awakened directly for voice interaction.
  • In one aspect, this solution has the advantages of fast response and low cost.
  • In another aspect, the server 120 performs scene analysis based on the state data of the client 110 and closely integrates the recognition of the voice data with the current state of the client 110 and the interaction scenario, which can significantly improve the recognition accuracy.
  • FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention.
  • The method 400 shown in FIG. 4 is suitable for execution in the client 110 and further describes the method shown in FIG. 3.
  • The method 400 starts at step S410: the first voice data input by the user is sent to the server 120, so that the server 120 confirms the interaction scenario according to the first voice data.
  • In other embodiments, the client 110 may also determine the current interaction scenario locally according to the first voice data input by the user; for example, the client 110 confirms the current interaction scenario according to the current state data (for example, the application currently in use, but not limited to this). The embodiments of the present invention do not limit this.
  • In step S420, configuration information is acquired based on the interaction scenario.
  • In step S430, the second voice data input by the user is processed based on the acquired configuration information and a response is output.
  • In some embodiments, the method 400 further includes the step of setting the configuration information for each interaction scenario in advance.
  • The configuration information contains at least one piece of target data for use in the interaction scenario; through this target data, the client can be awakened directly into the interactive state.
  • The target data in the configuration information can be set in combination with the state of the client itself, the user's input preferences, and so on; it can be preset at the factory or set by the user while using the client 110. The embodiments of the present invention impose no restrictions on this.
  • In some embodiments, the method 400 also includes a process of switching the client 110 from the sleep state to the interactive state according to the third voice data input by the user.
  • In some embodiments, the method 400 also includes the step of reloading the configuration information in response to the user's request to switch the interaction scenario.
  • FIG. 5 shows a schematic diagram of a voice interaction device 500 residing in the client 110 according to an embodiment of the present invention.
  • As shown in FIG. 5, the voice interaction device 500 includes at least: an information storage unit 510, a connection management unit 520, an information acquisition unit 530, and a data processing unit 540.
  • The information storage unit 510 pre-stores the configuration information for each interaction scenario, where the configuration information contains at least one piece of target data for use in that interaction scenario.
  • The connection management unit 520 implements the various input/output operations of the voice interaction device 500, for example, receiving the first voice data input by the user and sending it to the server 120 so that the server 120 confirms the interaction scenario according to the first voice data.
  • The information acquisition unit 530 acquires the configuration information based on the interaction scenario.
  • The data processing unit 540 processes the second voice data input by the user based on the acquired configuration information and outputs a response.
  • In some embodiments, the data processing unit 540 further includes a judgment module 542, adapted to judge whether the second voice data input by the user matches the target data in the configuration information.
  • When they match, the information acquisition unit 530 acquires the state data at the current moment.
  • The connection management unit 520 sends the second voice data and the state data to the server 120 and receives the response instruction returned by the server 120 after performing recognition processing on the second voice data according to the state data. Finally, the connection management unit 520 also outputs a response to the user according to the response instruction.
  • The connection management unit 520 is also used to receive the third voice data input by the user.
  • In some embodiments, the voice interaction device 500 includes a detection unit (not shown) in addition to the above-mentioned parts.
  • The detection unit detects whether the third voice data input by the user contains a predetermined object, and the client 110 enters the interactive state when it does.
  • The connection management unit 520 may also, in response to the user's request to switch the interaction scenario, forward the request to the server 120 so that the server 120 confirms the interaction scenario to be switched to.
  • In some embodiments, the information acquisition unit 530 further includes a decision module 532, which determines whether to close the pre-switch interaction scenario. If it determines that the pre-switch interaction scenario is to be closed, the information acquisition unit 530 obtains the configuration information based on the post-switch interaction scenario; if it determines that the pre-switch interaction scenario is not to be closed, the information acquisition unit 530 obtains the configuration information based on both the pre-switch and post-switch interaction scenarios. A sketch of how these units could be composed follows.
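  • A skeleton showing how the units of the device 500 could be composed is sketched below; the class and method names are invented for illustration and collapse the four units into one object.

        class VoiceInteractionDevice:
            """Sketch of device 500; all names are hypothetical."""

            def __init__(self, server, configuration):
                self.server = server
                self.configuration = configuration  # information storage unit 510
                self.targets = {"target_data": []}

            def on_first_voice_data(self, voice, state):
                # Connection management unit 520: forward data, receive instruction.
                instruction = self.server.recognize(voice, state)
                # Information acquisition unit 530: load the scenario configuration.
                self.targets = self.configuration[instruction["interaction_scenario"]]
                return instruction

            def on_second_voice_data(self, voice, state):
                # Data processing unit 540 / judgment module 542: match, then send
                # (matches_target as in the earlier sketch).
                if matches_target(voice, self.targets["target_data"]):
                    return self.server.recognize(voice, state)
                return None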
  • The various technologies described herein may be implemented in combination with hardware or software, or a combination thereof.
  • Thus, the method and apparatus of the present invention, or certain aspects or parts thereof, may take the form of program code (i.e., instructions) embedded in tangible media such as a removable hard disk, USB flash drive, floppy disk, CD-ROM, or any other machine-readable storage medium, wherein when the program code is loaded into and executed by a machine such as a computer, the machine becomes a device for practicing the invention.
  • In the case where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • The memory is configured to store the program code; the processor is configured to execute the method of the present invention according to the instructions in the program code stored in the memory.
  • By way of example and not limitation, readable media include readable storage media and communication media.
  • A readable storage medium stores information such as computer-readable instructions, data structures, program modules, or other data.
  • Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of readable media.
  • The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device.
  • Various general-purpose systems may also be used with the examples of the present invention; from the above description, the structure required to construct such systems is obvious.
  • Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the contents of the present invention described herein, and the above descriptions of specific languages are made to disclose the best embodiments of the present invention.
  • Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from those in the examples.
  • The modules in the foregoing examples may be combined into one module or divided into multiple sub-modules.
  • The modules in the devices of the embodiments can be adaptively changed and arranged in one or more devices different from those of the embodiments.
  • The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice interaction method, device and system. The voice interaction method comprises the following steps: sending first voice data inputted by a user so as to receive an interactive scenario confirmed on the basis of the first voice data (S410); acquiring configuration information on the basis of the interactive scenario (S420); and on the basis of the acquired configuration information, processing second voice data inputted by the user and outputting a response (S430).

Description

Voice interaction method, device and system
This application claims priority to Chinese patent application No. 201811513712.8, filed on December 11, 2018 and entitled "Voice interaction method, device and system", the entire contents of which are incorporated herein by reference.
Technical Field
The invention relates to the field of computer technology, and in particular to a voice interaction method, device, and system.
Background
Over the past decade or so, the Internet has penetrated ever deeper into every area of people's lives, and people can conveniently shop, socialize, entertain themselves, and manage finances through it. At the same time, to improve the user experience, researchers have implemented many interaction schemes, such as text input, gesture input, and voice input. Among them, intelligent voice interaction has become a research hotspot of the new generation of interaction modes due to its convenience of operation.
With the gradual development of voice interaction technology, more and more smart devices have a voice wake-up function. The smart devices currently popular on the market are all configured with a fixed wake-up word; when voice data input by the user is received, the device directly judges whether it matches the preset fixed wake-up word. If the two match, the smart device is switched from the dormant state to the interactive or working state. Thus, every time the user wants to interact with the smart device, the user must first wake it up with the fixed wake-up word and then input a voice instruction. After one voice interaction ends, if the user wants to interact with the smart device again, the user needs to input the fixed wake-up word again to wake it up, and then input a voice instruction.
In this way, before every voice interaction the user needs to input the fixed wake-up word, which undoubtedly increases the number of user operations, thereby increasing the interaction cost and degrading the user's interactive experience. Therefore, an optimized voice interaction solution is needed.
发明内容Summary of the invention
为此,本发明提供了一种语音交互方法、装置及系统,以力图解决或至少缓解上面存在的至少一个问题。To this end, the present invention provides a voice interaction method, device and system, in an effort to solve or at least alleviate at least one of the above problems.
根据本发明的一个方面,提供了一种语音交互方法,包括步骤:将用户输入的第一语音数据发送给服务器,以便服务器根据所述第一语音数据确认交互场景;基于交互场景获取配置信息;以及基于所获取的配置信息对用户输入的第二语音数据进行处理并输 出响应。According to an aspect of the present invention, there is provided a voice interaction method, including the steps of: sending first voice data input by a user to a server, so that the server confirms an interaction scenario based on the first voice data; acquiring configuration information based on the interaction scenario; And processing the second voice data input by the user based on the obtained configuration information and outputting a response.
可选地,根据本发明的方法还包括步骤:预先设置各交互场景下的配置信息,其中,配置信息中包含至少一条用于在该交互场景下使用的目标数据。Optionally, the method according to the present invention further includes the step of presetting configuration information in each interaction scenario, where the configuration information includes at least one piece of target data for use in the interaction scenario.
可选地,在根据本发明的方法中,基于配置信息对用户输入的第二语音数据进行处理并输出响应的步骤包括:判断用户输入的第二语音数据与配置信息中的目标数据是否匹配;若匹配则获取当前时刻的状态数据;将第二语音数据与状态数据发送给服务器,以便服务器根据状态数据对第二语音数据进行识别处理并返回响应指令;以及根据响应指令输出响应给用户。Optionally, in the method according to the present invention, the step of processing the second voice data input by the user based on the configuration information and outputting a response includes: determining whether the second voice data input by the user matches the target data in the configuration information; If it matches, the state data at the current moment is obtained; the second voice data and the state data are sent to the server, so that the server recognizes the second voice data according to the state data and returns a response instruction; and outputs a response to the user according to the response instruction.
可选地,根据本发明的方法还包括接收用户输入的第三语音数据的步骤:检测用户输入的第三语音数据中是否包含预定对象;以及若第三语音数据中包含预定对象,则进入交互状态。Optionally, the method according to the present invention further includes the step of receiving the third voice data input by the user: detecting whether the third voice data input by the user contains a predetermined object; and if the third voice data contains the predetermined object, enter the interaction status.
可选地,在根据本发明的方法中,将用户输入的第一语音数据发送给服务器,以便服务器根据第一语音数据确认交互场景的步骤包括:响应于用户输入第一语音数据,获取当前时刻的状态数据;以及将第一语音数据与状态数据发送给服务器,以便服务器根据状态数据对所述第一语音数据进行识别处理并返回响应指令,其中响应指令中还包括交互场景。Optionally, in the method according to the present invention, the step of sending the first voice data input by the user to the server, so that the server confirms the interaction scenario according to the first voice data includes: in response to the user inputting the first voice data, obtaining the current time State data of the server; and sending the first voice data and the state data to the server, so that the server performs recognition processing on the first voice data according to the state data and returns a response command, where the response command also includes an interactive scene.
可选地,在根据本发明的方法中,基于交互场景获取配置信息的步骤还包括:根据响应指令输出响应给用户。Optionally, in the method according to the present invention, the step of acquiring configuration information based on the interaction scenario further includes: outputting a response to the user according to the response instruction.
可选地,根据本发明的方法还包括步骤:响应于用户切换交互场景的请求,转发请求至服务器,以便服务器确认待切换的交互场景;判断是否关闭切换前的交互场景;若关闭切换前的交互场景,则基于切换后的交互场景得到配置信息;以及若不关闭切换前的交互场景,则基于切换前的交互场景和切换后的交互场景得到配置信息。Optionally, the method according to the present invention further includes the steps of: in response to the user's request to switch the interactive scene, forward the request to the server, so that the server confirms the interactive scene to be switched; determine whether to close the interactive scene before switching; For the interactive scene, the configuration information is obtained based on the interactive scene after switching; and if the interactive scene before switching is not closed, the configuration information is obtained based on the interactive scene before switching and the interactive scene after switching.
According to another aspect of the present invention, a voice interaction method is provided, including the steps of: determining an interaction scene according to first voice data input by a user; acquiring configuration information based on the interaction scene; and processing second voice data input by the user based on the acquired configuration information and outputting a response.
According to yet another aspect of the present invention, a voice interaction apparatus is provided, including: a connection management unit adapted to receive first voice data input by a user and send it to a server, so that the server confirms an interaction scene according to the first voice data; an information acquisition unit adapted to acquire configuration information based on the interaction scene; and a data processing unit adapted to process second voice data input by the user based on the acquired configuration information and output a response.
Optionally, the apparatus according to the present invention further includes an information storage unit adapted to pre-store configuration information for each interaction scene, where the configuration information contains at least one piece of target data for use in that interaction scene.
Optionally, in the apparatus according to the present invention, the data processing unit further includes a judgment module adapted to judge whether the second voice data input by the user matches the target data in the configuration information; the information acquisition unit is further adapted to acquire state data at the current moment when the second voice data matches the target data; the connection management unit is further adapted to send the second voice data and the state data to the server and to receive the response instruction returned by the server after it performs recognition processing on the second voice data according to the state data; and the connection management unit is further adapted to output a response to the user according to the response instruction.
Optionally, in the apparatus according to the present invention, the connection management unit is further adapted to receive third voice data input by the user; the apparatus further includes a detection unit adapted to detect whether the third voice data input by the user contains a predetermined object, and to enter the interactive state when the third voice data contains the predetermined object.
Optionally, in the apparatus according to the present invention, the connection management unit is further adapted to, in response to a user request to switch the interaction scene, forward the request to the server so that the server confirms the interaction scene to be switched to; the information acquisition unit further includes a decision module adapted to determine whether to close the pre-switch interaction scene; and the information acquisition unit is further adapted to obtain configuration information based on the post-switch interaction scene when the pre-switch interaction scene is closed, and to obtain configuration information based on both the pre-switch and post-switch interaction scenes when the pre-switch interaction scene is not closed.
According to still another aspect of the present invention, a voice interaction system is provided, including: a client including the voice interaction apparatus described above; and a server adapted to receive voice data and state data from the client and to determine the client's interaction scene based on the state data and the voice data.
Optionally, in the system according to the present invention, the server is further adapted to perform recognition processing on the voice data according to the state data and return a response instruction to the client.
Optionally, in the system according to the present invention, the client is a smart speaker.
According to still another aspect of the present invention, a smart speaker is provided, including: an interface unit adapted to receive first voice data input by a user; and an interaction control unit adapted to determine an interaction scene according to the first voice data input by the user and acquire configuration information based on the interaction scene, the interaction control unit being further adapted to process second voice data based on the configuration information and output a response.
According to still another aspect of the present invention, a computing device is provided, including: at least one processor; and a memory storing program instructions, where the program instructions are configured to be executed by the at least one processor and include instructions for performing any of the methods described above.
According to still another aspect of the present invention, a readable storage medium storing program instructions is provided; when the program instructions are read and executed by a computing device, they cause the computing device to perform any of the methods described above.
According to the voice interaction method of the present invention, upon receiving first voice data input by a user, the client forwards the first voice data to the server, which confirms the interaction scene; the client then acquires configuration information according to that interaction scene. Within that interaction scene, as long as voice data input by the user matches the target data in the configuration information, the client is woken directly for voice interaction. Compared with existing voice interaction solutions, this solution reduces interaction cost and improves user experience.
The above description is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the contents of this specification, and in order to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
BRIEF DESCRIPTION OF THE DRAWINGS
To achieve the above and related objects, certain illustrative aspects are described herein in conjunction with the following description and drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the drawings. Throughout this disclosure, like reference numerals generally refer to like components or elements.
FIG. 1 shows a schematic diagram of a scenario of a voice interaction system 100 according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention;
FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention;
FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention; and
FIG. 5 shows a schematic diagram of a voice interaction apparatus 500 according to an embodiment of the present invention.
DETAILED DESCRIPTION
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
FIG. 1 shows a schematic diagram of a scenario of a voice interaction system 100 according to an embodiment of the present invention. As shown in FIG. 1, the system 100 includes a client 110 and a server 120. It should be noted that the system 100 shown in FIG. 1 is only an example; those skilled in the art will understand that in practical applications the system 100 typically includes multiple clients 110 and servers 120, and the present invention places no limit on the number of clients 110 and servers 120 included in the system 100.
The client 110 is a smart device having a voice interaction apparatus (e.g., the voice interaction apparatus 500 according to an embodiment of the present invention), which can receive voice instructions issued by a user and return voice or non-voice information to the user. A typical voice interaction apparatus includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor. The voice interaction apparatus may be built into the client 110, or it may be used as an independent module in cooperation with the client 110 (e.g., communicating with the client 110 via an API or by other means to invoke functions or applications on the client 110); embodiments of the present invention place no limit on this. The client 110 may be, for example, a mobile device with a voice interaction apparatus (e.g., a smart speaker), a smart robot, or a smart home appliance (including a smart TV, smart refrigerator, smart microwave oven, etc.), but is not limited thereto. One application scenario for the client 110 is the home: the client 110 is placed in the user's home, and the user can issue voice instructions to the client 110 to accomplish certain functions, such as browsing the Internet, requesting songs, shopping, checking the weather forecast, controlling other smart home devices in the house, and so on.
The server 120 communicates with the client 110 over a network and may be, for example, a cloud server physically located at one or more sites. The server 120 provides recognition services for voice data received at the client 110 to obtain a text representation of the voice data input by the user; the server 120 also derives a representation of the user's intent from the text representation, generates a response instruction, and returns it to the client 110. The client 110 performs the corresponding operation according to the response instruction to provide the user with the corresponding service, such as setting an alarm, making a call, sending an email, broadcasting news, or playing songs and videos. Of course, the client 110 may also output a corresponding voice response to the user according to the response instruction; embodiments of the present invention place no limit on this.
According to some embodiments, in the client 110, the microphone of the voice interaction module continuously receives external sound. When the user wants to use the client 110 for voice interaction, the user must first speak the corresponding wake word to wake the client 110 (more specifically, to wake the voice interaction module in the client 110 by inputting the wake word) so that it enters the interactive state. After the client 110 finishes one voice interaction, if the user wants to interact with the client 110 again, the wake word must be input again to wake the client 110.
Some voice interaction sessions are shown below by way of example. Here, the fixed wake word is set to "Elf".
User: Elf.
Client: I'm here, go ahead.
User: I want to hear **'s songs.
Client: OK, about to play **'s songs for you.
User: Elf.
Client: I'm here, go ahead.
User: Turn the volume to 50.
Client: OK, the volume has been set to 50.
User: Elf, favorite this song.
Client: OK, the song has been added to favorites.
User: Elf, play my favorites.
Client: OK, about to play your favorites.
User: Elf.
Client: I'm here, go ahead.
User: Next song.
Client: OK.
User: Elf, previous song.
Client: OK.
As can be seen from the example above, at every interaction the user must first input the wake word and then the corresponding voice instruction. That is, every time the user wants to instruct the client 110 to perform an operation, the wake word must be input once more. For the user, this mode of interaction is too cumbersome. To lower the interaction cost and reduce the user's repeated input of the wake word, in the system 100 according to the present invention, one or more pieces of target data that the user is likely to use in each interaction scene are preset according to the interaction scenes in which the user conducts voice interaction with the client 110, and these constitute the configuration information for each interaction scene. In other words, the configuration information contains the interaction templates corresponding to the various interaction scenes. According to embodiments of the present invention, in a given interaction scene the user does not need to input the wake word repeatedly to interact with the client 110; it suffices that the input voice instruction contains target data for that interaction scene.
For example, in the song-listening interaction scene shown in the example above, the target data may be: "previous song", "next song", "favorite this song", "louder", "pause playback", "resume playback", "what song is this", and so on; these pieces of target data constitute the configuration information corresponding to the song-listening interaction scene. It should be noted that the above is only an example, and embodiments of the present invention are not limited thereto. In a specific embodiment, a piece of target data may be set to "louder" or to "turn up the volume", and so on.
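To make the notion of per-scene configuration information concrete, the following is a minimal sketch in Python, assuming a plain mapping from scene names to target phrases; the scene names and phrases shown are illustrative stand-ins, not values fixed by this disclosure.

```python
# Minimal sketch of per-scene configuration information: each interaction
# scene maps to the target data (phrases) usable without a wake word.
# Scene names and phrases are illustrative assumptions.
SCENE_CONFIG = {
    "music": [
        "previous song",
        "next song",
        "favorite this song",
        "louder",
        "pause playback",
        "resume playback",
        "what song is this",
    ],
    "weather": [
        "what about tomorrow",
        "will it rain today",
    ],
}

def get_config(scene: str) -> list[str]:
    """Return the target data configured for the given interaction scene."""
    return SCENE_CONFIG.get(scene, [])
```

With this shape, adding a new scene or a new target phrase is a data change rather than a code change, which matches the description of configuration information being preset (at the factory or by the user).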
According to embodiments of the present invention, after receiving voice data input by the user, the client 110 also acquires the state data of the client 110 at the current moment and transmits it to the server 120 together with the voice data. The state data of the client 110 is, for example, the state in which the user is operating some application or similar software on the client 110. For example, the user may be using an application to play streaming video, or the user may be using some social software to chat with a particular person; but the state data is not limited to these.
While generating the response instruction, the server 120 may also perform scene analysis based on the state data and the voice data to confirm the interaction scene that the user expects to enter when inputting the voice data. For example, if the user inputs the voice data "I want to watch a show" and the server 120 confirms from the state data that music player software is currently in use on the client 110, the server 120 can essentially determine that the user expects to enter the video-playing interaction scene. As another example, if the user inputs the voice data "What's the weather like in Hangzhou right now", the server 120 can essentially confirm that the user expects to enter the weather-forecast interaction scene.
The server 120 returns the confirmed interaction scene to the client 110 together with the response instruction. The client 110 acquires the corresponding configuration information according to that interaction scene. In this way, within that interaction scene, the client 110 only needs to judge whether voice data input by the user matches the target data in the configuration information, and if it does, the client outputs a response directly.
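One way to picture the client-side fast path this enables is the sketch below, which reuses the `get_config()` mapping from the earlier sketch; the exact-match test is a deliberate simplification (a similarity-based test is sketched at step S307 further down), and the function name is an assumption.

```python
def matches_active_scene(utterance: str, active_scene: str) -> bool:
    """Return True when the recognized utterance matches a target phrase of
    the currently active scene's configuration, in which case the client may
    skip the wake word and forward the request (with state data) directly."""
    targets = get_config(active_scene)  # per-scene config, sketched above
    return any(utterance.strip().lower() == t.lower() for t in targets)
```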
Again taking the song-listening interaction scene from the example above, in the voice interaction system 100 according to the present invention, the voice interaction between the user and the client 110 can be streamlined to:
User: Elf.
Client: I'm here, go ahead.
User: I want to hear **'s songs.
Client: OK, about to play **'s songs for you.
User: Turn the volume to 50.
Client: OK, the volume has been set to 50.
User: Favorite this song.
Client: OK, the song has been added to favorites.
User: Play my favorites.
Client: OK, about to play your favorites.
User: Next song.
Client: OK.
User: Previous song.
Client: OK.
Comparing this with the interaction shown earlier, it can be seen that with the voice interaction system 100 of the present invention, as long as the current interaction scene has not been switched away from, the client 110 remains in the interactive state, and the user can input voice instructions directly to instruct the client 110 to perform the corresponding operations. In this way, the system 100 does not need to repeat interaction flows that have already been executed (for example, the flow of waking the client 110), which lowers the interaction cost and improves the user experience.
Taking the client 110 implemented as a smart speaker as an example, the voice interaction solution according to an embodiment of the present invention is outlined below.
In addition to its basic configuration, the smart speaker according to an embodiment of the present invention further includes an interface unit and an interaction control unit. The interface unit receives first voice data input by the user; the interaction control unit determines the interaction scene according to the first voice data input by the user and acquires configuration information based on that interaction scene; the interaction control unit can also process second voice data based on the configuration information and output a response.
For a detailed description of the voice interaction process of the smart speaker, refer to the related descriptions of FIG. 3 above and below; it is not repeated here.
It should be noted that in other embodiments according to the present invention, the server 120 may also be implemented as another electronic device connected to the client 110 via a network (e.g., another computing device in the same IoT environment). Indeed, where the client 110 has sufficient storage space and computing power, the server 120 may even be implemented as the client 110 itself.
According to embodiments of the present invention, both the client 110 and the server 120 may be implemented by a computing device 200 as described below. FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention.
As shown in FIG. 2, in the basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processors 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level-1 cache 210 and a level-2 cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, the system memory 206 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some embodiments, the applications 222 may be arranged to execute instructions on the operating system by the one or more processors 204 using the program data 224.
The computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., an output device 242, a peripheral interface 244, and a communication device 246) to the basic configuration 202 via a bus/interface controller 230. Example output devices 242 include a graphics processing unit 248 and an audio processing unit 250, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 252. An example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication via one or more I/O ports 258 with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printers, scanners). An example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
The computing device 200 may be implemented as a server, such as a file server, database server, application server, or web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 200 may also be implemented as part of a small-form-factor portable (or mobile) electronic device. In embodiments according to the present invention, the computing device 200 is configured to perform the voice interaction method according to the present invention, and the applications 222 of the computing device 200 contain a plurality of program instructions for performing the voice interaction method 300 according to the present invention.
FIG. 3 shows an interaction flowchart of a voice interaction method 300 according to an embodiment of the present invention. The interaction method 300 is suitable for execution in the system 100 described above. It should be noted that, for clarity in the description below, the voice data (or voice instructions) input by the user are distinguished as third voice data (voice data used to wake the client 110, generally containing the predetermined object, i.e., the wake word), first voice data (voice data containing a general instruction, input by the user after the client 110 has been woken), and second voice data (voice data input by the user after the interaction scene has been confirmed, generally containing target data). It should be understood, however, that all of these are voice data input by the user, and the present invention is not limited in this regard.
As shown in FIG. 3, the method 300 starts at step S301.
In step S301, the client 110 receives third voice data input by the user and detects whether it contains a predetermined object (e.g., a predetermined wake word); if it contains the predetermined object, the client enters the interactive state.
In embodiments according to the present invention, the third voice data is generally used to wake the client 110 and put it in the interactive state. It should be noted that the predetermined object may be preset when the client 110 leaves the factory, or it may be set by the user while using the client 110; the present invention places no limit on the length or content of the predetermined object.
In one embodiment, when the client 110 detects that the third voice data contains the predetermined object, it responds to the user by playing a voice prompt; for example, the client 110 plays the voice "Hello, please speak" to inform the user that the client 110 is now in the interactive state and voice interaction can begin.
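A minimal sketch of the detection in step S301 might look as follows, assuming the wake word has already been recognized into text; the wake word value is illustrative (the disclosure leaves it to the factory or the user).

```python
WAKE_WORD = "elf"  # the predetermined object; factory-set or user-configured

def contains_wake_word(third_voice_text: str) -> bool:
    """Step S301 sketch: the client enters the interactive state only when
    the third voice data contains the predetermined object (wake word)."""
    return WAKE_WORD in third_voice_text.lower()
```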
Subsequently, in step S302, the client 110 receives first voice data input by the user and, in response to the user input, acquires the state data of the client 110 at the current moment.
The state data of the client 110 may include any obtainable information on the client 110. In some embodiments, the state data of the client 110 includes one or more of the following: the client's process data, the client's application list, application usage history on the client, personal data of the user associated with the client, data obtained from at least one sensor of the client (such as the client's location information or environmental information), and text data in the client's display interface, but is not limited thereto.
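An illustrative snapshot of such state data is shown below; every field is optional, and the key names and example values are assumptions made for illustration only, not terms fixed by the disclosure.

```python
# Illustrative state-data snapshot gathered in step S302.
state_data = {
    "processes": ["music_player", "news_app"],            # process data
    "app_list": ["music_player", "news_app", "camera"],   # application list
    "app_history": [("music_player", "2019-12-01T10:00")],# usage history
    "user_profile": {"preferred_genre": "pop"},           # associated user data
    "location": {"city": "Hangzhou"},                     # sensor reading
    "screen_text": "Now playing: ...",                    # display-interface text
}
```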
Subsequently, in step S303, the client 110 sends the first voice data from the user together with the local state data to the server 120.
Subsequently, in step S304, the server 120 performs recognition processing on the first voice data according to the received state data.
In embodiments according to the present invention, the server 120's recognition processing of the first voice data can be divided into two parts.
First, the server 120 recognizes the first voice data using ASR (Automatic Speech Recognition) technology. The server 120 may first represent the first voice data as text data and then perform word segmentation on the text data to obtain a text representation of the first voice data (it should be noted that the voice data may also be represented in other ways; embodiments of the present invention are not limited to text representations). Typical ASR methods include, for example, methods based on vocal tract models and speech knowledge, template matching methods, and neural network methods; embodiments of the present invention place no particular restriction on which ASR method is used, and any such algorithm, known now or in the future, may be combined with embodiments of the present invention to implement the method 300.
It should be noted that when performing recognition with ASR technology, the server 120 may also perform some preprocessing on the first voice data, such as sampling, quantization, removing voice data that contains no speech content (e.g., silence), and framing and windowing the voice data. These operations are not elaborated further here.
Then, the server 120 processes the text representation in combination with the state data to understand the user's intent and finally obtain a representation of that intent. In some embodiments, the server 120 may use NLP (Natural Language Processing) methods to understand the first voice data input by the user and ultimately identify the user's intent; the user's intent usually corresponds to an actual operation, such as playing music or viewing contacts. In other embodiments, the server 120 may further determine the parameters of the user's intent, such as exactly which song or which singer's songs to play. Embodiments of the present invention place no particular restriction on which NLP algorithm is used to understand user intent, and any such algorithm, known now or in the future, may be combined with embodiments of the present invention to implement the method 300.
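The two-part server-side processing can be sketched as a pipeline, as below. Since the disclosure fixes no particular ASR or NLP algorithm, the sketch takes the two stages as injected callables rather than naming any concrete library; the function and key names are assumptions.

```python
from typing import Callable

def recognize(
    voice_data: bytes,
    state: dict,
    asr: Callable[[bytes], str],        # speech -> segmented text representation
    nlu: Callable[[str, dict], dict],   # text + state -> intent with parameters
) -> dict:
    """Step S304 sketch: ASR first, then intent understanding in combination
    with the client's state data, yielding the intent and the inferred scene."""
    text = asr(voice_data)
    intent = nlu(text, state)
    scene = intent.get("scene", "unknown")  # e.g. "music", "video", "weather"
    return {"intent": intent, "scene": scene}
```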
In embodiments according to the present invention, the server 120 determines the current interaction scene by analyzing the user's intent. The interaction scene characterizes the interaction scene the client 110 is currently in, or (according to the user's input) is about to be in. The interaction scene may be, for example, a phone-call scene, a text-message scene, a music scene, a video scene, or a news-browsing scene.
In embodiments according to the present invention, after performing the above recognition on the first voice data, the server 120 generates a response instruction.
On the one hand, the response instruction contains the response to the user's intent and the specific execution parameters. For example, if the first voice data input by the user is "I want to listen to a song", the response instruction generated by the server 120 contains a play instruction. The response instruction may also contain corresponding text data used to reply to the voice data input by the user; for example, the response instruction contains the text data "OK, about to play for you". In addition, the response instruction may contain execution parameters for the play instruction, such as a playlist, the cover art of the songs to be played, or a download address, without being limited thereto.
On the other hand, the response instruction also contains the interaction scene. For example, if the server 120 determines through its analysis that the interaction scene corresponding to "I want to listen to a song" is the "music scene", then in addition to the above parameters, the response instruction generated by the server 120 also contains "music scene".
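Putting the two aspects together, a response instruction for "I want to listen to a song" might be serialized as follows; the field names and the URL are illustrative assumptions chosen only to mirror the description above.

```python
# Illustrative response instruction combining the reply, execution
# parameters, and the confirmed interaction scene.
response_instruction = {
    "action": "play_music",                          # response to the intent
    "reply_text": "OK, about to play for you",       # text to speak via TTS
    "params": {
        "playlist": ["song-id-1", "song-id-2"],      # execution parameters
        "cover_url": "https://example.com/cover.jpg" # hypothetical address
    },
    "scene": "music",                                # the interaction scene
}
```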
Subsequently, in step S305, the server 120 returns the above response instruction to the client 110.
Then, in step S306, the client 110, on the one hand, acquires configuration information based on the interaction scene in the received response instruction.
As described above, configuration information for each interaction scene is preset on the client 110, and the configuration information contains at least one piece of target data for use in that interaction scene. For a detailed description of the configuration information, refer to the description of FIG. 1 above; it is not repeated here.
On the other hand, the client 110 outputs a response to the user according to the relevant instructions and execution parameters in the response instruction. For example, the client 110 converts the text data contained in the response instruction into voice data using TTS technology and replies to the user by voice, "OK, about to play for you"; at the same time, the client 110 executes the play instruction and plays songs for the user. In still other embodiments, the client 110 may also download the corresponding songs, cover art, etc., according to the execution parameters; these are not enumerated here.
Next, in step S307, the client 110 receives second voice data input by the user and judges whether the second voice data input by the user matches the target data in the configuration information.
According to embodiments of the present invention, if the second voice data matches at least one piece of target data in the configuration information, the client enters the interactive state directly. That is, the client 110 can be woken without the user inputting the predetermined object again. Continuing the example above: in the music scene, the user inputs the second voice data "next song"; after judging that this second voice data matches target data of the music scene, the client 110 enters the interactive state directly.
It should be noted that embodiments of the present invention place no particular restriction on the method used to judge whether the second voice data matches the target data. For example, a person skilled in the art may compute the degree of match between the second voice data and the target data in any manner, and when the degree of match is higher than a preset value, the two are judged to match.
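One possible form of such a match test, over the text representation of the second voice data, is sketched below. The use of `difflib.SequenceMatcher` and the 0.8 threshold are illustrative choices only; the disclosure deliberately leaves the matching method and the preset value open.

```python
from difflib import SequenceMatcher

def matches(utterance: str, target: str, threshold: float = 0.8) -> bool:
    """Step S307 sketch: compute a similarity score between the recognized
    utterance and one target phrase, and treat scores at or above a preset
    threshold as a match."""
    score = SequenceMatcher(None, utterance.lower(), target.lower()).ratio()
    return score >= threshold
```

For example, `matches("next song please", "next song")` scores well above the threshold, so the client would enter the interactive state without a wake word.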
In the interactive state, the client 110 acquires the state data at the current moment as described in step S302; this is not repeated here.
Subsequently, in step S308, the client 110 sends the second voice data and the state data to the server 120.
Next, in step S309, the server 120 recognizes the corresponding second voice data according to the received state data. In embodiments according to the present invention, the processing of the second voice data is identical to the processing of the first voice data, so for the details of recognition, refer to the description of step S304; they are not repeated here.
Subsequently, in step S310, the server 120 returns a response instruction to the client 110, and the client 110 outputs a response to the user according to the response instruction.
Thereafter, as long as the second voice data input by the user matches the target data in the configuration information, i.e., the interaction remains in the current interaction scene, the process of steps S307 to S310 is repeated.
According to some embodiments, the method 300 may also involve switching interaction scenes.
In step S311, in response to a user request to switch the interaction scene, the client 110 forwards the request to the server 120.
In embodiments according to the present invention, a user request to switch the interaction scene may be triggered in the following two ways.
In one embodiment, the user again sends third voice data containing the predetermined object to the client 110. When the client 110 detects the predetermined object, it determines that the user wants to switch the interaction scene. For example, in the music scene, the user input "Elf, check the current weather for me" triggers a request to switch the interaction scene.
In another embodiment, the user switches the display interface of the client 110 so that the client 110 jumps to another application or service. For example, in the video scene, the display interface of the client 110 is the video playback interface; when the user switches the display interface to the picture-taking interface, a request to switch the interaction scene is triggered.
In the subsequent step S312, the server 120 confirms the interaction scene to be switched to, and in the subsequent step S313 it returns a response instruction.
The server 120 may analyze the interaction scene the user wants to switch to according to the third voice data input by the user. For example, from the user input "Elf, check the current weather for me", the server 120 can determine that the interaction scene to be switched to is the weather-query scene.
In addition, in response to the request to switch the interaction scene, the client 110 likewise collects the state data at the current moment and sends it to the server 120 together with the request. In this way, the server 120 can use the state data to perform scene analysis and confirm the interaction scene to be switched to. For example, when the display interface of the client 110 is switched from the video playback interface to the picture-taking interface, the server 120 can determine that the interaction scene to be switched to is the picture-taking scene.
Of course, the server 120 may also combine the state data with the third voice data input by the user to perform scene analysis and confirm the interaction scene to be switched to. For the details of this part, refer to the description of step S304 above; they are not repeated here.
After confirming the interaction scene to be switched to, the server 120 generates a corresponding response instruction for the client 110, which outputs a response to the user, for example switching to the application the user wishes to open. For a description of response instructions, see above; it is not expanded here.
Meanwhile, in step S314, the client 110 determines whether to close the pre-switch interaction scene.
According to embodiments of the present invention, the client 110 uses the state data to determine whether to close the pre-switch interaction scene. The client 110 obtains the data of the currently executing processes and judges as follows: if the executing process data does not include the process data corresponding to the pre-switch interaction scene, the previous process has already been terminated, so the pre-switch interaction scene is closed; if the executing process data still includes the process data corresponding to the pre-switch interaction scene, the previous process is still running, so the pre-switch interaction scene is not closed.
Next, when it is determined that the pre-switch interaction scene should be closed, configuration information is obtained based on the post-switch interaction scene. When it is determined that the pre-switch interaction scene should not be closed, configuration information is obtained based on both the pre-switch and post-switch interaction scenes; that is, the configuration information corresponding to the post-switch interaction scene is acquired while the original configuration information is retained. For the details of acquiring configuration information, refer to the description of step S306 above; they are not repeated here.
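The close-or-keep decision and the resulting configuration load can be sketched as follows, reusing the `get_config()` helper from the earlier sketch; the `scene_process` mapping from scenes to process names is an illustrative assumption, since the disclosure only says the judgment is made from the executing process data.

```python
def targets_after_switch(
    old_scene: str,
    new_scene: str,
    running_processes: list[str],
    scene_process: dict[str, str],  # assumed scene -> backing process name
) -> list[str]:
    """Steps S313/S314 sketch: if the process backing the pre-switch scene is
    no longer running, load only the post-switch scene's configuration;
    otherwise keep both scenes' target data active."""
    if scene_process.get(old_scene) in running_processes:
        return get_config(old_scene) + get_config(new_scene)
    return get_config(new_scene)
```

In the music-plus-news example below, the music player keeps running in the background, so the target data of both scenes would remain usable at once.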
In real application scenarios, users often browse news or chat while listening to music. Imagine a scenario in which the user is playing songs through an audio playback application (i.e., the client 110 is in the music scene). The user inputs the third voice data "Elf, I want to see the latest news", and the client 110 opens a news application on the client 110 according to the response instruction. The display interface of the client 110 then jumps to the news application, but the audio playback application can still play songs in the background. By collecting and analyzing state data, the client 110 finally determines that the pre-switch interaction scene (i.e., the music scene) need not be closed, and therefore obtains configuration information based on both the pre-switch interaction scene and the post-switch interaction scene (i.e., the news-browsing scene). That is, the configuration information for the music scene currently in use on the client 110 is retained, and the configuration information for the news-browsing scene is additionally acquired. Thereafter, the user can use the target data from both sets of configuration information to interact with the client 110 by voice.
Based on the above description, with the voice interaction solution of the present invention, the client combines information such as local state and user habits to preset different configuration information for different interaction scenes, so that in each interaction scene the client can be woken quickly; that is, it can respond directly to a user voice instruction containing no wake word (i.e., no predetermined object).
According to the voice interaction method 300 of the present invention, upon receiving first voice data input by the user, the client 110 forwards the first voice data to the server 120, which confirms the interaction scene; the client 110 then acquires configuration information according to that interaction scene. Within that interaction scene, as long as voice data input by the user matches the target data in the configuration information, the client 110 is woken directly for voice interaction. Compared with existing voice interaction solutions, this solution has the advantages of fast response and low cost. In addition, the server 120 performs scene analysis based on the state data from the client 110, tightly coupling the recognition of voice data with the current state and interaction scene of the client 110, which can significantly improve recognition accuracy.
The execution of the method 300 involves the various components of the system 100. To this end, FIG. 4 shows a schematic flowchart of a voice interaction method 400 according to another embodiment of the present invention. The method 400 shown in FIG. 4 is suitable for execution in the client 110 and further illustrates the method shown in FIG. 3.
As shown in FIG. 4, the method 400 includes step S410: sending first voice data input by the user to the server 120, so that the server 120 confirms the interaction scene according to the first voice data.
It should be noted that, after receiving the first voice data input by the user, the client 110 may also determine the current interaction scene locally from the first voice data. For example, the client 110 confirms the current interaction scene according to the current state data (e.g., the application currently in use, without being limited thereto). Embodiments of the present invention place no limit on this.
Subsequently, in step S420, configuration information is acquired based on the interaction scene. Then, in step S430, second voice data input by the user is processed based on the acquired configuration information and a response is output.
In addition, the method 400 further includes a step of presetting the configuration information for each interaction scene. According to embodiments of the present invention, the configuration information contains at least one piece of target data for use in that interaction scene; with this target data, the client can be woken directly into the interactive state. The target data in the configuration information may be set in light of the client's own state, the user's input preferences, and so on; it may be preset at the factory or set by the user while using the client 110. Embodiments of the present invention place no limit on this.
Of course, before the first voice data is received at the client 110, there is also a process of switching the client 110 from the sleep state to the interactive state according to third voice data input by the user, as well as, in the interactive state, a step of reloading the configuration information in response to a user request to switch the interaction scene. For a description of the entire voice interaction flow, refer to the detailed explanation of the method 300 above; for brevity it is not repeated here.
为配合图3~图4的相关描述进一步说明客户端110,图5示出了根据本发明一个实施例的驻留在客户端110中的语音数据识别装置500的示意图。To further illustrate the client 110 in conjunction with the related descriptions of FIGS. 3 to 4, FIG. 5 shows a schematic diagram of a voice data recognition device 500 residing in the client 110 according to an embodiment of the present invention.
如图5所示,除基本的配置外,语音交互装置500至少包括:信息存储单元510、连接管理单元520、信息获取单元530、数据处理单元540。As shown in FIG. 5, in addition to the basic configuration, the voice interaction device 500 includes at least: an information storage unit 510, a connection management unit 520, an information acquisition unit 530, and a data processing unit 540.
根据一种实施方式,信息存储单元510预先存储各交互场景下的配置信息,其中,配置信息中包含至少一条用于在该交互场景下使用的目标数据。连接管理单元520用于实现语音交互装置500的各种输入/输出操作,例如,接收用户输入的第一语音数据并发送给服务器120,以便服务器120根据第一语音数据确认交互场景。信息获取单元530基于交互场景获取配置信息。数据处理单元540基于所获取的配置信息对用户输入的第二语音数据进行处理并输出响应。According to an embodiment, the information storage unit 510 pre-stores configuration information in each interaction scenario, where the configuration information includes at least one piece of target data for use in the interaction scenario. The connection management unit 520 is used to implement various input/output operations of the voice interaction device 500, for example, receiving the first voice data input by the user and sending it to the server 120, so that the server 120 confirms the interaction scene according to the first voice data. The information acquisition unit 530 acquires configuration information based on the interaction scenario. The data processing unit 540 processes the second voice data input by the user based on the acquired configuration information and outputs a response.
在一些实施例中,数据处理单元540还包括判断模块542,适于判断用户输入的第二语音数据与配置信息中的目标数据是否匹配。在第二语音数据与目标数据相匹配时,信息获取单元530获取当前时刻的状态数据。连接管理单元520将第二语音数据与状态数据发送给服务器120,并接收该服务器120根据状态数据对第二语音数据进行识别处理后返回的响应指令。最后,连接管理单元520还会根据该响应指令输出响应给用户。In some embodiments, the data processing unit 540 further includes a judgment module 542 adapted to judge whether the second voice data input by the user matches the target data in the configuration information. When the second voice data matches the target data, the information acquisition unit 530 acquires the state data at the current moment. The connection management unit 520 sends the second voice data and the status data to the server 120, and receives a response instruction returned by the server 120 after performing recognition processing on the second voice data according to the status data. Finally, the connection management unit 520 also outputs a response to the user according to the response instruction.
当然,连接管理单元520还用于接收用户输入的第三语音数据。Of course, the connection management unit 520 is also used to receive the third voice data input by the user.
语音交互装置500除了上述各部分外,还包括检测单元(未示出)。检测单元检测用户输入的第三语音数据中是否包含预定对象,客户端110在第三语音数据包含预定对象时进入交互状态。The voice interaction device 500 includes a detection unit (not shown) in addition to the above-mentioned parts. The detection unit detects whether the third voice data input by the user includes a predetermined object, and the client 110 enters an interactive state when the third voice data includes the predetermined object.
在又一些实施例中,连接管理单元520还可以响应用户切换交互场景的请求,转发请求给服务器120,以便服务器120确认待切换的交互场景。进一步地,信息获取单元530还包括判决模块532,该判决模块532用于判断是否关闭切换前的交互场景。若经判断后确认要关闭切换前的交互场景,则信息获取单元530基于切换后的交互场景得到配置信息;若经判断后确认不关闭切换前的交互场景,则信息获取单元530基于切换前的交互场景和切换后的交互场景得到配置信息。In still other embodiments, the connection management unit 520 may also respond to the user's request to switch the interactive scene, and forward the request to the server 120, so that the server 120 confirms the interactive scene to be switched. Further, the information acquisition unit 530 further includes a decision module 532, which is used to determine whether to close the interaction scene before the handover. If it is determined after judgment that the interaction scene before switching is to be closed, the information acquisition unit 530 obtains configuration information based on the interaction scene after switching; if it is determined that the interaction scene before switching is not to be closed after judgment, the information acquisition unit 530 is based on pre-switching The configuration information is obtained in the interactive scenario and the switched interactive scenario.
For a detailed description of the operations performed by the various parts of the voice interaction apparatus 500, refer to the related descriptions of FIG. 1, FIG. 3, and FIG. 4 above; they are not repeated here.
The various techniques described herein may be implemented in hardware, in software, or in a combination of the two. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in a tangible medium, such as a removable hard disk, a USB flash drive, a floppy disk, a CD-ROM, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the method of the present invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, readable media comprise readable storage media and communication media. Readable storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, the algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the examples of the present invention. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above descriptions of specific languages are given to disclose the best mode of the invention.
In the specification provided here, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid in the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the various features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or may alternatively be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may be divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components in an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all of the features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all of the processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or as combinations of method elements, that can be implemented by a processor of a computer system or by other means of performing the function. Thus, a processor having the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, the elements of an apparatus embodiment described herein are examples of means for carrying out the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as described herein. Moreover, it should be noted that the language used in this specification has been principally selected for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative rather than restrictive, the scope of the invention being defined by the appended claims.

Claims (16)

  1. A voice interaction method, comprising the steps of:
    sending first voice data input by a user to a server, so that the server confirms an interaction scenario according to the first voice data;
    acquiring configuration information based on the interaction scenario; and
    processing second voice data input by the user based on the acquired configuration information and outputting a response.
  2. The method of claim 1, further comprising the step of:
    pre-setting configuration information for each interaction scenario,
    wherein the configuration information contains at least one piece of target data to be used in that interaction scenario.
  3. The method of claim 2, wherein the step of processing the second voice data input by the user based on the configuration information and outputting a response comprises:
    determining whether the second voice data input by the user matches the target data in the configuration information;
    if it matches, acquiring state data at the current moment;
    sending the second voice data and the state data to the server, so that the server performs recognition processing on the second voice data according to the state data and returns a response instruction; and
    outputting a response to the user according to the response instruction.
  4. The method of any one of claims 1-3, wherein, before the step of sending the first voice data input by the user to the server so that the server confirms the interaction scenario according to the first voice data, the method further comprises the steps of receiving third voice data input by the user:
    detecting whether the third voice data input by the user contains a predetermined object; and
    entering an interactive state if the third voice data contains the predetermined object.
  5. The method of any one of claims 1-4, wherein the step of sending the first voice data input by the user to the server so that the server confirms the interaction scenario according to the first voice data comprises:
    acquiring state data at the current moment in response to the user inputting the first voice data; and
    sending the first voice data and the state data to the server, so that the server performs recognition processing on the first voice data according to the state data and returns a response instruction,
    wherein the response instruction further includes the interaction scenario.
  6. The method of claim 5, wherein the step of acquiring configuration information based on the interaction scenario further comprises:
    outputting a response to the user according to the response instruction.
  7. The method of any one of claims 1-6, further comprising the steps of:
    in response to a user request to switch the interaction scenario, forwarding the request to the server, so that the server confirms the interaction scenario to be switched to;
    determining whether to close the pre-switch interaction scenario;
    if the pre-switch interaction scenario is closed, obtaining configuration information based on the post-switch interaction scenario; and
    if the pre-switch interaction scenario is not closed, obtaining configuration information based on both the pre-switch interaction scenario and the post-switch interaction scenario.
  8. A voice interaction method, comprising the steps of:
    determining an interaction scenario according to first voice data input by a user;
    acquiring configuration information based on the interaction scenario; and
    processing second voice data input by the user based on the acquired configuration information and outputting a response.
  9. A voice interaction apparatus, comprising:
    a connection management unit adapted to receive first voice data input by a user and send it to a server, so that the server confirms an interaction scenario according to the first voice data;
    an information acquisition unit adapted to acquire configuration information based on the interaction scenario; and
    a data processing unit adapted to process second voice data input by the user based on the acquired configuration information and output a response.
  10. The apparatus of claim 9, further comprising:
    an information storage unit adapted to pre-store configuration information for each interaction scenario, wherein the configuration information contains at least one piece of target data to be used in that interaction scenario.
  11. A voice interaction system, comprising:
    a client comprising the voice interaction apparatus of claim 9 or 10; and
    a server adapted to receive voice data and state data from the client and to determine the interaction scenario of the client based on the state data and the voice data.
  12. The system of claim 11, wherein
    the server is further adapted to perform recognition processing on the voice data according to the state data and to return a response instruction to the client.
  13. The system of claim 11 or 12, wherein the client is a smart speaker.
  14. A smart speaker, comprising:
    an interface unit adapted to receive first voice data input by a user; and
    an interaction control unit adapted to determine an interaction scenario according to the first voice data input by the user and to acquire configuration information based on the interaction scenario, the interaction control unit being further adapted to process second voice data based on the configuration information and to output a response.
  15. A computing device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any one of claims 1-8.
  16. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method of any one of claims 1-8.
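Claims 8, 11 and 12 describe the same exchange from the server side. The following sketch pairs with the client sketch given in the description above; the keyword table and the canned response instruction are placeholders for real scene classification and speech recognition, not the patent's required server behavior.

class InteractionServer:
    """Sketch of server 120: maps first voice data to a scenario and
    recognizes later utterances with the help of client state data."""

    SCENE_KEYWORDS = {"music": "music", "navigation": "navigate"}  # assumed

    def confirm_scene(self, first_voice_data):
        # Determine the interaction scenario from the first voice data.
        for scene, keyword in self.SCENE_KEYWORDS.items():
            if keyword in first_voice_data.lower():
                return scene
        return "default"

    def recognize(self, voice_data, state_data):
        # Recognition processing conditioned on the client's state data;
        # the returned dict stands in for the response instruction.
        return {
            "speech": f"Recognized '{voice_data}' in scene {state_data['scene']}",
            "scene": state_data["scene"],
        }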
PCT/CN2019/122934 2018-12-11 2019-12-04 Voice interaction method, device and system WO2020119542A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811513712.8A CN111312235B (en) 2018-12-11 2018-12-11 Voice interaction method, device and system
CN201811513712.8 2018-12-11

Publications (1)

Publication Number Publication Date
WO2020119542A1 true WO2020119542A1 (en) 2020-06-18

Family

ID=71075824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122934 WO2020119542A1 (en) 2018-12-11 2019-12-04 Voice interaction method, device and system

Country Status (3)

Country Link
CN (1) CN111312235B (en)
TW (1) TW202025138A (en)
WO (1) WO2020119542A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210147027A (en) * 2019-05-06 2021-12-06 구글 엘엘씨 Proactive caching of assistant action content on client devices to enable on-device resolution of speech or input utterances
CN112084768A (en) * 2020-08-06 2020-12-15 珠海格力电器股份有限公司 Multi-round interaction method and device and storage medium
CN112104533B (en) * 2020-09-14 2023-02-17 深圳Tcl数字技术有限公司 Scene switching method, terminal and storage medium
CN112397061B (en) * 2020-11-04 2023-10-27 中国平安人寿保险股份有限公司 Online interaction method, device, equipment and storage medium
CN112463106A (en) * 2020-11-12 2021-03-09 深圳Tcl新技术有限公司 Voice interaction method, device and equipment based on intelligent screen and storage medium
CN113411459B (en) * 2021-06-10 2022-11-11 品令科技(北京)有限公司 Remote voice interaction system and method controlled by initiator
CN114356275B (en) * 2021-12-06 2023-12-29 上海小度技术有限公司 Interactive control method and device, intelligent voice equipment and storage medium


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4537755B2 (en) * 2004-04-30 2010-09-08 株式会社日立製作所 Spoken dialogue system
CN101453688B (en) * 2007-12-04 2010-07-14 中兴通讯股份有限公司 Method for fast responding scene switching in mobile stream media service
CN103413549B (en) * 2013-07-31 2016-07-06 深圳创维-Rgb电子有限公司 The method of interactive voice, system and interactive terminal
CN104240698A (en) * 2014-09-24 2014-12-24 上海伯释信息科技有限公司 Voice recognition method
CN105719649B (en) * 2016-01-19 2019-07-05 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN107507615A (en) * 2017-08-29 2017-12-22 百度在线网络技术(北京)有限公司 Interface intelligent interaction control method, device, system and storage medium
CN108091333B (en) * 2017-12-28 2021-11-30 Oppo广东移动通信有限公司 Voice control method and related product
CN108521500A (en) * 2018-03-13 2018-09-11 努比亚技术有限公司 A kind of voice scenery control method, equipment and computer readable storage medium
CN108597509A (en) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145329A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 Apparatus control method, device and smart machine
CN107977183A (en) * 2017-11-16 2018-05-01 百度在线网络技术(北京)有限公司 voice interactive method, device and equipment
CN108337362A (en) * 2017-12-26 2018-07-27 百度在线网络技术(北京)有限公司 Voice interactive method, device, equipment and storage medium
CN108509619A (en) * 2018-04-04 2018-09-07 科大讯飞股份有限公司 A kind of voice interactive method and equipment
CN108538298A (en) * 2018-04-04 2018-09-14 科大讯飞股份有限公司 voice awakening method and device
CN108874967A (en) * 2018-06-07 2018-11-23 腾讯科技(深圳)有限公司 Dialogue state determines method and device, conversational system, terminal, storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017667A (en) * 2020-09-04 2020-12-01 华人运通(上海)云计算科技有限公司 Voice interaction method, vehicle and computer storage medium
CN112017667B (en) * 2020-09-04 2024-03-15 华人运通(上海)云计算科技有限公司 Voice interaction method, vehicle and computer storage medium
CN113370923A (en) * 2021-07-23 2021-09-10 深圳市元征科技股份有限公司 Vehicle configuration adjusting method and device, electronic equipment and storage medium
CN113370923B (en) * 2021-07-23 2023-11-03 深圳市元征科技股份有限公司 Vehicle configuration adjusting method and device, electronic equipment and storage medium
CN113838464A (en) * 2021-09-24 2021-12-24 浪潮金融信息技术有限公司 Intelligent voice interaction system, method and medium
CN114884974A (en) * 2022-04-08 2022-08-09 海南车智易通信息技术有限公司 Data multiplexing method, system and computing equipment
CN114884974B (en) * 2022-04-08 2024-02-23 海南车智易通信息技术有限公司 Data multiplexing method, system and computing device

Also Published As

Publication number Publication date
TW202025138A (en) 2020-07-01
CN111312235A (en) 2020-06-19
CN111312235B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
WO2020119542A1 (en) Voice interaction method, device and system
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
TWI511125B Voice control method, mobile terminal apparatus and voice control system
EP3734596B1 (en) Determining target device based on speech input of user and controlling target device
KR102429436B1 (en) Server for seleting a target device according to a voice input, and controlling the selected target device, and method for operating the same
WO2020119541A1 (en) Voice data identification method, apparatus and system
WO2020119569A1 (en) Voice interaction method, device and system
WO2020244573A1 (en) Voice instruction processing method and device, and control system
EP2919472A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
US20150222948A1 (en) Multimedia Device Voice Control System and Method, and Computer Storage Medium
CN109309751B (en) Voice recording method, electronic device and storage medium
CN110740262A (en) Background music adding method and device and electronic equipment
JP6619488B2 (en) Continuous conversation function in artificial intelligence equipment
CN109410950B (en) Voice control method and system of cooking equipment
JP6449991B2 (en) Media file processing method and terminal
CN112702633A (en) Multimedia intelligent playing method and device, playing equipment and storage medium
CN112233676A (en) Intelligent device awakening method and device, electronic device and storage medium
WO2021051588A1 (en) Data processing method and apparatus, and apparatus used for data processing
CN108881766B (en) Video processing method, device, terminal and storage medium
US10693944B1 (en) Media-player initialization optimization
CN109658924B (en) Session message processing method and device and intelligent equipment
KR102584324B1 (en) Method for providing of voice recognition service and apparatus thereof
CN111862965A (en) Awakening processing method and device, intelligent sound box and electronic equipment
WO2019154282A1 (en) Household appliance and voice recognition method, control method and control device thereof
JP2019053774A (en) Method and terminal for processing media file

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19895194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19895194

Country of ref document: EP

Kind code of ref document: A1