CN112037786A - Voice interaction method, device, equipment and storage medium

Voice interaction method, device, equipment and storage medium

Info

Publication number
CN112037786A
CN112037786A (application CN202010896268.3A)
Authority
CN
China
Prior art keywords
voice
preset word
determining
word
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010896268.3A
Other languages
Chinese (zh)
Other versions
CN112037786B (en)
Inventor
金鹿
黄荣升
张刚
薛军涛
朱凯华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010896268.3A
Publication of CN112037786A
Application granted
Publication of CN112037786B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a voice interaction method, apparatus, device and storage medium, relating to the fields of smart home and artificial intelligence. The specific implementation scheme is as follows: monitoring the voice of a user in real time; recognizing the voice and determining whether it includes a first preset word; in response to determining that the voice includes the first preset word, determining whether the context information following the first preset word in the voice includes a second preset word; in response to determining that the context information of the first preset word includes the second preset word, performing intent recognition on the context information following the second preset word; and controlling the device according to the intent recognition result so as to respond to the user. This implementation makes the device's interaction process more adaptive and the user experience friendlier.

Description

Voice interaction method, device, equipment and storage medium
Technical Field
The application relates to the field of computer technology, in particular to the fields of smart home and artificial intelligence, and specifically to a voice interaction method, apparatus, device and storage medium.
Background
With the continuous development of artificial intelligence technology, terminal device control systems based on voice wake-up have also developed rapidly. As the entrance for controlling a terminal device, voice wake-up has gradually become a research hotspot in the field of artificial intelligence.
At present, a user can wake up a terminal device by voice and control it to execute corresponding operations, which brings great convenience. However, since different users have different wake-up habits, how to adapt the terminal device to these different habits is a problem to be solved.
Disclosure of Invention
A voice interaction method, apparatus, device and storage medium are provided.
According to a first aspect, there is provided a voice interaction method, comprising: monitoring the voice of a user in real time; recognizing the voice and determining whether the voice includes a first preset word; in response to determining that the voice includes the first preset word, determining whether the context information following the first preset word in the voice includes a second preset word; in response to determining that the context information of the first preset word includes the second preset word, performing intent recognition on the context information following the second preset word; and controlling a device according to the intent recognition result so as to respond to the user.
According to a second aspect, there is provided a voice interaction apparatus, comprising: a real-time monitoring unit configured to monitor the voice of a user in real time; a voice recognition unit configured to recognize the voice and determine whether the voice includes a first preset word; a determining unit configured to, in response to determining that the voice includes the first preset word, determine whether the context information following the first preset word in the voice includes a second preset word; an intention recognition unit configured to, in response to determining that the context information of the first preset word includes the second preset word, perform intention recognition on the context information following the second preset word; and a device control unit configured to control the device according to the intention recognition result so as to respond to the user.
According to a third aspect, there is provided a voice interaction electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
According to the technology of the application, the technical problem that existing terminal-device wake-up methods cannot adapt well to the wake-up habits of different users is solved, making the device's interaction process more adaptive and the user experience friendlier.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a voice interaction method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a voice interaction method according to the present application;
FIG. 4 is a flow diagram of another embodiment of a voice interaction method according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of a voice interaction device according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a voice interaction method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the voice interaction method or voice interaction apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include intelligent end devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the intelligent terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the intelligent terminal device 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a voice recognition application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the intelligent terminal devices 101, 102, 103.
The intelligent terminal devices 101, 102, 103 may be hardware or software. When they are hardware, they can be various electronic devices with a voice recognition function, including but not limited to smart phones, smart speakers, smart robots, and the like. When they are software, they can be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server that provides various services, such as a background server that processes speech acquired by the smart terminal apparatuses 101, 102, 103. The backend server may analyze and otherwise process the data such as the voice, and feed back the processing result (e.g., response data) to the smart terminal apparatus 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the voice interaction method provided by the embodiment of the present application is generally executed by the intelligent terminal devices 101, 102, and 103. Accordingly, the voice interaction device is generally disposed in the intelligent terminal apparatus 101, 102, 103.
It should be understood that the number of intelligent end devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of intelligent end devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a voice interaction method according to the present application is shown. The voice interaction method of the embodiment comprises the following steps:
Step 201, monitoring the voice of the user in real time.
In this embodiment, the execution body of the voice interaction method (for example, the intelligent terminal devices 101, 102, 103 shown in fig. 1) may monitor the voice of the user in real time. Specifically, the execution body may be provided with a microphone array for collecting and analyzing the user's voice in real time.
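As a rough illustration of this step, the sketch below shows a monitoring loop of the kind described above. `capture_utterance` and `recognize` are hypothetical placeholders for the microphone-array front end and the speech recognizer, since the patent names no concrete components.

```python
# A minimal sketch of the real-time monitoring loop of step 201.
# capture_utterance() and recognize() are hypothetical placeholders for the
# microphone-array front end and the speech recognizer assumed by the patent.

def monitor_loop(capture_utterance, recognize, handle_text):
    """Continuously capture user speech and hand recognized text downstream."""
    while True:
        audio = capture_utterance()   # blocks until one utterance is captured
        if audio is None:             # e.g., the device is shutting down
            break
        text = recognize(audio)       # ASR: raw audio -> text
        if text:
            handle_text(text)         # steps 202 onward
```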
Step 202, recognizing the voice, and determining whether the voice includes a first preset word.
After collecting the user's voice, the execution body may recognize it and determine whether it includes a first preset word. Specifically, the execution body may perform speech recognition on the voice to obtain the corresponding text, and then determine whether the text includes the first preset word. Here, the first preset word may be a part of the wake-up word of the intelligent terminal device, for example the first two characters of the wake-up word. For instance, if the wake-up word is "small A small B", the first preset word may be "small A".
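A minimal sketch of this check, assuming recognized text as input; the value "small A" is taken from the Fig. 3 scenario for illustration, not prescribed by the patent.

```python
# Sketch of step 202: look for the first preset word in the recognized text.
# FIRST_PRESET_WORD is an illustrative value, not taken from the patent.

FIRST_PRESET_WORD = "small A"

def find_first_preset_word(text: str, word: str = FIRST_PRESET_WORD) -> int:
    """Return the index just past the first preset word, or -1 if absent."""
    pos = text.find(word)
    return pos + len(word) if pos >= 0 else -1
```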
Step 203, in response to determining that the voice includes the first preset word, determining whether the context information following the first preset word in the voice includes the second preset word.
In this embodiment, if the voice includes the first preset word, the execution body may determine whether the context information following the first preset word in the voice includes the second preset word. Here, the second preset word may be another part of the wake-up word, for example its last two characters. It is understood that the first preset word and the second preset word may be the same or different.
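Continuing the sketch, the context following the first preset word is searched for the second preset word; "small B" is again only the scenario's illustrative value.

```python
# Sketch of step 203: check the context following the first preset word for
# the second preset word. SECOND_PRESET_WORD is an illustrative value.

SECOND_PRESET_WORD = "small B"

def find_second_preset_word(context: str,
                            word: str = SECOND_PRESET_WORD) -> int:
    """Return the index just past the second preset word within the context
    that follows the first preset word, or -1 if it does not appear."""
    pos = context.find(word)
    return pos + len(word) if pos >= 0 else -1
```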
Step 204, in response to determining that the context information of the first preset word includes the second preset word, performing intent recognition on the context information following the second preset word.
The execution body may perform intent recognition on the context information following the second preset word in the voice if the context information of the first preset word includes the second preset word. Specifically, the execution body may use an existing algorithm for intent recognition; for example, it may input the text following the second preset word into a preset intent recognition model and take the model's output as the recognized intent.
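The patent leaves the intent-recognition model unspecified; the sketch below substitutes a trivial keyword table as a stand-in so the running example stays self-contained.

```python
# Sketch of step 204. A real system would feed the text after the second
# preset word into a trained intent-recognition model; the keyword table
# below is only a stand-in so the example can run end to end.

from typing import Optional

INTENT_KEYWORDS = {
    "turn off the light": "light_off",
    "turn on the light": "light_on",
    "play music": "play_music",
}

def recognize_intent(command_text: str) -> Optional[str]:
    """Map command text to an intent label (placeholder for a model)."""
    normalized = command_text.strip().lower()
    for phrase, intent in INTENT_KEYWORDS.items():
        if phrase in normalized:
            return intent
    return None
```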
Step 205, controlling the device according to the intent recognition result so as to respond to the user.
In this embodiment, the execution body may control the device according to the intent recognition result so as to respond to the user. For example, if the intent recognition result is "turn off the light", the execution body may generate a turn-off command to switch off the lamp, thereby responding to the user.
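A sketch of this dispatch step, where `send_command` is a hypothetical smart-home transport; the patent only states that a command is generated and sent to the device.

```python
# Sketch of step 205: turn the recognized intent into a device command.
# send_command() is a hypothetical transport to the controlled device.

from typing import Optional

def respond_to_intent(intent: Optional[str], send_command) -> None:
    """Dispatch a control command for the recognized intent."""
    if intent == "light_off":
        send_command(device="lamp", action="off")
    elif intent == "light_on":
        send_command(device="lamp", action="on")
    elif intent == "play_music":
        send_command(device="speaker", action="play")
    # unknown intents fall through without a command
```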
With continued reference to FIG. 3, a schematic diagram of one application scenario of the voice interaction method according to the present application is shown. In the application scenario of fig. 3, the user says "small A small B turn off the light" to the smart speaker. After receiving the voice, the smart speaker recognizes that it includes the first preset word "small A" and that the context information of "small A" includes the second preset word "small B". Intent recognition is then performed on the text "turn off the light" following the second preset word "small B", yielding the intent "turn off the light". The smart speaker then sends a turn-off command to the lamp to switch off the electric light.
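Walking the sketches above through this scenario (assuming the functions defined in the earlier sketches are in scope; all names and values are illustrative):

```python
# End-to-end run of the Fig. 3 scenario using the sketch functions above.

text = "small A small B turn off the light"

after_first = find_first_preset_word(text)        # past "small A"
context = text[after_first:]                      # " small B turn off the light"
after_second = find_second_preset_word(context)   # past "small B"
command = context[after_second:]                  # " turn off the light"
intent = recognize_intent(command)                # -> "light_off"
respond_to_intent(intent, send_command=lambda **kw: print("command:", kw))
# prints: command: {'device': 'lamp', 'action': 'off'}
```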
The voice interaction method provided by this embodiment of the application makes the device's interaction process more adaptive and the user experience friendlier.
With continued reference to FIG. 4, a flow 400 of another embodiment of a voice interaction method according to the present application is shown. As shown in fig. 4, the voice interaction method of the present embodiment may include the following steps:
Step 401, monitoring the voice of the user in real time.
Step 402, recognizing the voice, and determining whether the voice includes a first preset word.
Step 403, in response to determining that the voice includes the first preset word, determining whether the context information following the first preset word in the voice includes the second preset word.
Step 404, in response to determining that the context information of the first preset word includes the second preset word, performing intent recognition on the context information following the second preset word.
The principle of steps 401 to 404 is similar to that of steps 201 to 204, and is not described herein again.
Step 405, in response to determining that the context information of the first preset word does not include the second preset word, performing intent recognition on the context information of the first preset word.
In this embodiment, if the context information following the first preset word in the voice does not include the second preset word, the execution body may perform intent recognition on that context information directly. The absence of the second preset word indicates that the user is accustomed to waking the smart device with only the first part of the preset wake-up word, so intent recognition can proceed on the context information of the first preset word.
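Combining the branches of steps 403 to 405 into one hedged sketch, reusing the helper functions introduced earlier:

```python
# Sketch of the branching of steps 403-405: fall back to the text after the
# first preset word when the second preset word is absent. Reuses the sketch
# functions defined earlier.

from typing import Optional

def handle_wake(text: str) -> Optional[str]:
    after_first = find_first_preset_word(text)
    if after_first < 0:
        return None                            # no wake word at all: ignore
    context = text[after_first:]
    after_second = find_second_preset_word(context)
    if after_second >= 0:
        return recognize_intent(context[after_second:])  # step 404
    return recognize_intent(context)                     # step 405 fallback
```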
Step 406, controlling the device to respond to the user according to the intent recognition result.
Step 407, determining the user's interaction habit with the device according to the voice.
In this embodiment, the execution body may also determine the user's interaction habit with the device from the voice. Here, an interaction habit can be understood as the control-instruction pattern the user most often employs when interacting with the device, for example "first preset word + instruction", or "first preset word + second preset word + pause + instruction".
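One plausible way to realize this, sketched below, is to tally the wake pattern of each utterance and take the most frequent one as the habit. The pattern labels are illustrative assumptions; the patent only requires that the habit be derived from the monitored speech.

```python
# Sketch of step 407: tally which wake pattern each utterance follows so the
# most frequent one can be taken as the user's interaction habit. Reuses the
# sketch functions defined earlier; pattern labels are illustrative.

from collections import Counter

habit_counter: Counter = Counter()

def record_interaction(text: str, paused_after_wake: bool) -> None:
    after_first = find_first_preset_word(text)
    if after_first < 0:
        return
    context = text[after_first:]
    if find_second_preset_word(context) >= 0:
        pattern = ("first+second+pause+instruction" if paused_after_wake
                   else "first+second+instruction")
    else:
        pattern = "first+instruction"
    habit_counter[pattern] += 1

def current_habit() -> str:
    """Return the most frequent pattern seen so far."""
    return habit_counter.most_common(1)[0][0] if habit_counter else "unknown"
```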
In some optional implementations of this embodiment, step 407 may be realized by the following step, not shown in fig. 4: in response to determining that the context information of the first preset word includes the second preset word, determining that the combination of the first preset word and the second preset word is the user's common wake-up word for the device.
In this implementation, if the context information of the first preset word is determined to include the second preset word, the execution body considers the wake-up word used by the user to be "first preset word + second preset word" and may treat this combination as the user's common wake-up word for the device.
In some optional implementations of this embodiment, step 407 may be realized by the following step, not shown in fig. 4: in response to determining that the context information of the first preset word does not include the second preset word, determining that the first preset word is the user's common wake-up word for the device.
In this implementation, if the execution body determines that the context information of the first preset word does not include the second preset word, it concludes that the user habitually wakes the device with the first preset word alone, which may therefore be treated as the user's common wake-up word for the device.
In some optional implementations of this embodiment, step 407 may be realized by the following step, not shown in fig. 4: determining, from the voice, the pause time after the common wake-up word.
In this implementation, the execution body may also determine the pause time after the wake-up word. If the pause time is long, the device can respond to the user immediately, for example by outputting the voice "I'm here", in order to improve the user experience.
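A sketch of the pause measurement, assuming word-level timestamps from the recognizer; both timestamps are assumptions, since the patent only says the pause time is determined from the voice.

```python
# Sketch of measuring the pause after the common wake-up word. The two
# timestamps are assumed to come from word-level ASR timing.

def pause_after_wake_word(wake_end_ts: float, command_start_ts: float) -> float:
    """Return the silent gap, in seconds, between wake word and command."""
    return max(0.0, command_start_ts - wake_end_ts)
```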
Step 408, outputting response information according to the interaction habit.
The execution body may output the response information according to the interaction habit. For example, if the habit is "first preset word + second preset word + pause + instruction", the execution body may output the response information as soon as the user has spoken "first preset word + second preset word", and then control the device once the user speaks the instruction.
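A sketch of habit-conditioned responding; `speak` is a hypothetical TTS output, and the 0.5 s threshold is an assumption rather than a value from the patent.

```python
# Sketch of step 408: if the learned habit includes a pause after the wake-up
# word, acknowledge at once (e.g., "I'm here") and wait for the instruction.
# speak() is a hypothetical TTS output; the threshold is an assumed value.

PAUSE_THRESHOLD_S = 0.5

def respond_by_habit(habit: str, pause_s: float, speak) -> bool:
    """Return True if the device should acknowledge now and await a command."""
    if "pause" in habit and pause_s >= PAUSE_THRESHOLD_S:
        speak("I'm here")
        return True
    return False
```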
According to the voice interaction method provided by this embodiment of the application, response information can be output according to the user's interaction habit with the device, which improves the adaptability of the device and the user experience.
With further reference to fig. 5, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of a voice interaction apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 5, the voice interaction apparatus 500 of the present embodiment includes: a real-time monitoring unit 501, a voice recognition unit 502, a judgment unit 503, an intention recognition unit 504, and an apparatus control unit 505.
A real-time monitoring unit 501 configured to monitor the voice of the user in real time.
The speech recognition unit 502 is configured to recognize the speech and determine whether the speech includes a first preset word.
The determining unit 503 is configured to, in response to determining that the speech includes the first preset word, determine whether the context information following the first preset word in the speech includes a second preset word.
The intention recognition unit 504 is configured to, in response to determining that the context information of the first preset word includes the second preset word, perform intention recognition on the context information following the second preset word.
A device control unit 505 configured to control the device to respond to the user according to the intention recognition result.
In some optional implementations of this embodiment, the intention recognition unit 504 may be further configured to: in response to determining that the context information of the first preset word does not include the second preset word, perform intention recognition on the context information of the first preset word.
In some optional implementations of this embodiment, the apparatus 500 may further include an interaction habit determination unit, not shown in fig. 5, configured to: determine the user's interaction habit with the device according to the voice; and output response information according to the interaction habit.
In some optional implementations of this embodiment, the interaction habit determination unit is further configured to: in response to determining that the context information of the first preset word includes the second preset word, determine that the combination of the first preset word and the second preset word is the user's common wake-up word for the device.
In some optional implementations of this embodiment, the interaction habit determination unit is further configured to: in response to determining that the context information of the first preset word does not include the second preset word, determine that the first preset word is the user's common wake-up word for the device.
In some optional implementations of this embodiment, the interaction habit determination unit is further configured to: determine, from the voice, the pause time after the common wake-up word.
It should be understood that units 501 to 505 recited in the voice interaction apparatus 500 correspond to respective steps in the method described with reference to fig. 2. Thus, the operations and features described above for the voice interaction method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for performing a voice interaction method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, a memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of performing voice interaction provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method of performing voice interaction provided by the present application.
The memory 602, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for performing voice interaction in the embodiment of the present application (for example, the real-time monitoring unit 501, the voice recognition unit 502, the determination unit 503, the intention recognition unit 504, and the device control unit 505 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 602, that is, implementing the method of performing voice interaction in the above-described method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device performing voice interaction, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, which may be connected through a network to an electronic device performing voice interactions. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device performing the voice interaction method may further include: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603 and the output device 604 may be connected by a bus or other means, and fig. 6 illustrates the connection by a bus as an example.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice-interaction electronic device, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, or a joystick. The output device 604 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the application, the technical problem that existing terminal-device wake-up methods cannot adapt well to the wake-up habits of different users is solved, making the device's interaction process more adaptive and the user experience friendlier.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A voice interaction method, comprising:
monitoring the voice of a user in real time;
recognizing the voice, and determining whether the voice comprises a first preset word or not;
in response to determining that the voice includes a first preset word, determining whether context information following the first preset word in the voice includes a second preset word;
in response to determining that the context information of the first preset word includes the second preset word, performing intent recognition on the context information following the second preset word;
and controlling the device to respond to the user according to the intent recognition result.
2. The method of claim 1, wherein the method further comprises:
in response to determining that the context information of the first preset word does not include the second preset word, performing intent recognition on the context information of the first preset word.
3. The method of claim 2, wherein the method further comprises:
determining the interaction habit of the user with the device according to the voice;
and outputting response information according to the interaction habit.
4. The method of claim 3, wherein the determining the interaction habit of the user with the device according to the voice comprises:
in response to determining that the context information of the first preset word includes the second preset word, determining that the combination of the first preset word and the second preset word is the user's common wake-up word for the device.
5. The method of claim 3, wherein the determining the interaction habit of the user with the device according to the voice comprises:
in response to determining that the context information of the first preset word does not include the second preset word, determining that the first preset word is the user's common wake-up word for the device.
6. The method of claim 4 or 5, wherein the determining the interaction habit of the user with the device according to the voice comprises:
determining the pause time after the common wake-up word according to the voice.
7. A voice interaction device, comprising:
a real-time monitoring unit configured to monitor a voice of a user in real time;
a voice recognition unit configured to recognize the voice and determine whether the voice includes a first preset word;
a determining unit configured to, in response to determining that the voice includes the first preset word, determine whether context information following the first preset word in the voice includes a second preset word;
an intention recognition unit configured to, in response to determining that the context information of the first preset word includes the second preset word, perform intention recognition on the context information following the second preset word;
and a device control unit configured to control the device to respond to the user according to the intention recognition result.
8. The apparatus of claim 7, wherein the intent recognition unit is further configured to:
in response to determining that the context information of the first preset word does not include the second preset word, perform intent recognition on the context information of the first preset word.
9. The apparatus of claim 7, wherein the apparatus further comprises an interaction habit determination unit configured to:
determine the interaction habit of the user with the device according to the voice;
and output response information according to the interaction habit.
10. The apparatus of claim 9, wherein the interaction habit determination unit is further configured to:
in response to determining that the context information of the first preset word includes the second preset word, determine that the combination of the first preset word and the second preset word is the user's common wake-up word for the device.
11. The apparatus of claim 9, wherein the interaction habit determination unit is further configured to:
in response to determining that the context information of the first preset word does not include the second preset word, determine that the first preset word is the user's common wake-up word for the device.
12. The apparatus according to claim 10 or 11, wherein the interaction habit determination unit is further configured to:
and determine, from the voice, the pause time after the common wake-up word.
13. A voice interactive electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
CN202010896268.3A 2020-08-31 2020-08-31 Voice interaction method, device, equipment and storage medium Active CN112037786B (en)

Priority Applications (1)

Application Number: CN202010896268.3A
Priority Date: 2020-08-31
Filing Date: 2020-08-31
Title: Voice interaction method, device, equipment and storage medium


Publications (2)

CN112037786A: published 2020-12-04
CN112037786B: published 2024-09-24

Family

ID=73586455

Family Applications (1)

Application Number: CN202010896268.3A (status: Active)
Priority Date: 2020-08-31
Filing Date: 2020-08-31
Title: Voice interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112037786B (en)


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714815A (en) * 2013-12-09 2014-04-09 何永 Voice control method and device thereof
CN108986801A (en) * 2017-06-02 2018-12-11 腾讯科技(深圳)有限公司 A kind of man-machine interaction method, device and human-computer interaction terminal
JP2019109510A (en) * 2017-12-18 2019-07-04 ネイバー コーポレーションNAVER Corporation Method and system for controlling artificial intelligence device using plural wake words
CN108132805A (en) * 2017-12-20 2018-06-08 深圳Tcl新技术有限公司 Voice interactive method, device and computer readable storage medium
CN109994106A (en) * 2017-12-29 2019-07-09 阿里巴巴集团控股有限公司 A kind of method of speech processing and equipment
CN110299137A (en) * 2018-03-22 2019-10-01 腾讯科技(深圳)有限公司 Voice interactive method and device
CN109147779A (en) * 2018-08-14 2019-01-04 苏州思必驰信息科技有限公司 Voice data processing method and device
US20200105274A1 (en) * 2018-09-27 2020-04-02 Snackable Inc. Audio content processing systems and methods
CN111063356A (en) * 2018-10-17 2020-04-24 北京京东尚科信息技术有限公司 Electronic equipment response method and system, sound box and computer readable storage medium
CN109410952A (en) * 2018-10-26 2019-03-01 北京蓦然认知科技有限公司 A kind of voice awakening method, apparatus and system
CN109377998A (en) * 2018-12-11 2019-02-22 科大讯飞股份有限公司 A kind of voice interactive method and device
CN109509470A (en) * 2018-12-11 2019-03-22 平安科技(深圳)有限公司 Voice interactive method, device, computer readable storage medium and terminal device
CN109584878A (en) * 2019-01-14 2019-04-05 广东小天才科技有限公司 Voice awakening method and system
US20200258513A1 (en) * 2019-02-08 2020-08-13 Sonos, Inc. Devices, systems, and methods for distributed voice processing
CN110634468A (en) * 2019-09-11 2019-12-31 中国联合网络通信集团有限公司 Voice wake-up method, device, equipment and computer readable storage medium
CN110534102A (en) * 2019-09-19 2019-12-03 北京声智科技有限公司 A kind of voice awakening method, device, equipment and medium
CN111261160A (en) * 2020-01-20 2020-06-09 联想(北京)有限公司 Signal processing method and device

Also Published As

CN112037786B: 2024-09-24

Similar Documents

Publication Number Title
CN111428008B (en) Method, apparatus, device and storage medium for training a model
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN111680517B (en) Method, apparatus, device and storage medium for training model
CN111640426A (en) Method and apparatus for outputting information
CN112382294B (en) Speech recognition method, device, electronic equipment and storage medium
CN112382279B (en) Voice recognition method and device, electronic equipment and storage medium
CN111862940A (en) Earphone-based translation method, device, system, equipment and storage medium
CN112530419B (en) Speech recognition control method, device, electronic equipment and readable storage medium
CN111443801B (en) Man-machine interaction method, device, equipment and storage medium
CN111966212A (en) Multi-mode-based interaction method and device, storage medium and smart screen device
CN111309283A (en) Voice control method and device for user interface, electronic equipment and storage medium
CN112634890B (en) Method, device, equipment and storage medium for waking up playing equipment
JP7257434B2 (en) Voice interaction method, voice interaction device, electronic device, storage medium and computer program product
CN111681647A (en) Method, apparatus, device and storage medium for recognizing word slot
CN111709252A (en) Model improvement method and device based on pre-trained semantic model
CN112133307A (en) Man-machine interaction method and device, electronic equipment and storage medium
CN111883127A (en) Method and apparatus for processing speech
CN112466296A (en) Voice interaction processing method and device, electronic equipment and storage medium
CN112652304B (en) Voice interaction method and device of intelligent equipment and electronic equipment
CN112382292A (en) Voice-based control method and device
CN111986682A (en) Voice interaction method, device, equipment and storage medium
CN112037794A (en) Voice interaction method, device, equipment and storage medium
CN111192581A (en) Voice wake-up method, device and storage medium
CN112037786B (en) Voice interaction method, device, equipment and storage medium
CN114861675A (en) Method and device for semantic recognition and method and device for generating control instruction

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right
Effective date of registration: 20210517
Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing
Applicant after: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.
Applicant after: Shanghai Xiaodu Technology Co.,Ltd.
Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing
Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.
GR01: Patent grant
GR01 Patent grant