CN111696534B - Voice interaction device and system, device control method, computing device and medium - Google Patents

Info

Publication number
CN111696534B
Authority
CN
China
Prior art keywords
voice
information
voice interaction
module
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910199373.9A
Other languages
Chinese (zh)
Other versions
CN111696534A (en)
Inventor
杨昔水
胡聪钢
雷京颢
李奋
黄启生
李岳冰
刘兆健
刘畅
风翮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910199373.9A priority Critical patent/CN111696534B/en
Publication of CN111696534A publication Critical patent/CN111696534A/en
Application granted granted Critical
Publication of CN111696534B publication Critical patent/CN111696534B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Abstract

A voice interaction device, a voice interaction system, and a device control method are disclosed. The voice interaction device receives device information for one or more devices from an internet of things server and stores it in a local storage module. The voice interaction module performs voice interaction with the user. A voice processing module local to the voice interaction device recognizes entity information and intention information from the voice received by the voice interaction module. A device management module local to the voice interaction device sends an instruction corresponding to the intention information to the device corresponding to the entity information. By deploying the speech recognition, natural language understanding, and internet of things control instruction encapsulation modules locally on the voice interaction device, internet of things device management is localized, network overhead on the cloud link is reduced, and there is no need to wait for a cloud speech recognition result. The response time from the user's utterance to device control is thereby shortened, response speed is improved, and user experience is improved.

Description

Voice interaction device and system, device control method, computing device and medium
Technical Field
The present disclosure relates to the field of voice interaction, and in particular, to a voice interaction device, a voice interaction system, and a device control method using a voice control device.
Background
With the popularization and development of the internet of things, it has become common to control various internet of things devices by voice.
Typically, a terminal-side voice interaction device uploads the received voice to the cloud, and speech recognition and natural language understanding are performed using the powerful processing capability of the cloud. The internet of things server then generates a control instruction according to the natural language understanding result and transmits it to the corresponding internet of things device.
While the user is speaking, the voice interaction device uploads the audio to the cloud, where text is recognized and output by a cloud Automatic Speech Recognition (ASR) engine.
However, after the user finishes speaking, the voice interaction device must determine algorithmically whether the utterance has actually ended. Typically, the voice interaction device waits about 1 second and then turns off the microphone.
The cloud ASR text is then passed to a cloud Natural Language Understanding (NLU) engine, which outputs Natural Language Processing (NLP) information.
When the user's voice relates to internet of things device control, the cloud sends the NLP information to an IOT server. The IOT server encapsulates the instruction and issues the control instruction to the corresponding internet of things device.
However, analysis of the audio by the cloud ASR engine is time-consuming, and waiting at the terminal for the cloud ASR text output adds latency as well as network overhead.
In addition, IOT device control must pass through the IOT service, and each application on this link adds further delay.
Thus, when an IOT device is controlled through a voice interaction device, the response tends to be slow.
Therefore, a new voice interaction scheme is needed to improve the response speed and the user experience.
Disclosure of Invention
The present invention aims to provide a control scheme for a voice interaction device that can improve response speed and user experience.
According to a first aspect of the present disclosure, there is provided a voice interaction device comprising: a storage module for storing device information of one or more devices; the voice interaction module is used for carrying out voice interaction with a user; the voice processing module is used for recognizing entity information and intention information in the voice received from the voice interaction module; and the device management module is used for sending an instruction corresponding to the intention information to the device corresponding to the entity information according to the device information stored by the storage module.
Optionally, the voice processing module includes: a voice recognition module for recognizing the voice as text; a confidence determination module for determining whether the confidence of the recognized text reaches a predetermined confidence threshold; and a natural language understanding module for analyzing the recognized text to obtain entity information and intention information when the confidence reaches the confidence threshold.
Optionally, the voice processing module further identifies attribute information associated with the entity information and/or the intention information from the voice, and the device management module determines a device corresponding to the entity information according to the attribute information and/or generates an instruction corresponding to the intention information based on the attribute information.
Optionally, the device information of the device includes an instruction protocol for instructions of the device, and the device management module generates the instructions according to the instruction protocol.
Optionally, the voice interaction device further comprises: and the communication module is used for communicating with an Internet of things server for managing one or more devices and receiving device information of the one or more devices from the Internet of things server.
Optionally, the device management module searches the storage module for device information of the device corresponding to the entity information, and when such device information is found, the device management module generates an instruction and sends the instruction to the device corresponding to the entity information.
Optionally, the communication module is further configured to communicate with a voice processing server, upload the voice received by the voice interaction module to the voice processing server, so that the voice processing server performs voice recognition and natural language understanding, send a result of the natural language understanding to the internet of things server, and send a message for terminating the voice recognition and/or the natural language understanding to the voice processing server when the device management module finds device information of the device corresponding to the entity information.
Optionally, in a case that the device management module does not find the device information of the device corresponding to the entity information, or in a case that the found device is not suitable for executing the operation corresponding to the intention information, after the voice interaction module determines that the voice is ended, the communication module sends a voice ending message to the voice processing server, and the voice processing server carries out natural language understanding on the text obtained by voice recognition and sends a natural language understanding result to the internet of things server.
Optionally, a hit flag is maintained, the hit flag is maintained as a first state indicating no hit until the device management module finds the device information of the device corresponding to the entity information, and the hit flag is set as a second state indicating hit in the case where the device management module finds the device information of the device corresponding to the entity information, and the communication module determines whether to send a message terminating speech recognition and/or natural language understanding and/or a speech ending message to the server according to the hit flag.
Optionally, the communication module receives the device information issued by the internet of things server in batches, and the data volume of each batch does not exceed a predetermined data volume threshold.
Optionally, the one or more devices are devices associated with the voice interaction device; and/or the voice interaction device is a smart speaker or a voice processing module.
According to a second aspect of the present disclosure, there is provided a voice interaction system comprising: one or more devices; and a voice interaction device on which device information of one or more devices is stored for voice interaction with the user, recognizing entity information and intention information from the user's voice, and transmitting an instruction corresponding to the intention information to a device corresponding to the entity information according to the stored device information.
Optionally, the voice interaction system further comprises: an internet of things server that manages the one or more devices and transmits device information of the one or more devices to the voice interaction device.
Optionally, the internet of things server issues the device information to the voice interaction device in batches, and the data volume of each batch does not exceed a predetermined data volume threshold.
Optionally, the voice interaction system further comprises: a voice processing server that receives the user's voice from the voice interaction device, performs voice recognition and natural language understanding on the received voice, and sends the natural language understanding result to the internet of things server. When the voice interaction device finds device information of the device corresponding to the entity information, it sends a message terminating voice recognition and/or natural language understanding to the voice processing server; when it does not find such device information, or the found device is not suitable for executing the operation corresponding to the intention information, it sends a voice end message to the voice processing server after determining that the voice has ended, and the voice processing server, in response to the voice end message, sends the natural language understanding result to the internet of things server.
According to a third aspect of the present disclosure, there is provided an apparatus control method including: locally storing device information for one or more devices; recognizing the received speech as text; analyzing the identified text to obtain entity information and intention information related thereto; and transmitting an instruction corresponding to the intention information to a device corresponding to the entity information according to the locally stored device information.
Optionally, the device control method is executed by a smart speaker device or by a voice processing module.
Optionally, the device includes an internet of things device.
Optionally, the device corresponding to the entity information includes: an intelligent home device; and/or a processing module associated with the home device.
Optionally, the method further comprises: determining whether the confidence of the recognized text reaches a predetermined confidence threshold, wherein the recognized text is analyzed only when the confidence reaches the confidence threshold.
Optionally, the method further comprises: device information for one or more devices is received from an internet of things server that manages the one or more devices.
Optionally, the method further comprises: maintaining a hit flag, keeping the hit flag in a first state indicating no hit before the device management module finds the device information of the device corresponding to the entity information, setting the hit flag in a second state indicating hit when the device management module finds the device information of the device corresponding to the entity information, sending a message for terminating voice recognition and/or natural language understanding to the voice processing server when the hit flag is in the second state, and sending a voice end message to the voice processing server after determining that the voice of the user is ended when the hit flag is in the first state.
According to a fourth aspect of the present disclosure, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of the third aspect described above.
According to a fifth aspect of the present disclosure there is also provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the third aspect above.
By migrating the speech recognition, natural language understanding, and internet of things device control instruction encapsulation work originally deployed in the cloud to the local voice interaction device, network overhead on the cloud link is reduced and there is no need to wait for the cloud speech recognition result, thereby shortening the response time from the user's utterance to device control, improving response speed, and improving user experience.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
FIG. 1 is a schematic block diagram of a voice interaction device according to the present disclosure;
FIG. 2 is a schematic block diagram of a local speech processing module according to an example of the present disclosure;
FIG. 3 is a schematic flow chart of a voice interaction method according to the present disclosure;
FIG. 4 is a flow diagram of an example of a voice interaction scheme according to the present disclosure;
fig. 5 is a flow diagram of another example of a voice interaction scheme according to the present disclosure.
FIG. 6 is a schematic diagram of a computing device that may be used to implement the above-described method according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The voice interaction system of the present disclosure includes one or more devices, and a voice interaction device.
The voice interaction device has stored thereon device information for the one or more devices.
The voice interaction device performs voice interaction with the user, recognizes entity information and intention information from the voice of the user, and transmits an instruction corresponding to the intention information to a device corresponding to the entity information.
In addition, the voice interaction system can also comprise an Internet of things server and a voice processing server. The functions between the respective devices and the server will be described in detail below.
A voice interaction apparatus and a voice interaction method according to the present invention are described below with reference to fig. 1 to 3.
Fig. 1 is a schematic block diagram of a voice interaction device according to the present disclosure. Fig. 3 is a schematic flow chart of a voice interaction method according to the present disclosure. The method realizes equipment control in a voice interaction mode, and therefore, the method can also be regarded as an equipment control method.
The voice interaction device 10 according to the present disclosure may be, for example, a smart speaker. Voice interaction with the user is achieved through cooperation between the smart speaker client and the cloud.
As shown in fig. 1, a voice interaction device 10 according to the present disclosure may include a communication module 11, a storage module 12, a voice interaction module 13, a voice processing module 14, and a device management module 15.
Unlike existing voice interaction devices such as smart speakers, the voice interaction device 10 according to the present disclosure is provided with a voice processing module 14 and a device management module 15 locally, so that the voice interaction device 10 has local speech recognition and natural language understanding capabilities, as well as the capability to manage internet of things devices.
In another example, the voice interaction device 10 may be implemented as a voice processing module. The voice processing module is associated with another device (such as an office or home device, e.g., a speaker, refrigerator, computer, or human-computer interaction device) and is communicatively connected to it in a wired or wireless manner; together, the voice processing module and the associated device implement the voice interaction and device control schemes of the present disclosure.
The functions of the modules in fig. 1 are described in detail below in conjunction with the flowchart of the voice interaction method of fig. 3.
Referring to fig. 3, the voice interaction device 10 may receive device information of one or more devices from an internet-of-things server through its communication module 11 at step S10.
The one or more devices may be devices associated with voice interaction device 10. The devices may be internet of things IOT devices managed by an internet of things IOT server.
The communication module 11 may request device information from the internet of things server in response to the voice interaction device 10 being powered on. After the storage module 12 stores the device information locally, new device information may be acquired only when it becomes available, so as to update the locally stored device information.
Due to considerations such as miniaturization, the data transmission capability of the voice interaction device 10 is often limited. Therefore, the internet of things server issues larger amounts of device information to the voice interaction device 10 in batches, with the data volume of each batch not exceeding a predetermined data volume threshold. Correspondingly, the communication module 11 receives the device information issued by the internet of things server in batches.
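Purely as an illustration (this sketch is not part of the original disclosure), the batching rule might look like the following Python snippet, where the threshold value, the function name, and the use of JSON serialization to measure batch size are all assumptions:

    import json

    MAX_BATCH_BYTES = 4096  # assumed per-batch data volume threshold

    def device_info_batches(devices, limit=MAX_BATCH_BYTES):
        """Yield lists of device-info records whose serialized size stays
        under the predetermined data volume threshold."""
        batch, size = [], 0
        for dev in devices:
            dev_size = len(json.dumps(dev).encode("utf-8"))
            if batch and size + dev_size > limit:
                yield batch
                batch, size = [], 0
            batch.append(dev)
            size += dev_size
        if batch:
            yield batch

On the receiving side, the communication module 11 would simply merge each arriving batch into the storage module 12.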
In addition, it should be understood that the voice interaction device 10 may obtain the device information in other ways, for example by presetting, transfer via a storage medium, or manual input.
In step S20, the obtained device information of the one or more devices is stored in the storage module 12.
The voice interaction module 13 performs voice interaction with the user. In steps S30 and S50, the voice processing module 14 recognizes entity information and intention information from the voice received by the voice interaction module 13. Thus, in step S60, the device management module 15 may transmit an instruction corresponding to the intention information to the device corresponding to the entity information according to the device information stored in the storage module 12.
The processing of speech by the speech processing module 14 according to the examples of the present disclosure is described in detail below with reference to fig. 2.
Fig. 2 is a schematic block diagram of a local speech processing module according to an example of the present disclosure.
As shown in fig. 2, the speech processing module 14 may include a speech recognition module (also referred to as a "(local) speech recognition engine" or "(local) ASR engine") 141, a confidence determination module 142, and a natural language understanding module (also referred to as a "(local) natural language understanding engine" or "(local) NLU engine") 143.
In step S30, the voice recognition module 141 recognizes the voice received by the voice interaction module 13 as text.
In a preferred embodiment of the present disclosure, at step S40, the confidence determination module 142 may determine whether the confidence of the recognized text reaches a predetermined confidence threshold. For example, the confidence may be evaluated by analyzing the recognized text corpus, such as its word count.
In the event that the confidence level is determined to reach the confidence threshold, the recognized text is passed to the natural language understanding module 143 for analysis.
In existing voice interaction devices, after the user finishes speaking, the device must wait about 1 second to determine that the utterance has ended; only after this determination is the cloud server notified that the speech is complete, and only then does the cloud perform NLU processing. This consumes too much time.
By making the confidence determination described above, the present disclosure can further advance the time at which NLU analysis begins, in addition to placing the voice processing functionality locally on the voice interaction device. NLU analysis can begin without waiting for a determination that the user has finished speaking.
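A minimal sketch of this idea, assuming a streaming recognizer that emits (text, confidence) pairs while the user is still speaking; the threshold value and all names are illustrative assumptions, not the disclosure's API:

    CONFIDENCE_THRESHOLD = 0.8  # assumed value

    def first_confident_text(partial_results):
        """Return the first partial transcript whose confidence clears the
        threshold, so NLU analysis can start before end-of-speech is detected."""
        for text, confidence in partial_results:
            if confidence >= CONFIDENCE_THRESHOLD:
                return text
        return None

    # Example: with partials [("turn on", 0.40), ("turn on the light", 0.92)],
    # NLU can begin as soon as the second partial arrives, instead of after the
    # roughly 1 second of trailing silence used to detect end-of-speech.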
In step S50, the recognized text is analyzed by the natural language understanding module 143 to obtain entity information and intention information related thereto.
For example, when the user utters the voice "turn on the light", the entity information is "light" and the intention information is "turn on".
Based on the result of the natural language understanding, the device management module 15 may transmit an instruction corresponding to the intention information to the device corresponding to the entity information at step S60.
Specifically, the device management module 15 may search the storage module 12 for device information of the device corresponding to the entity information.
In the case where device information of a device corresponding to the entity information is found, the device management module 15 generates an instruction and transmits the instruction to the device corresponding to the entity information.
For example, a light-on instruction may be sent to the light.
In addition, the natural language understanding module 143 of the voice processing module 14 may also recognize attribute information associated with the entity information and/or the intention information from the recognized text.
When the attribute information is associated with the entity information, the device management module 15 may determine a device corresponding to the entity information according to the attribute information.
For example, when the user utters "turn on the lamp in the living room", "living room" is attribute information associated with "lamp". The device management module 15 sends a turn-on instruction to the living room lamp.
When the attribute information is associated with the intention information, an instruction corresponding to the intention information is generated based on the attribute information.
For example, when the user utters "raise the temperature of the air conditioner by 2 degrees", the entity information is "air conditioner", the intention information is "raise temperature", and the attribute information "2 degrees" is associated with the intention information. The device management module 15 sends an instruction to the air conditioner to raise the temperature by 2 degrees.
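For illustration only, the natural language understanding results for the two utterances above might be represented as follows; the field names are assumptions rather than the disclosure's actual format:

    # Attribute associated with the entity: it narrows the device lookup.
    nlu_result_lamp = {
        "entity": "lamp",
        "intent": "turn on",
        "attribute": {"associated_with": "entity", "value": "living room"},
    }

    # Attribute associated with the intent: it parameterizes the instruction.
    nlu_result_air_conditioner = {
        "entity": "air conditioner",
        "intent": "raise temperature",
        "attribute": {"associated_with": "intent", "value": "2 degrees"},
    }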
Attribute information differs across device types. For example, a lamp generally supports power on/off, color temperature adjustment, and temperature control, while higher-end lamps also support color adjustment.
Thus, the format of the instructions may also vary for different types of devices.
Accordingly, the device information received from the internet of things server (for example, via the communication module 11) and stored in the storage module 12 may include, in addition to the device's ID, type, and the like, an instruction protocol for the device's instructions. The instruction protocol may include information such as the operation code, the format of attribute values, and the number of bytes occupied by each field, thereby defining the format of the instruction.
The device management module 15 may generate (i.e., encapsulate) instructions according to the instruction protocol. After receiving an instruction generated in this way, the corresponding device can perform the corresponding operation.
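A minimal sketch of such protocol-driven encapsulation, assuming a byte-packed instruction format; the opcodes, field layout, and protocol representation are invented for illustration and are not taken from the disclosure:

    import struct

    # Hypothetical instruction protocol entry as it might arrive in the
    # device information: an opcode table plus a field layout.
    LAMP_PROTOCOL = {
        "opcodes": {"turn on": 0x01, "turn off": 0x02, "set color temp": 0x10},
        "layout": ">BBH",  # big-endian: 1-byte opcode, 1-byte flags, 2-byte value
    }

    def encapsulate(protocol, intent, value=0):
        """Pack an instruction according to the device's instruction protocol."""
        opcode = protocol["opcodes"][intent]
        return struct.pack(protocol["layout"], opcode, 0, value)

    packet = encapsulate(LAMP_PROTOCOL, "turn on")  # -> b'\x01\x00\x00\x00'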
In this way, by issuing instruction protocol information to the voice interaction device 10, even if various types of devices exist, even if new types of devices are added, the device management module 15 local to the voice interaction device 10 can appropriately generate instructions respectively applicable to these devices.
The above describes the scheme in which the voice interaction device 10 performs voice processing locally and the local device management module, acting as the local internet of things control end, sends instructions to devices to realize device control based on the user's voice.
In this way, since the device information of the device is stored locally in the voice interaction device 10, the voice interaction device 10 can realize control of the device even in the case of a network disconnection.
On the other hand, the voice interaction device 10 may also upload the voice to the voice processing server at the same time. If local voice control processing fails, device control can still be achieved through the voice processing server and the internet of things server. In fact, local processing may fail with some probability due to limitations in the processing power of the local device and errors in confidence determination. The cloud ASR, cloud NLU, and cloud IOT path can guarantee the operation the user expects.
Thus, the communication module can also be used for communicating with the voice processing server, and uploading the voice received by the voice interaction module to the voice processing server. The voice processing server may include, for example, a cloud ASR server (voice recognition server) and a cloud NLU server (natural language understanding server), respectively perform voice recognition and natural language understanding, and send the result of the natural language understanding to a cloud IOT server (internet of things server).
Also, the result of natural language understanding may include entity information and intention information, and additionally may include attribute information. The cloud IOT server may generate an instruction that points to a device corresponding to the entity information and corresponds to the intent information based on the natural language understanding result output by the cloud NLU server. The cloud IOT server may directly send the instruction to the corresponding device, or may forward the instruction to the corresponding device via the voice interaction device 10 or a gateway to which the voice interaction device 10 is connected.
At this point, two paths, local ASR plus local IOT and cloud ASR plus cloud IOT, process the user's voice simultaneously. Idempotency of the two paths must be ensured, i.e., the device must not receive two duplicate instructions because the request was issued through both paths.
To address this problem, a hit flag may be maintained. The hit mark may be provided in the communication module 11, in the device management module 15, in the storage module 12 or in another module.
Before the device management module 15 finds the device information of the device corresponding to the entity information, the hit flag is kept in a first state indicating no hit, for example, set to "0". In some cases, it may be configured that the hit flag is kept in the first state either when the device information of the device corresponding to the entity information is not found, or when it is found but the device is not suitable for performing the operation corresponding to the intention information.
In the case where the device management module finds the device information of the device corresponding to the entity information, the hit flag is set to a second state indicating a hit, for example, to "1". In some cases, it may be configured that, in a case where device information of a device corresponding to the entity information is found and the found device is adapted to perform an operation corresponding to the intention information, the hit flag is set to a second state indicating a hit.
The communication module 11, or a control module (not shown in the figure) of the voice interaction device 10, can learn the result of the current local processing from the state of the hit flag.
In the case where the device management module 15 finds the device information of the device corresponding to the entity information (hit flag is in the second state), the communication module 11 may send a message to the voice processing server to terminate voice recognition and/or natural language understanding. In general, the speech processing server has not yet completed natural language understanding at this time. In this way, the cloud processing link is stopped and the local device management module 15 sends instructions to the device.
On the other hand, in the case where the device management module 15 does not find the device information of the device corresponding to the entity information, or in the case where the found device is not suitable for performing the operation corresponding to the intention information, in other words, in the case where local processing fails, the communication module 11 transmits a voice end message to the voice processing server after the voice interaction module 13 determines that the voice has ended.
Accordingly, as described above, the voice processing server performs natural language understanding on the text obtained by voice recognition and sends the result of the natural language understanding to the internet of things server. The internet of things server then sends the corresponding instruction to the device.
Therefore, even if local voice control processing fails, device control can be achieved through the cloud, while idempotency between the local and cloud paths is ensured.
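The decision logic around the hit flag might be sketched as follows; the state values and message names are assumptions for illustration:

    MISS, HIT = 0, 1  # first state: no hit; second state: hit

    def decide_cloud_message(hit_flag, voice_ended):
        """Choose which message the communication module sends to the voice
        processing server, keeping the local and cloud paths idempotent."""
        if hit_flag == HIT:
            # The local path will issue the instruction: stop the cloud link
            # before its NLU result can trigger a duplicate instruction.
            return "TERMINATE_ASR_AND_NLU"
        if voice_ended:
            # The local path failed: hand the request over to the cloud link.
            return "VOICE_END"
        return None  # speech still in progress; keep waiting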
Two examples of the voice interaction scheme of the present disclosure are described below with reference to fig. 4 and 5.
Fig. 4 is a flow diagram of an example of a voice interaction scheme according to the present disclosure.
In fig. 4 and 5, the communication module and the control module are collectively referred to as "communication and control module 110". It should be understood that the two modules may be separate or may be together. The control module may be responsible for waking up the voice interaction device 100, controlling the modules, etc.
After the voice interaction device 100 is powered on in step S101, in step S102, the communication and control module 110 requests IOT device information from the IOT server.
IOT device information issued from IOT server 500 is received in step S103.
In step S104, the acquired IOT device information is stored in the storage module 120 local to the voice interaction device 100.
In step S105, the voice interaction module 130 receives a voice command of the user, such as "turn on a lamp". In addition, a preset wake-up word can also be arranged before the voice command.
After the voice interaction device 100 is awakened in step S106, the voice received by the voice interaction module 130 is provided to the local ASR engine 140 (i.e., the local voice processing module) for text recognition. The local ASR engine 140 is responsible for recognizing the user's speech as text and for natural language understanding (NLU). The NLU output includes entity information of the spoken text, intention information, corpus attribute information, and the like.
The local ASR engine 140 has a confidence threshold set within it. If the user's corpus reaches the confidence threshold, NLU processing is performed on the recognized text in step S107 and the corresponding NLP information is output, for example {"intent": "open", "category": "lamp"}, meaning that the intention information (intent) is "open" and the entity information (category) is "lamp".
After this processing, the local ASR module sends the NLU information directly to the local IOT module 150 (i.e., the device management module) of the voice interaction device 100. The local IOT module 150 is responsible for IOT device hits, IOT device instruction encapsulation, and the like.
In step S108, the local IOT module searches the storage module 120 for IOT device information corresponding to the entity information "lamp" according to the NLU information. If there is a hit (i.e., the IOT device's information is found), it may further be determined in step S109 whether the device supports local control and whether it supports the control indicated by the intention information.
If so, local IOT module 150 encapsulates IOT device control instructions and sends the encapsulated instructions to corresponding IOT devices 600 in step S110.
In step S111, the control module on the IOT device performs IOT device control, such as a turn-on operation, according to the received instruction.
On the original cloud link, the time until the microphone is turned off is long, averaging about 3 to 4 seconds. After the user finishes speaking, the voice interaction device 100 must determine whether the utterance has ended, which takes about 1 second of waiting. In this scenario, optimizing the original cloud link cannot deliver an ideal user experience.
By migrating the cloud link to the terminal, so that the link consists of a local ASR, a local NLU, and a localized IOT control module, the cloud is bypassed, localized support is achieved, the user gets a second-level response experience, and offline control becomes possible.
Fig. 5 is a flow diagram of another example of a voice interaction scheme according to the present disclosure.
In the example shown in FIG. 5, in addition to local ASR engine 140, local IOT module 150, there are cloud ASR 300, cloud NLU 400, IOT server 500. Cloud ASR 300, cloud NLU 400 may be collectively referred to as a "speech processing server". Gateway 200 may be a server gateway.
At the same time as the local ASR engine 140 performs speech recognition, the communication and control module 110 also uploads the voice to the cloud.
Here, by maintaining a hit flag, idempotency of the two paths can be ensured, preventing the same instruction from being sent to the device twice for the same voice.
In step S100, the user utters a voice "turn on a lamp", and the voice interaction module 130 receives the voice of the user.
In step S200, the communication and control module 110 uploads the voice audio to the voice processing server. At S210, gateway 200 uploads the audio to cloud ASR 300, performing speech recognition.
On the other hand, in step S310, the voice is provided to the local ASR engine 140 (the voice processing module) in real time. In this case, the microphone can be turned off about 400 ms earlier.
The local ASR engine 140 performs speech recognition and determines whether the recognized text corpus reaches a confidence threshold at step S320.
In the event that the confidence threshold is reached, the NLU result of the local ASR engine 140 is provided to the local IOT module 150 (device management module) at step S410.
In step S420, local IOT module 150 looks up the entity corresponding to the NLU result in storage module 120.
If there is no hit, the hit flag is not modified in step S431, and the cloud link continues.
If the hit flag is not modified, i.e., it indicates a local miss, then after the voice interaction module 130 determines that the user's voice has ended, the communication and control module 110 sends a voice end message (vadEnd) to the server in step S500.
In response to the voice end message, in step S520, the ASR result output by the cloud ASR 300 in step S510 is sent to the cloud NLU 400.
In step S530, the NLU result of the cloud NLU 400 is sent to the IOT server 500.
IOT server 500 generates a corresponding instruction for the corresponding device according to the NLU result of cloud NLU 400, and may, for example, directly or via gateway 200 and/or local IOT module 150 of the voice interaction device, send the instruction to IOT device 600.
IOT device 600 executes the instructions to implement the operations that the user's voice desires to implement.
In this way, even if the local voice control processing fails, the user's desired operation can be achieved via the cloud link.
On the other hand, if a hit occurs at step S420, the local IOT module 150 performs device control locally at step S432, i.e., it encapsulates the instruction and sends it to the corresponding IOT device 600. In this case, the hit flag is modified to indicate a local hit and that the control instruction has been issued.
In addition, IOT devices may include devices that support local terminal-side control (e.g., via Bluetooth) and devices accessed via a third-party cloud. Devices accessed via a third-party cloud can only be controlled through cloud requests. Therefore, for such a device, a control request may be sent to the third-party cloud in step S440. When the network is disconnected, devices accessed via a third-party cloud cannot be controlled.
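A sketch of this dispatch decision, with the access-type field and handler names assumed for illustration:

    def dispatch_instruction(device, instruction, network_up,
                             send_local, send_to_third_party_cloud):
        """Route a control instruction based on how the device is accessed."""
        if device.get("access") == "local":
            # Terminal-side control (e.g., via Bluetooth): works even offline.
            send_local(device["id"], instruction)
        elif device.get("access") == "third_party_cloud":
            if not network_up:
                raise ConnectionError("third-party cloud devices cannot be controlled offline")
            send_to_third_party_cloud(device["id"], instruction)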
In step S600, in response to the hit flag having been modified to indicate a local hit, the communication and control module 110 sends a message to the server, e.g., to the gateway 200, sending an end frame ahead of time, terminating ASR.
At this point, although the cloud ASR 300 may have already recognized a portion of the text (S610), the cloud NLU request is blocked at step S620 and the ASR result text is not sent to the cloud NLU 400. The cloud link thus ends.
This prevents the cloud link from issuing the same control instruction again when the instruction has already been issued locally, ensuring idempotency.
Fig. 6 is a schematic diagram of a computing device that may be used to implement the above-described voice interaction method and device control method according to an embodiment of the present invention.
Referring to fig. 6, computing device 1000 includes memory 1010 and processor 1020.
Processor 1020 may be a multi-core processor or may include multiple processors. In some embodiments, processor 1020 may comprise a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, processor 1020 may be implemented using custom circuitry, for example an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
Memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The persistent storage may be a readable and writable, non-volatile device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some implementations, memory 1010 may include readable and/or writable removable storage devices, such as compact discs (CDs), digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, super-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards), and magnetic floppy disks. The computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, causes the processor 1020 to perform the voice interaction method and the device control method described above.
The voice interaction apparatus, the voice interaction system, and the voice interaction method according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, speech processing server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (24)

1. A voice interaction device, comprising:
a storage module for storing device information of one or more devices;
the voice interaction module is used for carrying out voice interaction with a user;
the voice processing module is used for recognizing entity information and intention information from the voice received by the voice interaction module;
the device management module is used for sending an instruction corresponding to the intention information to the device corresponding to the entity information according to the device information stored by the storage module; and
a communication module for communicating with a voice processing server, uploading the voice received by the voice interaction module to the voice processing server so that the voice processing server can perform voice recognition and natural language understanding, and sending the result of the natural language understanding to an internet of things server managing the one or more devices,
in the case that the device management module finds device information of a device corresponding to the entity information from the storage module, the communication module transmits a message terminating voice recognition and/or natural language understanding to the voice processing server.
2. The voice interaction device of claim 1, wherein the voice processing module comprises:
a voice recognition module for recognizing the voice as text;
a confidence determination module for determining whether the confidence of the recognized text reaches a predetermined confidence threshold; and
a natural language understanding module for analyzing the recognized text to obtain the entity information and the intention information when the confidence reaches the confidence threshold.
3. The voice interaction device of claim 1, wherein
the speech processing module also identifies attribute information associated with the entity information and/or the intent information from the speech,
the device management module determines a device corresponding to the entity information according to the attribute information and/or generates an instruction corresponding to the intention information based on the attribute information.
4. The voice interaction device of claim 1, wherein
the device information of a device comprises an instruction protocol of instructions for the device,
the device management module generates the instruction according to the instruction protocol.
5. The voice interaction device of claim 1, wherein
the communication module is further configured to communicate with the internet of things server, and receive device information of the one or more devices from the internet of things server.
6. The voice interaction device of claim 5, wherein
the device management module searches the storage module for device information of a device corresponding to the entity information,
and when the device information of the device corresponding to the entity information is found, the device management module generates the instruction and sends the instruction to the device corresponding to the entity information.
7. The voice interaction device of claim 1, wherein
in the case where the device management module does not find device information of a device corresponding to the entity information or in the case where the found device is not suitable for performing an operation corresponding to the intention information, the communication module transmits a voice ending message to the voice processing server after the voice interaction module determines that the voice is ended,
and the voice processing server carries out natural language understanding on the text obtained by voice recognition and sends a natural language understanding result to the Internet of things server.
8. The voice interaction device of claim 7, wherein
a hit flag is maintained,
before the device management module finds the device information of the device corresponding to the entity information, the hit flag is kept in a first state indicating no hit,
in the case where the device management module finds the device information of the device corresponding to the entity information, the hit flag is set to a second state indicating a hit, and
the communication module determines whether to send the message terminating voice recognition and/or natural language understanding and/or the voice ending message to the server according to the hit flag.
9. The voice interaction device of claim 5, wherein
the communication module receives the device information issued by the internet of things server in batches, and the data volume of each batch does not exceed a predetermined data volume threshold.
10. The voice interaction device of claim 1, wherein
the one or more devices are devices associated with the voice interaction device; and/or
the voice interaction device is a smart speaker or a voice processing module.
11. A voice interactive system, comprising:
one or more devices;
the voice interaction device is used for carrying out voice interaction with a user, identifying entity information and intention information from the voice of the user, and sending an instruction corresponding to the intention information to a device corresponding to the entity information according to the stored device information; and
a voice processing server for receiving the voice of the user from the voice interaction device, performing voice recognition and natural language understanding on the received voice, and transmitting the result of the natural language understanding to an internet of things server managing the one or more devices,
and under the condition that the voice interaction equipment searches the equipment information of the equipment corresponding to the entity information, the voice interaction equipment sends a message for terminating voice recognition and/or natural language understanding to the voice processing server.
12. The voice interaction system of claim 11, further comprising:
an internet of things server for transmitting the device information of the one or more devices to the voice interaction device.
13. The voice interaction system of claim 12, wherein
the internet of things server issues the device information to the voice interaction device in batches, wherein the data volume of each batch does not exceed a predetermined data volume threshold.
14. The voice interactive system of claim 11, wherein
in the case where the voice interaction device does not find device information of a device corresponding to the entity information, or in the case where the found device is not suitable for performing the operation corresponding to the intention information, the voice interaction device transmits a voice ending message to the voice processing server after determining that the voice has ended, and
the voice processing server, in response to the voice ending message, sends the natural language understanding result to the internet of things server.
15. The voice interaction system according to claim 11, wherein the voice interaction device is a voice interaction device according to any of claims 1 to 10.
16. A device control method, characterized by comprising:
locally storing device information for one or more devices;
recognizing the received voice as text;
analyzing the recognized text to obtain entity information and intention information related thereto;
transmitting an instruction corresponding to the intention information to the device corresponding to the entity information according to the locally stored device information;
uploading the received voice to a voice processing server so that the voice processing server performs voice recognition and natural language understanding and sends the result of the natural language understanding to an Internet of Things server managing the one or more devices; and
sending a message for terminating voice recognition and/or natural language understanding to the voice processing server in a case where device information of the device corresponding to the entity information is found locally.
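Read as code, the method of claim 16 is a short pipeline whose last two steps run the cloud path and the early-termination check. A sketch that follows the claimed step order, with all collaborators (asr, nlu, speech_server, send) assumed:

```python
def device_control(voice, asr, nlu, device_info, speech_server, send):
    text = asr.recognize(voice)            # recognize the received voice as text
    entity, intent = nlu.analyze(text)     # extract entity and intention info
    speech_server.upload(voice)            # cloud path runs in parallel
    device = device_info.get(entity)       # look up the locally stored table
    if device is not None:
        send(device, intent)               # instruction to the matched device
        speech_server.terminate()          # local hit: cancel cloud ASR/NLU
```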
17. The device control method according to claim 16, wherein
the device control method is executed by a smart speaker device or by a voice processing module.
18. The device control method according to claim 16, wherein
the device comprises an Internet of Things device.
19. The device control method according to claim 16, wherein the device corresponding to the entity information comprises:
a smart home device; and/or
a processing module associated with the smart home device.
20. The device control method according to claim 16, characterized by further comprising:
determining whether a confidence level of the recognized text reaches a predetermined confidence threshold,
wherein the recognized text is analyzed if the confidence level is determined to reach the confidence threshold.
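The confidence gate in claim 20 keeps a shaky local transcript from driving the wrong device while the cloud path remains available as a fallback. A minimal sketch; the 0.8 threshold is an illustrative assumption:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed predetermined confidence threshold

def maybe_analyze(text, confidence, analyze):
    """Analyze the local transcript only when ASR confidence is high enough."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return analyze(text)   # proceed to entity/intent extraction
    return None                # otherwise rely on the cloud NLU result
```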
21. The device control method according to claim 16, characterized by further comprising:
receiving device information of the one or more devices from the Internet of Things server.
22. The device control method according to claim 16, characterized by further comprising:
maintaining a hit flag, wherein:
while device information of a device corresponding to the entity information has not been found locally, the hit flag is maintained in a first state indicating a miss;
when device information of a device corresponding to the entity information is found locally, the hit flag is set to a second state indicating a hit;
when the hit flag is in the second state, a message for terminating voice recognition and/or natural language understanding is sent to the voice processing server; and
when the hit flag is in the first state, a voice end message is sent to the voice processing server after it is determined that the user's voice has ended.
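Claim 22 fixes when each message goes out: the terminate message as soon as the flag flips to the second state, the voice-end message only if the utterance finishes while the flag is still in the first state. A dispatch sketch reusing the assumed HitFlag enum from the earlier sketch; the message names are also assumptions:

```python
def on_hit(comm):
    # Entering the second state: cancel the in-flight cloud ASR/NLU at once.
    comm.send("TERMINATE_RECOGNITION")

def on_voice_end(hit_flag, comm):
    # Utterance finished while still in the first state: let the server
    # complete its NLU and forward the result to the IoT server instead.
    if hit_flag is HitFlag.MISS:
        comm.send("VOICE_END")
```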
23. A computing device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of any one of claims 16 to 22.
24. A non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 16 to 22.
CN201910199373.9A 2019-03-15 2019-03-15 Voice interaction device and system, device control method, computing device and medium Active CN111696534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910199373.9A CN111696534B (en) 2019-03-15 2019-03-15 Voice interaction device and system, device control method, computing device and medium

Publications (2)

Publication Number Publication Date
CN111696534A CN111696534A (en) 2020-09-22
CN111696534B true CN111696534B (en) 2023-05-23

Family

ID=72475497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910199373.9A Active CN111696534B (en) 2019-03-15 2019-03-15 Voice interaction device and system, device control method, computing device and medium

Country Status (1)

Country Link
CN (1) CN111696534B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287108B (en) * 2020-10-29 2022-08-16 四川长虹电器股份有限公司 Intention recognition optimization method in field of Internet of things
CN112637264B (en) * 2020-11-23 2023-04-21 阿波罗智联(北京)科技有限公司 Information interaction method and device, electronic equipment and storage medium
EP4318464A1 (en) * 2021-04-17 2024-02-07 Huawei Technologies Co., Ltd. Speech interaction method and apparatus
CN113593573B (en) * 2021-07-30 2024-01-12 思必驰科技股份有限公司 Machine interaction method and device
CN116192554A (en) * 2023-04-25 2023-05-30 山东工程职业技术大学 Voice-based Internet of things equipment control method and system
CN116541118B (en) * 2023-06-29 2023-10-13 新华三技术有限公司 Network equipment management method and device and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101505550B (en) * 2008-02-04 2012-08-22 华为技术有限公司 Method, terminal, apparatus and system for device management
US9443527B1 (en) * 2013-09-27 2016-09-13 Amazon Technologies, Inc. Speech recognition capability generation and control
CN106469040B (en) * 2015-08-19 2019-06-21 华为终端有限公司 Communication means, server and equipment
CN105096952A (en) * 2015-09-01 2015-11-25 联想(北京)有限公司 Speech recognition-based auxiliary processing method and server
JP2017192091A (en) * 2016-04-15 2017-10-19 泰安 盧 IOT system with voice control function and information processing method thereof
CN105913847B (en) * 2016-06-01 2020-10-16 北京京东尚科信息技术有限公司 Voice control system, user end equipment, server and central control unit
US10319375B2 (en) * 2016-12-28 2019-06-11 Amazon Technologies, Inc. Audio message extraction
US10971157B2 (en) * 2017-01-11 2021-04-06 Nuance Communications, Inc. Methods and apparatus for hybrid speech recognition processing
WO2018182311A1 (en) * 2017-03-28 2018-10-04 Samsung Electronics Co., Ltd. Method for operating speech recognition service, electronic device and system supporting the same
KR102060775B1 (en) * 2017-06-27 2019-12-30 삼성전자주식회사 Electronic device for performing operation corresponding to voice input

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108683937A (en) * 2018-03-09 2018-10-19 百度在线网络技术(北京)有限公司 Interactive voice feedback method, system and the computer-readable medium of smart television

Also Published As

Publication number Publication date
CN111696534A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111696534B (en) Voice interaction device and system, device control method, computing device and medium
JP6683234B2 (en) Audio data processing method, device, equipment and program
US10657964B2 (en) Method for controlling smart device, computer device and storage medium
WO2020042993A1 (en) Voice control method, apparatus and system
US10185534B2 (en) Control method, controller, and recording medium
CN109658932B (en) Equipment control method, device, equipment and medium
JP2020190752A (en) Recorded media hotword trigger suppression
US11031008B2 (en) Terminal device and method for controlling thereof
US20200043490A1 (en) Electronic device and method for determining electronic device to perform speech recognition
US20200349952A1 (en) Hub device, multi-device system including the hub device and plurality of devices, and method of operating the same
WO2018133307A1 (en) Method and terminal for implementing voice control
WO2020233363A1 (en) Speech recognition method and device, electronic apparatus, and storage medium
CN110062309B (en) Method and device for controlling intelligent loudspeaker box
CN112929246B (en) Processing method of operation instruction, storage medium and user terminal
WO2020015283A1 (en) Device control method and apparatus, storage medium and electronic apparatus
CN111309857A (en) Processing method and processing device
US20210210088A1 (en) Speech interaction method and apparatus, device and storage medium
WO2019080681A1 (en) Method and system for controlling mobile robot to perform mapping
CN111240634A (en) Sound box working mode adjusting method and device
KR20210004803A (en) Electronic apparatus and controlling method thereof
JP7375089B2 (en) Method, device, computer readable storage medium and computer program for determining voice response speed
CN112152890A (en) Control system and method based on intelligent sound box
CN111176699A (en) Control method, control device, and computer storage medium
KR20190058918A (en) Apparatus and method for processing voice command of vehicle
JP6944920B2 (en) Smart interactive processing methods, equipment, equipment and computer storage media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant