CN112053683A - Voice instruction processing method, device and control system - Google Patents

Voice instruction processing method, device and control system

Info

Publication number
CN112053683A
Authority
CN
China
Prior art keywords
user
controlled
voice
equipment
object information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910492557.4A
Other languages
Chinese (zh)
Inventor
林文彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910492557.4A
Priority to PCT/CN2020/094323
Publication of CN112053683A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 - Data switching networks
    • H04L12/28 - Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Selective Calling Equipment (AREA)

Abstract

The invention discloses a method for processing a voice instruction, comprising the following steps: identifying the user's behavioral intention and control object information from the voice instruction; determining the device to be controlled based on the area where the user is located and the control object information; and generating a control instruction for the device to be controlled based on the behavioral intention. A corresponding voice-instruction processing apparatus and control system are also disclosed.

Description

Voice instruction processing method, device and control system
Technical Field
The present invention relates to the field of voice processing technologies, and in particular, to a method, a device, and a control system for processing a voice command.
Background
Over the past decade, the internet has reached into every area of daily life: people can conveniently shop, socialize, entertain themselves, and manage their finances online. The internet and intelligent devices now permeate every aspect of people's lives.
A number of smart voice devices are already on the market, such as smart speakers and various smart electronic devices (e.g., mobile devices and wearable electronic devices) that include a smart interaction module. In some usage scenarios, a smart voice device can recognize voice data input by the user through speech recognition technology and thereby provide personalized services. However, the user intent that a smart voice device can understand from a single voice message is limited.
What is needed, therefore, is a speech recognition scheme that improves the efficiency of speech recognition and provides a better interactive experience for users.
Disclosure of Invention
To this end, the present invention provides a method, an apparatus, and a control system for processing voice instructions, in an attempt to solve, or at least alleviate, at least one of the problems identified above.
According to one aspect of the present invention, there is provided a method for processing a voice instruction, comprising the steps of: identifying the user's behavioral intention and control object information from the voice instruction; determining the device to be controlled based on the area where the user is located and the control object information; and generating a control instruction for the device to be controlled based on the behavioral intention.
Optionally, the method according to the invention further comprises the step of: sending the control instruction to the device to be controlled, so that the device to be controlled performs the operation in the control instruction.
Optionally, the method according to the invention further comprises the steps of: acquiring a monitoring image, the monitoring image containing at least one device; generating at least one region in advance based on the monitoring image; and associating at least one region with each device.
Optionally, in the method according to the present invention, the step of determining the device to be controlled based on the area where the user is located and the control object information comprises: determining the area where the user is located; determining the devices associated with that area; and determining the device to be controlled from the determined devices based on the control object information.
Optionally, in the method according to the present invention, the step of determining the area where the user is located comprises: acquiring a current monitoring image, the monitoring image containing the user and at least one device; and determining the area where the user is located from the monitoring image.
Optionally, in the method according to the present invention, the step of determining the area where the user is located from the monitoring image comprises: detecting the user from the current monitoring image through human body detection; and determining the area in which the detected user is located.
Optionally, in the method according to the present invention, the step of determining the device to be controlled from the determined devices based on the control object information further comprises: selecting, based on the control object information, the device closest to the user from the determined devices as the device to be controlled.
Optionally, in the method according to the present invention, the step of determining the device to be controlled from the determined devices based on the control object information further comprises: extracting a detected predetermined gesture of the user; and determining the device to be controlled from the determined devices by combining the control object information with the predetermined gesture.
Optionally, in the method according to the present invention, the step of generating at least one region in advance based on the monitoring image comprises: generating at least one region in advance based on the monitoring image, in combination with the indoor spatial distribution and the locations of the devices.
Optionally, in the method according to the present invention, the step of generating at least one region in advance based on the monitoring image comprises: generating at least one region in advance based on the monitoring image and a user-defined region distribution.
According to an aspect of the present invention, there is also provided a method for processing a voice instruction, comprising the steps of: identifying control object information from the voice instruction; determining the device to be controlled based on the area where the user is located and the control object information; and generating a control instruction for the device to be controlled.
According to an aspect of the present invention, there is also provided a method for processing a voice instruction, comprising the steps of: receiving a voice instruction; determining, based on the voice instruction and a monitoring image, the user's behavioral intention and the device to be controlled within the monitoring image; and generating a control instruction for the device to be controlled according to the determined behavioral intention.
According to another aspect of the present invention, there is provided a voice-instruction processing apparatus, comprising: a first processing unit adapted to recognize the user's behavioral intention and control object information from a voice instruction; a second processing unit adapted to determine the device to be controlled based on the area where the user is located and the control object information; and an instruction generating unit adapted to generate a control instruction for the device to be controlled based on the behavioral intention.
According to another aspect of the present invention, there is also provided a control system for voice instructions, comprising: a voice interaction device adapted to receive a user's voice instruction; an image capture device adapted to capture monitoring images; at least one device; and a processing device, coupled to the voice interaction device, the image capture device, and the at least one device, adapted to determine the user's behavioral intention and the device to be controlled from among the at least one device based on the voice instruction and the monitoring image, and to generate a control instruction for the device to be controlled so that the device to be controlled performs the operation in the control instruction.
According to another aspect of the present invention, there is also provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method as described above.
According to another aspect of the present invention, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method as described above.
According to the solution of the invention, the user's behavioral intention and control object information are identified by analyzing the voice instruction, and the device the user wants to control is then determined from the control object information. More specifically, each device is associated with a region in a monitoring image, and the device the user wants to control is determined by analyzing that image.
In present-day scenarios where household devices (especially smart devices) are diverse and ever more numerous, the solution of the invention means that when a user wants to control a device by voice, there is no need to append the device's location each time (e.g., "turn on the living room air conditioner", "turn on the master bedroom air conditioner", "turn on the study air conditioner"); the user can simply say to turn the device on or off, which greatly improves the user experience.
The foregoing is only an overview of the technical solutions of the present invention. So that the technical means of the invention may be understood more clearly, and the above and other objects, features, and advantages may become more readily apparent, embodiments of the invention are described below.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 illustrates a scene schematic of a control system 100 for voice commands according to some embodiments of the invention;
FIG. 2 illustrates a schematic diagram of a computing device 200, according to some embodiments of the invention;
FIG. 3 illustrates a flow diagram of a method 300 of processing voice instructions according to some embodiments of the invention;
FIG. 4 shows a schematic diagram of a monitoring image according to an embodiment of the invention;
FIG. 5 shows a schematic diagram of a monitoring image according to another embodiment of the invention; and
FIG. 6 illustrates a schematic diagram of a processing device 140 for voice instructions according to some embodiments of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 illustrates a scene schematic of a control system 100 for voice commands according to some embodiments of the invention. As shown in FIG. 1, the system 100 includes a voice interaction device 110, an image capture device 120, at least one device 130, and a processing device 140 for voice instructions. It should be noted that the system 100 shown in fig. 1 is only an example, and those skilled in the art will understand that in practical applications, the system 100 may contain a plurality of voice interaction devices 110 and image capture devices 120, for example, in a home scene, one voice interaction device 110 and one image capture device 120 may be respectively arranged in each room. The present invention does not limit the number of devices included in the system 100.
The voice interaction device 110 is a device having a voice interaction module, and can receive a voice command from a user and return a corresponding response to the user, where the response may include voice or non-voice information. A typical voice interaction module includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor. The voice interaction module may be built in the voice interaction device 110, or may be used as a separate module to cooperate with the voice interaction device 110 (for example, to communicate with the voice interaction device 110 via an API or by other means, and to call a service of a function or an application interface on the voice interaction device 110), which is not limited by the embodiment of the present invention. The voice interaction device 110 may be, for example, a smart speaker with a voice interaction module, a smart robot, other mobile devices, and the like, without being limited thereto.
Image capture device 120 is used to monitor the dynamics of a scene that includes the user and the devices 130. In some embodiments, image capture device 120 captures video images of the scene as monitoring images. One application scenario for the system 100 is the home, in which case there may be more than one image capture device 120. In some embodiments, one image capture device 120 is disposed in each bedroom, living room, dining room, kitchen, balcony, and so on; where a space is large (e.g., a living room), more than one image capture device 120 may be arranged.
The device 130 may be, for example, any of various smart devices, such as a mobile terminal or a wearable device, or a simpler device. In a home scenario, the device 130 may be a smart television, smart refrigerator, smart air conditioner, smart microwave oven, or smart curtain, or a simple household device such as a switch, as long as it can communicate with the voice-instruction processing device 140 through a communication module.
According to some embodiments, the user may issue voice instructions to the voice interaction device 110 to perform certain functions, such as browsing the internet, requesting songs, shopping, or checking the weather forecast; the user may also control the device 130 by voice, for example adjusting the smart air conditioner to a certain temperature, having the smart television play a movie, turning a smart light fixture on or off or adjusting its color temperature, and opening or closing a smart curtain.
The voice interaction device 110, the image capture device 120, and the device 130 are all coupled to the voice-instruction processing device 140 via a network for communication.
According to the embodiment of the present invention, the voice interaction device 110 receives a voice instruction of the user in the wake state and transmits the voice instruction to the processing device 140, so that the processing device 140 recognizes the behavioral intention and the control object information of the user when receiving the voice instruction. The control object information includes information of any one of the devices 130, such as, but not limited to, a device name, a device class, a device identification, and the like. The processing device 140 can determine the device to be controlled to which the control object information is directed by recognizing the control object information.
Of course, the voice interaction device 110 may itself have speech recognition capability: on receiving the user's voice instruction it first recognizes the instruction, identifying the user's behavioral intention and the control object information, and sends these recognition results to the processing device 140. For example, the user issues the voice instruction "turn on the air conditioner"; recognition of this instruction reveals that the user's behavioral intention is "turn on" and the control object information is "air conditioner".
Then, the processing device 140 acquires the monitoring image for this moment from the image capture device 120. According to the embodiment of the present invention, the processing device 140 may acquire the monitoring image at the moment the voice instruction is received, or the monitoring images at and shortly before that moment (e.g., the 5 seconds before the voice instruction is received), without being limited thereto. In some embodiments, the processing device 140 may acquire monitoring images from all of the image capture devices 120. In other embodiments, the processing device 140 may store a pre-established association between voice interaction devices 110 and image capture devices 120, so that after receiving a voice instruction from a voice interaction device 110 it acquires the monitoring image from the associated image capture device 120. The embodiments of the present invention are not limited thereto.
In this way, the voice-instruction processing device 140 determines, based on the voice instruction and the monitoring image, the user's behavioral intention and the device to be controlled 130 within the monitoring image; it then generates a control instruction for the device to be controlled 130 according to the behavioral intention, and sends the control instruction to that device so that it performs the operation in the instruction (the specific processing of the voice instruction by the processing device 140 is described in detail below in connection with the method 300).
In one embodiment, the processing device 140 of the voice instructions may be, for example, a cloud server physically located at one or more sites. It should be noted that the processing device 140 of the voice command can also be implemented as other electronic devices connected to the voice interaction device 110 or the like through a network (e.g., other computing devices in an internet of things environment). The processing device 140 of the voice instructions may also be implemented as the voice interaction device 110 itself, in case the voice interaction device 110 has sufficient memory capacity and computing power. Furthermore, the image capturing device 120 may also be arranged as a part of the voice interaction device 110, i.e. a voice interaction device 110 integrating voice interaction, image capturing and voice instruction processing is realized. Embodiments of the invention are not so limited.
According to an embodiment of the present invention, the voice interaction device 110, the image capture device 120, the device 130, and the processing device 140 of voice instructions in the system 100 may all be implemented by a computing device 200 as described below. FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention.
As shown in FIG. 2, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 204 may include one or more levels of cache, such as a level-one cache 210 and a level-two cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 260, which may be arranged to facilitate communications with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 200 may be implemented as a personal computer including desktop and notebook computer configurations, as well as a server, such as a file server, database server, application server, WEB server, and the like. Of course, computing device 200 may also be implemented as part of a small-sized portable (or mobile) electronic device. In an embodiment in accordance with the invention, computing device 200 is configured to perform a method 300 of processing voice instructions in accordance with the invention. The application 222 of the computing device 200 includes a plurality of program instructions that implement the method 300 according to the present invention.
FIG. 3 illustrates a flow diagram of a method 300 of processing voice instructions according to some embodiments of the invention. The method is suitable for execution in a processing device 140 of speech instructions. Referring to FIG. 3, the method 300 begins in step S310.
In step S310, the behavioral intention and control object information of the user are recognized from the voice instruction.
In some embodiments, the voice-instruction processing device 140 recognizes the voice instruction using ASR (Automatic Speech Recognition) technology. For example, the voice instruction may be converted into text data, and the text data then segmented into words to obtain a corresponding text representation (note that the voice instruction may be represented in other ways as well; embodiments of the invention are not limited to a text representation). Typical ASR methods include, without limitation, methods based on vocal-tract models and speech knowledge, template matching, and so on. The processing device 140 then processes the text representation to understand the user's intent, finally obtaining a representation of that intent. In some embodiments, the processing device 140 may apply NLP (Natural Language Processing) methods to understand the user's voice instruction and recognize the user's behavioral intention, which usually corresponds to an actual operation such as turning on, turning off, or playing. Meanwhile, the processing device 140 may further determine other parameters of the user's intent, such as the control object information, which records information about the device the user wants to control, so that from this information it can determine which device 130 to turn on or off.
Furthermore, when recognizing with ASR techniques, the processing device 140 may also apply some preprocessing operations to the voice data, such as sampling, quantization, removal of segments that contain no speech content (e.g., silence), framing, and windowing. These are not expanded upon here.
It should be noted that the embodiments of the present invention do not limit which ASR or NLP algorithm is used to understand the user's intent from the voice instruction; any such algorithm, known now or devised in the future, may be combined with the embodiments to implement the method 300 of the present invention.
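By way of illustration only, the following minimal Python sketch shows the shape of this step: extracting a behavioral intention and control object information from an already-transcribed command. It is not the disclosed implementation; the keyword tables and the function name parse_command are hypothetical stand-ins for a trained ASR/NLP pipeline.

    # Toy intent/object extractor. All names here (INTENT_KEYWORDS,
    # OBJECT_KEYWORDS, parse_command) are hypothetical; a real system would
    # use a trained NLP model rather than keyword matching.
    from typing import Optional, Tuple

    INTENT_KEYWORDS = {"turn on": "turn_on", "turn off": "turn_off", "play": "play"}
    OBJECT_KEYWORDS = {"air conditioner", "light", "television", "curtain"}

    def parse_command(text: str) -> Tuple[Optional[str], Optional[str]]:
        """Return (behavioral_intention, control_object_info) from transcribed text."""
        lowered = text.lower()
        intention = next((v for k, v in INTENT_KEYWORDS.items() if k in lowered), None)
        control_object = next((o for o in OBJECT_KEYWORDS if o in lowered), None)
        return intention, control_object

    # Example: parse_command("Turn on the air conditioner") -> ("turn_on", "air conditioner")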
As described above, when the voice interaction device 110 has sufficient computing power, the voice interaction device 110 may also recognize the voice command of the user, and directly transmit the recognized behavior intention and the control object information of the user to the processing device 140 of the voice command. The embodiments of the present invention are not so limited.
In one embodiment according to the present invention, the user inputs the voice instruction "turn on the air conditioner", and after analysis the processing device 140 recognizes that the user's behavioral intention is "turn on" and the control object information is "air conditioner". If only one air conditioner is connected to the processing device 140, the processing device 140 may directly generate the corresponding control instruction for that air conditioner, indicating that it should be turned on. In a home scenario, however, there are typically multiple air conditioners (e.g., installed in the living room, dining room, bedroom, and study), in which case the processing device 140 needs to further determine which air conditioner the user wants to turn on. Therefore, in the subsequent step S320, the device to be controlled is determined based on the area where the user is located and the control object information.
According to one embodiment, when the control object information corresponds to more than one device 130, the device the user wants to control is determined from the user's position at that moment. Preferably, a device matching the control object information that lies within a certain range of the user's position is taken as the device to be controlled. According to the embodiment of the present invention, the user's position is determined from the monitoring image captured by the image capture device 120, and the devices around the user are thereby determined. Specifically, the method 300 further includes the following three steps.
1) A monitoring image is acquired, the monitoring image containing at least one device.
FIG. 4 shows a schematic diagram of a monitoring image according to one embodiment of the invention. As shown in Fig. 4, the monitoring image shows a living room and a dining room. The image capture device 120 is disposed to the left of the dining-room curtain, although it is not limited to that position. Taking a common home scenario as an example, the devices 130 in the living room and the dining room are: a living room lamp 401, a television 402, a living room air conditioner 403, a living room curtain 404, a dining room lamp 405, a dining room air conditioner 406, and a dining room curtain 407.
2) At least one region is generated in advance based on the acquired monitoring image.
According to one embodiment, the at least one region is generated in advance based on the monitoring image in combination with the indoor spatial distribution and the locations of the devices 130. For example, according to the indoor spatial distribution, the living-room portion of the monitoring image is treated as one region and the dining-room portion as another. As another example, the monitoring image is simply divided into left and right regions. Further, the image may be divided into multiple regions taking the positions of the devices into account.
According to yet another embodiment, the at least one region is generated in advance based on the monitoring image and a user-defined region distribution. The user can customize the regions according to his or her living habits; for example, the central area of the living room is defined as region 1, the central area of the dining room as region 2, and the remaining areas as region 3.
As shown in Fig. 4, the monitoring image is divided into six regions, labeled ROI1 through ROI6. It should be noted that a region may be rectangular, circular, or of any irregular shape; the shape, size, and number of the divided regions are not limited by the embodiments of the present invention.
3) At least one of the generated regions is associated with each device in the monitoring image.
Generally, if device A is in region R1, device A is associated with region R1. The association may also be set according to the user's preference: for example, when device B lies on the boundary between regions R1 and R2, the user may choose whether device B is associated with region R1 or with region R2. Preferably, one region is associated with each device. In some special cases, however, more than one region may be associated with a device; for example, when device C lies in both region R1 and region R2, both regions may be associated with device C.
Table 1 shows, by way of example, the association of each device in Fig. 4 with a region.
TABLE 1. Example association of devices with regions
(Table 1 appears only as an image in the original publication. It lists the devices of Fig. 4 against the regions ROI1 to ROI6; for example, region ROI1 is associated with the living room air conditioner 403 and the living room curtain 404.)
According to the embodiment of the present invention, when the system 100 contains several image capture devices 120, a corresponding set of regions may be generated for the monitoring image of each image capture device 120. This is not described in detail here.
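For illustration, regions and device-region associations of the kind described in steps 2) and 3) might be represented as in the sketch below. The rectangular Region class is an assumption made for brevity (the disclosure allows regions of any shape), and all identifiers are hypothetical.

    # Hypothetical data model for pre-generated regions and device association.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Region:
        name: str                  # e.g. "ROI1"
        x: int
        y: int
        w: int
        h: int                     # axis-aligned rectangle in image coordinates

        def contains(self, px: int, py: int) -> bool:
            return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    @dataclass
    class Device:
        device_id: str             # e.g. "living_room_air_conditioner_403"
        kind: str                  # e.g. "air conditioner"
        position: Tuple[int, int]  # device position calibrated in the image
        regions: List[str] = field(default_factory=list)

    def associate_devices(devices: List[Device], regions: List[Region]) -> None:
        # Associate each device with every region containing its position;
        # user-defined overrides (e.g. device B on a boundary) would apply here.
        for device in devices:
            device.regions = [r.name for r in regions if r.contains(*device.position)]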
After generating the areas and associating the areas for each device, according to one embodiment of the present invention, step S320 is implemented by the following three steps.
In the first step, the area where the user is located is determined.
According to one embodiment, a current monitoring image is obtained, the monitoring image including a user and at least one device. As described above, the "current monitoring image" may be a monitoring image at the moment when the voice instruction of the user is received, or may be a monitoring image within a short time before the voice instruction of the user is received, which is not limited in this embodiment of the present invention.
Then, the area where the user is located is determined from the monitoring image. In one embodiment, the human body (i.e., the user) is detected from the current monitoring image through human body detection, and the area in which the detected user is located is determined; in Fig. 4, the user is in region ROI1. The human body in the monitoring image may be detected with a conventional object recognition algorithm, with a deep-learning-based algorithm, or with a motion-detection-based algorithm; the embodiments of the present invention are not limited in this respect.
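Continuing the same illustrative sketch (and reusing the hypothetical Region class above), a detected person bounding box can be mapped to a region as follows. Taking the box's bottom-center as the user's ground position is an assumption, and the person detector itself is left abstract.

    # Sketch: map a detected person bounding box (x1, y1, x2, y2) to a region.
    from typing import List, Optional, Tuple

    def user_region(person_box: Tuple[int, int, int, int],
                    regions: List["Region"]) -> Optional["Region"]:
        x1, y1, x2, y2 = person_box
        cx, cy = (x1 + x2) // 2, y2  # bottom-center approximates where the user stands
        return next((r for r in regions if r.contains(cx, cy)), None)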
In the second step, the devices associated with the area where the user is located are determined.
In conjunction with the foregoing description, in the monitoring image shown in fig. 4, the devices associated with the region ROI1 are the living room air conditioner 403 and the living room curtain 404.
In the third step, the device to be controlled is determined from the determined devices based on the control object information.
Continuing the previous example, the voice instruction is "turn on the air conditioner" and the control object information is "air conditioner"; combined with the area where the user is located, the device to be controlled can be determined to be the living room air conditioner 403.
In other embodiments, more than one of the obtained devices may correspond to the control object information; in that case, based on the control object information, the device closest to the user is selected as the device to be controlled. For example, when the voice instruction is "turn on the light", the control object information is "light fixture"; if the devices associated with the area include several light fixtures, such as a desk lamp and a spotlight, the fixture closest to the user is selected as the device to be controlled. In this embodiment, the user's position in the monitoring image can be determined through human body detection, and the positions of the devices in the image can be calibrated in advance, so the device closest to the user can be determined from the position coordinates.
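Given calibrated device positions and the detected user position, the nearest-device rule reduces to a distance comparison; a minimal sketch, reusing the hypothetical Device class above:

    # Sketch of the nearest-device rule: among the candidate devices matching
    # the control object information, pick the one closest to the user.
    import math
    from typing import List, Tuple

    def nearest_device(candidates: List["Device"],
                       user_pos: Tuple[int, int]) -> "Device":
        return min(candidates, key=lambda d: math.dist(d.position, user_pos))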
In still other embodiments, when more than one of the acquired devices may correspond to the control object information, the device to be controlled can be determined in the manner described below.
When initiating a voice instruction, the user simultaneously points at the device to be controlled; this pointing gesture is taken as the predetermined gesture. Thus, in step S310 the voice interaction device 110 transmits the voice instruction to the processing device 140, which analyzes the user's behavioral intention and the control object information. In step S320 the processing device 140 first obtains the corresponding monitoring image from the image capture device 120, detects at least one human body through human body detection, and determines at least one region from the detected bodies. On this basis, the detected predetermined gesture of the user (i.e., the hand pointing toward the device to be controlled) is extracted; then, combining the control object information with the predetermined gesture, the direction in which the user's hand points is determined, and the device matching the control object information in that direction is selected from the determined region as the device to be controlled. According to the embodiments of the present invention, conventional image processing algorithms may be used to detect the predetermined gesture and determine its pointing direction, and the device may be determined from that direction (for example, by computing an approximate angle from the hand's direction and selecting the associated devices within that angular range). The predetermined gesture may also be set to another gesture according to the user's habits; the above is only an example, and the embodiments of the present invention do not limit the predetermined gesture.
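One way to realize the angle-range idea mentioned above is sketched below: compute the pointing direction from two hand keypoints and keep the candidate whose bearing from the hand best matches it. The two keypoints, the 20-degree tolerance, and the scoring are illustrative assumptions; the disclosure leaves the gesture-analysis algorithm open.

    # Sketch: pick the candidate device lying closest to the pointing direction.
    import math
    from typing import List, Optional, Tuple

    def pointed_device(candidates: List["Device"],
                       wrist: Tuple[int, int],
                       fingertip: Tuple[int, int],
                       max_deg: float = 20.0) -> Optional["Device"]:
        pointing = math.atan2(fingertip[1] - wrist[1], fingertip[0] - wrist[0])

        def angular_offset(device: "Device") -> float:
            bearing = math.atan2(device.position[1] - wrist[1],
                                 device.position[0] - wrist[0])
            # signed angle difference wrapped to [-pi, pi), then taken absolute
            diff = (bearing - pointing + math.pi) % (2 * math.pi) - math.pi
            return abs(math.degrees(diff))

        within = [d for d in candidates if angular_offset(d) <= max_deg]
        return min(within, key=angular_offset) if within else None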
Fig. 5 shows a schematic view of a monitoring image according to another embodiment of the invention. As shown in Fig. 5, the monitoring image shows a bedroom. The devices contained in the bedroom are: a bedroom central ceiling lamp 501, a bedroom lamp strip 502, a bedroom television 503, a bedroom air conditioner 504, a bedroom curtain 505, and a bedroom desk lamp 506. The image is divided into three regions, labeled ROI7, ROI8, and ROI9. The association of the regions with the devices 130 is shown in Table 2.
TABLE 2. Example association of devices with regions

  Region   Devices
  ROI7     Bedroom central ceiling lamp 501, bedroom curtain 505
  ROI8     Bedroom lamp strip 502, bedroom television 503, bedroom desk lamp 506
  ROI9     Bedroom air conditioner 504
In the monitoring image of Fig. 5, the user issues the voice instruction "turn on the light" while pointing toward the desk lamp 506. The processing device 140 first recognizes that the user's behavioral intention is "turn on" and the control object information is "light fixture". Then, by analyzing the monitoring image, it detects the user and identifies the user's region as ROI8, in which two devices match the control object information: the bedroom lamp strip 502 and the bedroom desk lamp 506. The user's gesture is therefore extracted and its pointing direction determined to be toward the desk lamp 506, so the device to be controlled is determined to be the bedroom desk lamp 506.
It should be noted that, besides the scenarios described above, the following may also occur: more than one user is detected in the monitoring image (meaning that more than one user's region may be determined). When the regions of several users are determined, the approaches described above can be combined to settle on the device to be controlled. For example, at least one device matching the control object information is determined across the several regions, the distance between each such device and its corresponding user (i.e., the user in the region associated with that device) is computed, and the device with the smallest distance is selected as the device to be controlled. As another example, each detected user is checked for the predetermined gesture, the region of the user exhibiting the gesture is taken as the final region, and the device associated with that region is selected as the device to be controlled.
Subsequently, in step S330, a control instruction for the device to be controlled is generated based on the behavioral intention. Taking the scenario of Fig. 4 as an example, with the voice instruction "turn on the air conditioner" and the device to be controlled determined to be the living room air conditioner 403, the control instruction generated by the processing device 140 may be "turn on the living room air conditioner 403", where "turn on" is the operation to be executed and "living room air conditioner 403" is the recipient of the instruction, i.e., the device to be controlled.
According to an embodiment of the present invention, the processing device 140 sends the generated control instruction to the device to be controlled, which performs the operation according to the control instruction. For example, the processing device 140 sends the control instruction to the living room air conditioner 403, which, upon receiving it, performs the turn-on operation in response to the user.
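A sketch of dispatching the generated instruction is given below. The HTTP endpoint and JSON payload shape are invented purely for illustration; the disclosure only requires that the processing device 140 and the device 130 share some communication module.

    # Sketch: send a control instruction to the device to be controlled.
    # The endpoint URL and payload format are hypothetical.
    import json
    import urllib.request

    def send_control_instruction(device: "Device", intention: str,
                                 endpoint: str = "http://device-gateway.local/control") -> bool:
        payload = json.dumps({"device_id": device.device_id,
                              "operation": intention}).encode("utf-8")
        request = urllib.request.Request(endpoint, data=payload,
                                         headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return response.status == 200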
In other implementations, the voice instruction input by the user may be even more concise. According to one embodiment, when a device's control state is simple (for example, only on and off), the voice instruction issued by the user may contain only the control object information. For example, the user need only say "light" or "television", and the processing device 140 infers the user's behavioral intention from the current state of the device 130.
In this case, in step S310 the processing device 140 identifies the control object information from the voice instruction. For example, the user says "light", and the processing device 140 recognizes the control object information "light". The subsequent steps follow the earlier description of steps S320 and S330: the device to be controlled is determined based on the area where the user is located and the control object information, and a control instruction for it is generated; this is not repeated here. It should be appreciated that a light fixture is typically either on or off, so the processing device 140 can determine the user's behavioral intention from the current state of the "light". For example, if the light is on, the user's intention is to turn it off, and the control instruction "turn off the light" is generated; if the light is off, the intention is to turn it on, and the control instruction "turn on the light" is generated.
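The inference described here amounts to toggling the device's current state; a one-function sketch, with the state strings assumed:

    # Sketch: infer the behavioral intention from the device's current state
    # when the voice instruction names only the object (e.g. just "light").
    def infer_intention(current_state: str) -> str:
        return "turn_off" if current_state == "on" else "turn_on"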
In still other scenarios, the user may have expressed an intention before issuing the voice instruction that contains the control object information. According to one embodiment, the user may express the intention in advance by voice, gesture, or other means, without limitation. For example, the user first says "the bedroom is so dark" and then issues the voice instruction "light". The processing device 140 recognizes the control object information "light" from the received instruction and, combining it with the previous utterance, determines that the user's behavioral intention is "turn on the light". The subsequent steps again follow steps S320 and S330: the device to be controlled (i.e., which lamp the user wants turned on) is determined based on the area where the user is located and the control object information, and a control instruction for it is generated; this is not repeated here.
According to the solution of the invention, devices are associated with regions in a monitoring image, and the device the user wants to control is determined automatically by analyzing the user's voice instruction together with the current monitoring image. In present-day scenarios where household devices (especially smart devices) are diverse and ever more numerous, a user who wants to control a device by voice no longer needs to append the device's location each time (e.g., "turn on the living room air conditioner", "turn on the master bedroom air conditioner", "turn on the study air conditioner"); simply saying to turn the device on or off suffices, which greatly improves the user experience.
FIG. 6 illustrates a schematic diagram of the voice-instruction processing device 140 according to some embodiments of the invention. As shown in Fig. 6, the processing device 140 includes a first processing unit 142, a second processing unit 144, and an instruction generating unit 146, coupled to one another.
The first processing unit 142 recognizes the user's behavioral intention and control object information from the voice instruction; the second processing unit 144 determines the device to be controlled based on the area where the user is located and the control object information; and the instruction generating unit 146 generates a control instruction for the device to be controlled based on the behavioral intention.
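Tying the three units together, the overall flow of the processing device 140 might look like the sketch below, which reuses the hypothetical helpers from the earlier sketches; it illustrates the unit decomposition and is not the disclosed implementation.

    # End-to-end sketch of the processing device 140:
    #   first processing unit 142  -> parse_command
    #   second processing unit 144 -> user_region, candidate filtering, nearest_device
    #   instruction generating unit 146 -> the (device, intention) pair
    from typing import List, Optional, Tuple

    def handle_voice_instruction(text: str,
                                 person_box: Tuple[int, int, int, int],
                                 regions: List["Region"],
                                 devices: List["Device"]) -> Optional[Tuple["Device", str]]:
        intention, control_object = parse_command(text)       # unit 142
        region = user_region(person_box, regions)             # unit 144
        if not (intention and control_object and region):
            return None
        candidates = [d for d in devices
                      if region.name in d.regions and d.kind == control_object]
        if not candidates:
            return None
        user_pos = ((person_box[0] + person_box[2]) // 2, person_box[3])
        target = nearest_device(candidates, user_pos)         # unit 144, continued
        return target, intention                              # unit 146 builds the instruction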
It should be appreciated that the detailed description of the processing device 140 can be found in relation to the description of the method 300, which is not expanded upon herein.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of the invention according to instructions in said program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. It should also be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As for the scope of the invention, the disclosure is illustrative rather than restrictive, the scope being defined by the appended claims.

Claims (16)

1. A method for processing a voice instruction, comprising the following steps:
identifying a behavioral intention of a user and control object information from the voice instruction;
determining a device to be controlled based on the region in which the user is located and the control object information; and
generating a control instruction for the device to be controlled based on the behavioral intention.
2. The method of claim 1, further comprising, after generating the control instruction for the device to be controlled, the following step:
sending the control instruction to the device to be controlled, so that the device to be controlled performs the operation specified in the control instruction.
3. The method of claim 1 or 2, further comprising the following steps:
acquiring a monitoring image, wherein the monitoring image comprises at least one device;
generating at least one region in advance based on the monitoring image; and
associating the at least one region with the respective devices.
4. The method of claim 1, wherein the determining of the device to be controlled based on the region in which the user is located and the control object information comprises:
determining the region in which the user is located;
determining the devices associated with the region in which the user is located; and
determining the device to be controlled from the determined devices based on the control object information.
5. The method of claim 4, wherein the step of determining the region in which the user is located comprises:
acquiring a current monitoring image, wherein the monitoring image comprises the user and at least one device; and
determining the region in which the user is located from the monitoring image.
6. The method of claim 5, wherein the step of determining the region in which the user is located from the monitoring image comprises:
detecting the user in the current monitoring image through human-body detection; and
determining the region in which the detected user is located.
7. The method of claim 6, wherein the determining of the device to be controlled from the determined devices based on the control object information further comprises:
selecting, based on the control object information, the device closest to the user from the determined devices as the device to be controlled.
8. The method of claim 6, wherein the determining of the device to be controlled from the determined devices based on the control object information further comprises:
extracting a predetermined gesture of the detected user; and
determining the device to be controlled from the determined devices based on both the control object information and the predetermined gesture.
9. The method of claim 3, wherein the generating of the at least one region in advance based on the monitoring image comprises:
generating the at least one region in advance based on the monitoring image in combination with the indoor spatial distribution and the locations of the devices.
10. The method of claim 3, wherein the generating of the at least one region in advance based on the monitoring image comprises:
generating the at least one region in advance based on the monitoring image and a region distribution customized by the user.
11. A method for processing a voice instruction, comprising the following steps:
identifying control object information from the voice instruction;
determining a device to be controlled based on the region in which a user is located and the control object information; and
generating a control instruction for the device to be controlled.
12. A method for processing a voice instruction, comprising the following steps:
receiving a voice instruction;
determining, based on the voice instruction and a monitoring image, a behavioral intention of a user and a device to be controlled in the monitoring image; and
generating a control instruction for the device to be controlled according to the determined behavioral intention.
13. A processing device for voice instructions, comprising:
a first processing unit adapted to identify a behavioral intention of a user and control object information from a voice instruction;
a second processing unit adapted to determine a device to be controlled based on the region in which the user is located and the control object information; and
an instruction generating unit adapted to generate a control instruction for the device to be controlled based on the behavioral intention.
14. A control system for voice instructions, comprising:
a voice interaction device adapted to receive a voice instruction from a user;
an image acquisition device adapted to acquire a monitoring image;
at least one device; and
the processing device according to claim 13, coupled to the voice interaction device, the image acquisition device, and the at least one device, respectively, and adapted to determine a behavioral intention of the user and a device to be controlled from the at least one device based on the voice instruction and the monitoring image, and to generate a control instruction for the device to be controlled, so that the device to be controlled performs the operation specified in the control instruction.
15. A computing device, comprising:
at least one processor; and
a memory storing program instructions configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any one of claims 1 to 12.
16. A readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the method of any one of claims 1 to 12.
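
For illustration only, the following is a minimal Python sketch of the flow in claims 1 and 2: recognize the user's behavioral intention and the control object information from a voice instruction, resolve the device to be controlled through the region in which the user is located, and issue a control instruction. The keyword-based recognizer, the region-to-device mapping, and all names are hypothetical simplifications, not the implementation disclosed in this specification.

    # Hypothetical mapping from each pre-generated region to its associated
    # devices (the association that claim 3 establishes).
    REGION_DEVICES = {
        "living_room": ["living_room_light", "living_room_ac"],
        "bedroom": ["bedroom_light"],
    }

    INTENTS = {"turn on": "power_on", "turn off": "power_off"}
    OBJECTS = {"light": "light", "air conditioner": "ac"}

    def recognize(voice_text):
        """Toy stand-in for speech recognition plus intent/object parsing."""
        intent = next((v for k, v in INTENTS.items() if k in voice_text), None)
        obj = next((v for k, v in OBJECTS.items() if k in voice_text), None)
        return intent, obj

    def resolve_device(user_region, control_object):
        """Pick the device in the user's region matching the control object."""
        candidates = REGION_DEVICES.get(user_region, [])
        return next((d for d in candidates if control_object in d), None)

    def handle_voice_instruction(voice_text, user_region):
        intent, obj = recognize(voice_text)
        if intent is None or obj is None:
            return None  # instruction not understood
        device = resolve_device(user_region, obj)
        if device is None:
            return None  # no matching device in the user's region
        # Claim 2: the control instruction is sent to the device for execution.
        return {"device": device, "action": intent}

    # A user standing in the living room says "turn on the light"; only the
    # living-room light is targeted, not the bedroom one.
    print(handle_voice_instruction("turn on the light", "living_room"))
    # -> {'device': 'living_room_light', 'action': 'power_on'}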
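
Claims 3, 5, and 6 pre-generate regions from the monitoring image and locate the user in it through human-body detection. The sketch below assumes rectangular regions in image coordinates and stubs out the detector; the region layout could equally come from the indoor spatial distribution and device locations (claim 9) or from a user-customized distribution (claim 10).

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Region:
        name: str
        x0: float
        y0: float
        x1: float
        y1: float

        def contains(self, x, y):
            return self.x0 <= x < self.x1 and self.y0 <= y < self.y1

    # Two hypothetical regions tiling a 1280x480 monitoring image.
    REGIONS = [
        Region("living_room", 0, 0, 640, 480),
        Region("bedroom", 640, 0, 1280, 480),
    ]

    def detect_user(image) -> Optional[Tuple[float, float]]:
        """Stub for human-body detection; returns the body-box center."""
        return (320.0, 240.0)  # a real system would run a person detector here

    def region_of_user(image) -> Optional[str]:
        center = detect_user(image)
        if center is None:
            return None  # no user visible in the current monitoring image
        return next((r.name for r in REGIONS if r.contains(*center)), None)

    print(region_of_user(image=None))  # -> living_room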
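
When the control object information matches several devices in the user's region, claims 7 and 8 resolve the ambiguity by proximity or by a predetermined gesture. The helpers below illustrate both strategies; the coordinates, the pointing-angle representation of the gesture, and the tolerance are assumptions of this sketch.

    import math

    def nearest_device(user_xy, candidates):
        """Claim 7: select the candidate device closest to the user."""
        return min(candidates, key=lambda dev: math.dist(user_xy, dev["xy"]))

    def device_by_gesture(user_xy, pointing_deg, candidates, tolerance_deg=20):
        """Claim 8: keep candidates roughly in the pointing direction."""
        def bearing(dev):
            dx = dev["xy"][0] - user_xy[0]
            dy = dev["xy"][1] - user_xy[1]
            return math.degrees(math.atan2(dy, dx))

        in_cone = [
            d for d in candidates
            if abs((bearing(d) - pointing_deg + 180) % 360 - 180) <= tolerance_deg
        ]
        # Fall back to proximity when the gesture singles out nothing.
        return nearest_device(user_xy, in_cone or candidates)

    lights = [
        {"name": "ceiling_light", "xy": (3.0, 4.0)},
        {"name": "desk_lamp", "xy": (1.0, 0.5)},
    ]
    print(nearest_device((0.0, 0.0), lights)["name"])           # -> desk_lamp
    print(device_by_gesture((0.0, 0.0), 53.0, lights)["name"])  # -> ceiling_light

Falling back to the nearest device when no candidate lies within the gesture cone is a design choice of this sketch, not something the claims require.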
CN201910492557.4A 2019-06-06 2019-06-06 Voice instruction processing method, device and control system Pending CN112053683A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910492557.4A CN112053683A (en) 2019-06-06 2019-06-06 Voice instruction processing method, device and control system
PCT/CN2020/094323 WO2020244573A1 (en) 2019-06-06 2020-06-04 Voice instruction processing method and device, and control system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910492557.4A CN112053683A (en) 2019-06-06 2019-06-06 Voice instruction processing method, device and control system

Publications (1)

Publication Number Publication Date
CN112053683A true CN112053683A (en) 2020-12-08

Family

ID=73609605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910492557.4A Pending CN112053683A (en) 2019-06-06 2019-06-06 Voice instruction processing method, device and control system

Country Status (2)

Country Link
CN (1) CN112053683A (en)
WO (1) WO2020244573A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086095A (en) * 2021-03-10 2022-09-20 Oppo广东移动通信有限公司 Equipment control method and related device
CN113851124A (en) * 2021-09-09 2021-12-28 青岛海尔空调器有限总公司 Method and apparatus for controlling home appliance, and storage medium
CN114244882B (en) * 2021-12-20 2022-11-11 珠海格力电器股份有限公司 Control method and device of intelligent equipment, terminal and storage medium
CN114882883B (en) * 2022-05-31 2023-07-25 合肥长虹美菱生活电器有限公司 Intelligent device control method, device and system
CN117219071B (en) * 2023-09-20 2024-03-15 北京惠朗时代科技有限公司 Voice interaction service system based on artificial intelligence

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045122A (en) * 2015-06-24 2015-11-11 张子兴 Intelligent household natural interaction system based on audios and videos
US10983753B2 (en) * 2017-06-09 2021-04-20 International Business Machines Corporation Cognitive and interactive sensor based smart home solution

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108369630A (en) * 2015-05-28 2018-08-03 视觉移动科技有限公司 Gestural control system and method for smart home
CN105206275A (en) * 2015-08-31 2015-12-30 小米科技有限责任公司 Device control method, apparatus and terminal
CN105785782A (en) * 2016-03-29 2016-07-20 北京小米移动软件有限公司 Intelligent household equipment control method and device
US9729821B1 (en) * 2016-03-31 2017-08-08 Amazon Technologies, Inc. Sensor fusion for location based device grouping
CN107490971A (en) * 2016-06-09 2017-12-19 苹果公司 Intelligent automation assistant in home environment
WO2017215986A1 (en) * 2016-06-13 2017-12-21 Koninklijke Philips N.V. System and method for capturing spatial and temporal relationships between physical content items
CN105957519A (en) * 2016-06-30 2016-09-21 广东美的制冷设备有限公司 Method and system for carrying out voice control in multiple regions simultaneously, server and microphone
CN107528753A (en) * 2017-08-16 2017-12-29 捷开通讯(深圳)有限公司 Intelligent home voice control method, smart machine and the device with store function
CN108154878A (en) * 2017-12-12 2018-06-12 北京小米移动软件有限公司 Control the method and device of monitoring device
CN108320742A (en) * 2018-01-31 2018-07-24 广东美的制冷设备有限公司 Voice interactive method, smart machine and storage medium
CN108398906A (en) * 2018-03-27 2018-08-14 百度在线网络技术(北京)有限公司 Apparatus control method, device, electric appliance, total control equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750437A (en) * 2021-01-04 2021-05-04 欧普照明股份有限公司 Control method, control device and electronic equipment
CN112860826A (en) * 2021-01-15 2021-05-28 北京小米移动软件有限公司 Function control method, function control device and storage medium
TWI780891B (en) * 2021-09-03 2022-10-11 劉千鳳 Wearable dynamic indicator system
CN113611305A (en) * 2021-09-06 2021-11-05 云知声(上海)智能科技有限公司 Voice control method, system, device and medium in autonomous learning home scene
CN113641110A (en) * 2021-10-14 2021-11-12 深圳传音控股股份有限公司 Processing method, processing device and readable storage medium
CN114171019A (en) * 2021-11-12 2022-03-11 杭州逗酷软件科技有限公司 Control method and device and storage medium
CN114363384A (en) * 2021-12-22 2022-04-15 珠海格力电器股份有限公司 Device pointing control method, device, system, electronic device and storage medium
CN114363384B (en) * 2021-12-22 2023-04-07 珠海格力电器股份有限公司 Device pointing control method, device, system, electronic device and storage medium
CN115061380A (en) * 2022-06-08 2022-09-16 深圳绿米联创科技有限公司 Device control method and device, electronic device and readable storage medium

Also Published As

Publication number Publication date
WO2020244573A1 (en) 2020-12-10

Similar Documents

Publication Publication Date Title
CN112053683A (en) Voice instruction processing method, device and control system
CN111312235B (en) Voice interaction method, device and system
KR102379954B1 (en) Image processing apparatus and method
KR102453603B1 (en) Electronic device and method for controlling thereof
US9953648B2 (en) Electronic device and method for controlling the same
CN107612968B (en) The method, equipment and system of its connected device are controlled by intelligent terminal
US11031008B2 (en) Terminal device and method for controlling thereof
JP2018190413A (en) Method and system for processing user command to adjust and provide operation of device and content provision range by grasping presentation method of user speech
TW201805744A (en) Control system and control processing method and apparatus capable of directly controlling a device according to the collected information with a simple operation
CN112051743A (en) Device control method, conflict processing method, corresponding devices and electronic device
WO2020168571A1 (en) Device control method, apparatus, system, electronic device and cloud server
CN109992237B (en) Intelligent voice equipment control method and device, computer equipment and storage medium
CN105045122A (en) Intelligent household natural interaction system based on audios and videos
CN112130918B (en) Intelligent device awakening method, device and system and intelligent device
CN105118257A (en) Intelligent control system and method
CN109240641B (en) Sound effect adjusting method and device, electronic equipment and storage medium
CN107871001B (en) Audio playing method and device, storage medium and electronic equipment
CN108469772B (en) Control method and device of intelligent equipment
CN110848897B (en) Intelligent air conditioner adjusting method and computer readable storage medium
WO2020119541A1 (en) Voice data identification method, apparatus and system
CN109286772B (en) Sound effect adjusting method and device, electronic equipment and storage medium
CN109032345B (en) Equipment control method, device, equipment, server and storage medium
CN111199729B (en) Voiceprint recognition method and voiceprint recognition device
CN111801650A (en) Electronic device and method of controlling external electronic device based on usage pattern information corresponding to user
CN113446717B (en) Smart page display method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination