CN109032039B - Voice control method and device - Google Patents
- Publication number: CN109032039B (application CN201811031798.0A)
- Authority: CN (China)
- Prior art keywords: user, image, target, controlled, determining
- Legal status: Active (an assumption, not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status)
Classifications
- G05B19/0423: Input/output (Physics; Controlling/Regulating; Programme-control systems, electric; programme control other than numerical control, using digital processors)
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue (Physics; Acoustics; Speech recognition)
- G10L15/26: Speech to text systems (Physics; Acoustics; Speech recognition)
- G05B2219/25257: Microcontroller (Physics; Controlling/Regulating; Program-control systems; PC structure of the system)
Abstract
The embodiments of the invention disclose a voice control method and apparatus, which are used to avoid the misoperation that arises when multiple devices respond simultaneously to a user's voice command. The method comprises the following steps: obtaining a user posture image, the user posture image being acquired at a first moment by at least one acquisition device located in a preset space; determining, from at least one controlled device in the preset space and according to the user posture image, the target controlled device that the user intends to control; and controlling the target controlled device to respond to the voice command input by the user at the first moment.
Description
Technical Field
The present invention relates to the field of terminal applications, and in particular to a voice control method and apparatus.
Background
Conventionally, each of multiple devices is controlled with its own remote controller; these remote controllers are rarely interchangeable and are cumbersome to operate. To achieve a simpler and more natural way of controlling devices, voice control has been proposed.
Currently, voice control is implemented by installing a camera or a voice pickup on the controlled device to perform visual recognition or speech recognition. In a real application environment, however, several devices supporting voice control may be present in the same space, each equipped with its own camera and speech software, so misoperation easily occurs during voice control.
Disclosure of Invention
In view of this, embodiments of the present invention provide a voice control method and apparatus whose main aim is to avoid the misoperation caused by multiple devices responding simultaneously to a user's voice command.
According to a first aspect of the embodiments of the present invention, there is provided a voice control method, including: obtaining a user posture image, the user posture image being acquired at a first moment by at least one acquisition device located in a preset space; determining, from at least one controlled device in the preset space and according to the user posture image, a target controlled device that the user intends to control; and controlling the target controlled device to respond to the voice command input by the user at the first moment.
In an embodiment of the present invention, determining, according to the user posture image, the target controlled device that the user intends to control from the at least one controlled device in the preset space includes: determining the user's body angle, face angle and/or line-of-sight angle according to the user posture image; and determining, according to those angles, the controlled device that the user is facing among the at least one controlled device as the target controlled device.
In an embodiment of the present invention, obtaining the user posture image includes: receiving at least one image from the at least one acquisition device; determining, from the at least one image, an image whose timestamp is the first moment; performing target detection on that image according to a preset target user model to determine an image containing a target user, the target user being a user of the at least one controlled device; and taking the determined image containing the target user as the user posture image.
In an embodiment of the present invention, controlling the target controlled device to respond to the voice command input by the user at the first moment includes: sending the target controlled device a control instruction, the control instruction instructing the target controlled device to respond to the voice command input by the user at the first moment.
In an embodiment of the present invention, obtaining the user posture image includes: obtaining the voice command input by the user at the first moment; identifying the user who input the voice command according to a preset user voiceprint model; and acquiring the user posture image when the user is identified as a legitimate user.
In an embodiment of the present invention, controlling the target controlled device to respond to the voice command input by the user at the first moment includes: performing speech recognition on the voice command; and executing the corresponding target operation in response to the voice command.
According to a second aspect of the embodiments of the present invention, there is provided a voice control apparatus, including: an obtaining unit, configured to obtain a user posture image acquired at a first moment by at least one acquisition device located in a preset space; a determining unit, configured to determine, from at least one controlled device in the preset space and according to the user posture image, a target controlled device that the user intends to control; and a control unit, configured to control the target controlled device to respond to the voice command input by the user at the first moment.
In an embodiment of the present invention, the determining unit is specifically configured to determine the user's body angle, face angle and/or line-of-sight angle according to the user posture image, and to determine, according to those angles, the controlled device that the user is facing among the at least one controlled device as the target controlled device.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device, including: at least one processor; and at least one memory connected to the processor via a bus, over which the processor and the memory communicate with each other; the processor is configured to call the program instructions in the memory to perform the voice control method according to one or more of the above technical solutions.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium storing computer instructions that cause a computer to execute the voice control method according to one or more of the above technical solutions.
With the above technical solutions, the voice control method and apparatus of the embodiments of the present invention first obtain a user posture image acquired at a first moment by at least one acquisition device located in a preset space, then determine, from at least one controlled device in the preset space and according to the user posture image, the target controlled device that the user intends to control, and finally control the target controlled device to respond to the voice command input by the user at the first moment. In this way, only the device the user is actually facing responds, and misoperation from multiple devices answering at once is avoided.
The foregoing is only an overview of the technical solutions of the present invention. To make the technical means of the present invention clearer, and the above and other objects, features, and advantages more readily understandable, embodiments of the present invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart illustrating an implementation of a voice control method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a default space in an embodiment of the invention;
FIG. 3 is a schematic diagram of an implementation process of determining a target controlled device in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the embodiments of the invention, several electronic devices, such as a smartphone, a smart watch, a tablet computer, a notebook computer, a smart air conditioner, a network camera, or a smart sound box, may be present in the same preset space, such as a living room, a bedroom, an office, or a vehicle cabin, and the user can control these electronic devices by inputting voice commands. However, because multiple electronic devices are present in the same space, it may happen that the user actually wants to control device A, yet device B responds to the user's voice command, resulting in a misoperation.
To solve this problem, in the embodiments of the present invention the electronic devices in the same preset space may be divided, according to whether they have an image acquisition function, into acquisition devices and controlled devices: an acquisition device is configured to acquire images in the preset space, while a controlled device has a voice acquisition function and is configured to respond to the user's voice commands. In practice, the same electronic device can be both an acquisition device and a controlled device, such as a smart television, a network camera, a smartphone, or a smart watch; equally, a device may be only a controlled device, such as a smart sound box or a smart air conditioner. The embodiments of the present invention are not specifically limited in this respect.
In practical applications, the electronic devices in the same preset space may communicate with each other directly or indirectly. For example, the devices may log in to a backend server with the same user account and communicate through that server; to improve communication efficiency, they may instead communicate through an intelligent gateway deployed in the preset space; to improve it further, they may communicate directly over wireless technologies such as Zigbee or Wi-Fi (Wireless Fidelity). Of course, other communication modes may also be adopted between the electronic devices, and the embodiments of the present invention are not specifically limited in this respect.
Further, an embodiment of the present invention provides a voice control method that may be applied to a voice control apparatus, where the voice control apparatus may reside in the above backend server, the intelligent gateway, an acquisition device, or a controlled device.
Fig. 1 is a schematic flow chart of an implementation of a voice control method in an embodiment of the present invention, and referring to fig. 1, the method includes:
S101: Obtaining a user posture image;
the user posture image is acquired by at least one acquisition device in a preset space at a first moment;
Here, when the user performs voice control in the preset space, the user speaks a voice command at the first moment, and the electronic devices with a voice acquisition function in the preset space, i.e. the controlled devices, receive it. If a controlled device is also an acquisition device, it captures an image of the preset space at the first moment itself. If the controlled device and the acquisition device are different devices, the controlled device sends an acquisition instruction to the acquisition device, which captures an image of the preset space in response. Because the user is in the preset space at the first moment, the captured image may contain an image of the user, that is, a user posture image. Finally, the acquisition device sends the captured user posture image to the voice control apparatus, which thereby obtains the user posture image.
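The acquisition flow just described can be sketched as follows. This is an illustrative Python sketch only: the `AcquisitionDevice` class and the function name are assumptions, not identifiers from the patent, and the capture step stands in for driving a real camera.

```python
# Illustrative sketch of step S101 (all names are assumptions): when a
# controlled device hears a voice command at the first moment, every
# camera-equipped device in the preset space captures an image stamped
# with that moment, whether it is the controlled device itself or a
# separate device acting on an acquisition instruction.

class AcquisitionDevice:
    def __init__(self, name, has_camera):
        self.name = name
        self.has_camera = has_camera

    def capture_image(self, timestamp):
        # Stand-in for a real camera capture; returns image metadata only.
        return {"source": self.name, "timestamp": timestamp}

def acquire_posture_images(devices, first_moment):
    """Collect one image per camera-equipped device at the first moment."""
    return [d.capture_image(first_moment) for d in devices if d.has_camera]
```

A device without a camera, such as a plain smart sound box, simply contributes no image.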
S102: Determining, from at least one controlled device in the preset space and according to the user posture image, the target controlled device that the user intends to control;
Here, after obtaining the user posture image, the voice control apparatus recognizes it to obtain the user's posture information, such as the user's body angle, face angle and/or line-of-sight angle. From this posture information, the direction the user faces at the first moment can be determined, and hence the controlled device the user is facing; that device is the target controlled device the user intends to control.
S103: Controlling the target controlled device to respond to the voice command input by the user at the first moment.
Here, after the target controlled device has been determined, the voice control apparatus controls it to respond to the voice command input by the user at the first moment. If the voice control apparatus resides in the backend server or the intelligent gateway, it sends a control instruction to the target controlled device, which executes the instruction and responds to the voice command it received from the user at the first moment. If the voice control apparatus resides in a device other than the target controlled device, it likewise sends such a control instruction to the target controlled device. If the voice control apparatus resides in the target controlled device itself, it directly controls that device to respond to the voice command it received from the user at the first moment.
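The case analysis above amounts to a single dispatch decision, sketched below. This is a hedged sketch; the function and field names are invented for illustration.

```python
# Where the voice control apparatus runs determines how the target
# controlled device is made to respond (all names are illustrative).

def dispatch_response(apparatus_host, target_device, first_moment):
    """Decide how the apparatus makes the target device respond."""
    if apparatus_host == target_device:
        # The apparatus runs on the target device itself: respond to the
        # voice command the device received at the first moment.
        return {"action": "respond_locally", "moment": first_moment}
    # The apparatus runs on a backend server, an intelligent gateway, or
    # another device: send a control instruction to the target device.
    return {
        "action": "send_control_instruction",
        "to": target_device,
        "moment": first_moment,
    }
```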
The method of voice control described above is explained below by way of specific examples.
For example, FIG. 2 is a schematic diagram of a preset space in an embodiment of the present invention. Referring to FIG. 2, a network camera 201, a smart television 202, and a smart sound box 203 are installed in the preset space 200; the network camera 201 and the smart television 202 are fitted with cameras, and all three devices are fitted with microphones.
Then, user 204 stands in the preset space 200 and inputs a voice command toward smart sound box 203, such as "How is the weather today?" Both the smart television 202 and the smart sound box 203 receive the command, and the smart television 202 and/or the smart sound box send an acquisition instruction to all acquisition devices. The network camera 201 and the smart television 202 then capture images with their respective cameras, and the network camera 201 sends its captured user posture image to the smart television 202. From its own image and the network camera's image, the smart television 202 recognizes the user's posture, obtains the user's posture information, and determines from it that the user is facing the smart sound box 203, thereby determining that the smart sound box is the target controlled device the user intends to control. The smart television 202 then controls the smart sound box 203 to respond to the command "How is the weather today?" input by the user at the first moment: the smart sound box 203 recognizes the command, obtains the reply "Beijing today: sunny, 35 °C to 25 °C", and converts the sentence into a voice signal for output.
Thus, the voice control process of the user is completed.
Based on the foregoing embodiment, fig. 3 is a schematic diagram of an implementation process of determining a target controlled device in the embodiment of the present invention, and as shown in fig. 3, the step S102 may specifically include:
S301: Determining the user's body angle, face angle and/or line-of-sight angle according to the user posture image;
S302: Determining, according to the user's body angle, face angle and/or line-of-sight angle, the controlled device that the user is facing among the at least one controlled device as the target controlled device.
In a specific implementation, the voice control apparatus may analyze the user posture image with an image analysis algorithm to determine whether the user is facing a particular acquisition device, and from that determine the target controlled device the user is actually facing.
Specifically, the user's body angle, face angle and/or line-of-sight angle are determined from the user posture image. A face, for example, can appear at various angles in an image (portrait terminology grades them in steps of 18 degrees): a full front face is symmetric in the picture; at an 18-degree side angle slight asymmetry appears; the side angle then grows through 36, 54 and 72 degrees; and at a 90-degree side angle only one eye is visible, i.e. a full profile. An image analysis algorithm can assess the left-right symmetry of the face in the image and thereby judge whether the face is essentially directly facing the device. If the user's body angle, face angle and/or line-of-sight angle are essentially directly facing a device, the user is considered to have control intent toward that device.
In other embodiments of the present invention, the angle between the shooting lens and the plane of the user's body, the plane of the face, and/or the plane perpendicular to the line of sight may be identified and taken as the user's orientation angle. When the orientation angle is less than or equal to a preset angle threshold, the user is considered to be essentially directly facing the device; the angle threshold may be set to 5 to 10 degrees, for example.
If any two of the body, face and line-of-sight orientation angles differ by at least a preset angle-difference threshold, the user is judged to have control intent toward the controlled device provided the line-of-sight orientation angle is less than or equal to the preset angle threshold. In other words, when the three angles formed with the shooting lens disagree, the line-of-sight angle is preferred as the user's orientation angle, because the line of sight most directly reflects a person's conscious intention; relying on it when the angle data disagree identifies the user's control intent most accurately.
Further, when the user's line-of-sight angle cannot be determined from the acquired posture image, for example because of lighting, the angle formed by the face and the shooting lens is used as the orientation angle, and the user is judged to have control intent when the face orientation angle is less than or equal to the preset angle threshold. If the face angle cannot be identified either, the angle formed by the body and the shooting lens is used instead. That is, the three kinds of angle data are prioritized in the order line-of-sight angle, face angle, body angle, with higher-priority data reflecting the user's control intent more accurately.
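The priority rule just described, line of sight over face over body, each checked against the angle threshold, can be sketched as below. This is a minimal illustration under stated assumptions: each argument is the deviation in degrees between that orientation and the shooting lens, `None` marks an angle that could not be recognised, and the 10-degree threshold is one point in the 5-to-10-degree range the text suggests.

```python
# Minimal sketch of the angle-priority fallback: use the highest-priority
# angle that could be recognised, and compare it against the threshold.
# All names and the exact threshold value are assumptions.

ANGLE_THRESHOLD = 10.0  # degrees; the text suggests 5 to 10 degrees

def has_control_intent(gaze=None, face=None, body=None):
    """True if the best available orientation angle faces the device."""
    for angle in (gaze, face, body):  # priority: gaze > face > body
        if angle is not None:
            return abs(angle) <= ANGLE_THRESHOLD
    return False  # no usable angle data at all
```

Note that a recognisable gaze angle always wins: `has_control_intent(gaze=30, face=2)` is false even though the face is nearly head-on.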
Of course, the voice control apparatus may also determine the target controlled device from the user posture image by other methods, which the embodiments of the present invention do not limit.
In an embodiment of the present invention, to reduce the amount of image analysis performed by the voice control apparatus, S101 may include: receiving at least one image from the at least one acquisition device; determining, from the at least one image, an image whose timestamp is the first moment; performing target detection on that image according to a preset target user model to determine an image containing a target user, the target user being a user of the at least one controlled device; and taking the determined image containing the target user as the user posture image.
Specifically, the voice control apparatus receives at least one image from at least one acquisition device, selects from them the images whose timestamp is the first moment, and performs target detection on those images according to a preset target user model to find an image containing a target user, where the target user may be a preset user authorized to use at least one controlled device. The found image containing the target user is then taken as the user posture image.
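The two-stage selection, first by timestamp and then by target detection, can be sketched as below; the `contains_target_user` callback stands in for the patent's preset target user model, and all names are assumptions.

```python
# Sketch of the image-selection part of S101: keep only images taken at
# the first moment, then only those in which a registered target user is
# detected. The detection callback is a placeholder for a real model.

def select_posture_images(images, first_moment, contains_target_user):
    stamped = [img for img in images if img["timestamp"] == first_moment]
    return [img for img in stamped if contains_target_user(img)]
```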
Alternatively, in other embodiments of the present invention, if the controlled device and the acquisition device are the same device, S101 may include: obtaining the voice command input by the user at the first moment; identifying the user who input the voice command according to a preset user voiceprint model; and acquiring the user posture image when the user is identified as a legitimate user.
Here, the acquisition device may identify the user before capturing the user posture image. That is, after receiving the voice command input by the user at the first moment, the acquisition device performs voiceprint recognition on it according to a preset user voiceprint model to determine the identity of the speaker. If the speaker is a legitimate user, the device proceeds to capture the user posture image; if not, it stops responding. In this way the controlled device responds only to the voice commands of specific users, avoiding misoperation by other users.
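The voiceprint gate can be sketched as below. This is an illustrative sketch: the membership test stands in for real voiceprint matching against the preset model, and the names are assumptions.

```python
# Sketch of the voiceprint gate: only a legitimate (registered) speaker
# triggers posture-image acquisition; anyone else gets no response.

def handle_voice_command(command, registered_voiceprints, capture_image):
    """Return a posture image for legitimate users, None otherwise."""
    speaker = command.get("speaker_features")
    if speaker not in registered_voiceprints:
        return None          # illegitimate user: do not respond
    return capture_image()   # legitimate user: acquire the posture image
```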
Further, in the above case, i.e. when the controlled device and the acquisition device are the same device, S103 may include: performing speech recognition on the voice command and executing the corresponding target operation in response.
That is, the controlled device performs speech recognition on the voice command it received from the user, then responds to the recognized command by executing the target operation corresponding to it.
With the above technical solutions, the voice control method and apparatus of the embodiments of the present invention first obtain a user posture image acquired at a first moment by at least one acquisition device located in a preset space, then determine, from at least one controlled device in the preset space and according to the user posture image, the target controlled device that the user intends to control, and finally control the target controlled device to respond to the voice command input by the user at the first moment. In this way, only the device the user is actually facing responds, and misoperation from multiple devices answering at once is avoided.
Based on the same inventive concept, embodiments of the present invention provide a voice control apparatus, which is consistent with the voice control apparatus described in one or more of the above embodiments.
Fig. 4 is a schematic structural diagram of a voice control apparatus according to an embodiment of the present invention, and referring to fig. 4, the voice control apparatus 400 includes: an obtaining unit 401, configured to obtain a user posture image, where the user posture image is acquired by at least one acquiring device located in a preset space at a first time; a determining unit 402, configured to determine, according to the user gesture image, a target controlled device that is intended to be controlled by the user from at least one controlled device in the preset space; and a control unit 403, configured to control the target controlled device to respond to the voice instruction input by the user at the first time.
In an embodiment of the present invention, the determining unit is specifically configured to determine the user's body angle, face angle and/or line-of-sight angle according to the user posture image, and to determine, according to those angles, the controlled device that the user is facing among the at least one controlled device as the target controlled device.
In an embodiment of the present invention, the obtaining unit is specifically configured to receive at least one image from the at least one acquisition device; determine, from the at least one image, an image whose timestamp is the first moment; perform target detection on that image according to a preset target user model to determine an image containing a target user, the target user being a user of the at least one controlled device; and take the determined image containing the target user as the user posture image.
In this embodiment of the present invention, the control unit is specifically configured to send a control instruction to the target controlled device, where the control instruction is used to instruct the target controlled device to respond to a voice instruction input by a user at a first time.
In an embodiment of the present invention, when the voice control apparatus is deployed on a target controlled device, the obtaining unit is further configured to obtain the voice command input by the user at the first moment, identify the user who input the voice command according to a preset user voiceprint model, and acquire the user posture image when the user is identified as a legitimate user.
Further, the control unit is specifically configured to perform voice recognition on the voice instruction and, in response to the voice instruction, execute the corresponding target operation.
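When the apparatus runs on the target device itself, the two paragraphs above describe a gate-then-execute flow: verify the speaker against an enrolled voiceprint before acquiring the posture image and executing the command. The sketch below assumes this flow; the similarity function is a toy stand-in for a real voiceprint model, and all names are illustrative.

```python
# Hedged sketch: voiceprint gate before command execution on the device.
def handle_voice(command_audio, enrolled_print, similarity, threshold=0.8):
    score = similarity(command_audio, enrolled_print)
    if score < threshold:
        return "rejected: unauthorized speaker"
    # ...here the device would acquire the posture image and confirm that
    # it is itself the target controlled device before proceeding...
    return f"executing: {command_audio['text']}"

sim = lambda a, b: 1.0 if a["speaker"] == b else 0.0  # toy voiceprint match
print(handle_voice({"speaker": "alice", "text": "volume up"}, "alice", sim))
print(handle_voice({"speaker": "eve", "text": "volume up"}, "alice", sim))
```

Gating on the voiceprint before image acquisition means unauthorized speech never triggers the (more expensive) image analysis at all.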
It should be noted that the above description of the apparatus embodiments is similar to the description of the method embodiments, and the apparatus embodiments have similar advantages. For technical details not disclosed in the apparatus embodiments of the present invention, refer to the description of the method embodiments of the present invention.
Based on the same inventive concept, embodiments of the present invention provide an electronic device, which is the same as the electronic device described in one or more embodiments above.
Fig. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention. Referring to Fig. 5, the electronic device 500 includes: at least one processor 501; and at least one memory 502 connected to the processor 501 via a bus 503. The processor 501 and the memory 502 communicate with each other through the bus 503, and the processor 501 is configured to call program instructions in the memory 502 to perform the voice control method steps described in one or more of the embodiments above.
It should be noted that the above description of the electronic device embodiments is similar to the description of the apparatus embodiments, and the electronic device embodiments have similar advantages. For technical details not disclosed in the electronic device embodiments of the present invention, refer to the description of the apparatus embodiments of the present invention.
Based on the same inventive concept, embodiments of the present invention provide a computer-readable storage medium storing computer instructions for causing a computer to perform the method steps of voice control described in one or more of the above embodiments.
As can be seen from the above description, in the embodiments of the present invention, the target controlled device that the user intends to control is determined from among multiple controlled devices according to the user posture image, and only that device is controlled to respond to the user's voice instruction. This avoids the misoperation that occurs when multiple devices respond to the same voice instruction simultaneously.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (7)
1. A voice control method, characterized in that the method is applied to a voice control apparatus, the voice control apparatus is deployed on at least one acquisition device or at least one controlled device, and the at least one acquisition device and the at least one controlled device log in to a background server using the same user account and communicate through the background server; the method comprises the following steps:
obtaining a user posture image, wherein the user posture image is acquired by at least one acquisition device positioned in a preset space at a first moment;
determining target controlled equipment which is intended to be controlled by the user from at least one controlled equipment in the preset space according to the user posture image, wherein the at least one acquisition equipment and the at least one controlled equipment are the same equipment;
controlling the target controlled equipment to respond to a voice instruction input by the user at the first moment;
the obtaining of the user gesture image comprises:
receiving at least one image from the at least one acquisition device;
determining an image with a timestamp of the first time from the at least one image;
according to a preset target user model, performing target detection on the image whose timestamp is the first time, and determining an image containing a target user, wherein the target user is a user having use authority over the at least one controlled device;
and determining the determined image containing the target user as the user posture image.
2. The method according to claim 1, wherein the determining, according to the user gesture image, a target controlled device which is intended to be controlled by the user from at least one controlled device in the preset space comprises:
determining a body angle of the user, a face angle of the user and/or a sight line angle of the user according to the user posture image;
and determining a target controlled device facing the user in the at least one controlled device as the target controlled device according to the body angle of the user, the face angle of the user and/or the sight line angle of the user.
3. The method of claim 1, wherein the controlling the target controlled device in response to the voice command input by the user at the first time comprises:
and sending a control instruction to the target controlled device, wherein the control instruction is used for indicating the target controlled device to respond to the voice instruction input by the user at the first moment.
4. A voice control device is characterized in that the voice control device is applied to at least one acquisition device or at least one controlled device, the at least one acquisition device and the at least one controlled device log in a background server by adopting the same user account, and communication is carried out through the background server; the device comprises:
an obtaining unit, configured to obtain a user posture image, where the user posture image is acquired at a first time by at least one acquisition device located in a preset space;
the determining unit is used for determining target controlled equipment which is intended to be controlled by the user from at least one controlled equipment in the preset space according to the user posture image, and the at least one collecting equipment and the at least one controlled equipment are the same equipment;
the control unit is used for controlling the target controlled equipment to respond to the voice instruction input by the user at the first moment;
the obtaining unit is specifically configured to receive at least one image from at least one acquisition device; determining an image with a time stamp of a first moment from at least one image; according to a preset target user model, performing target detection on the image with the time stamp as the first moment, and determining the image containing a target user, wherein the target user is a user having the use authority on at least one controlled device; and determining the determined image containing the target user as a user posture image.
5. The apparatus according to claim 4, wherein the determining unit is specifically configured to determine, from the user gesture image, a body angle of the user, a face angle of the user, and/or a gaze angle of the user; and determining a target controlled device facing the user in the at least one controlled device as the target controlled device according to the body angle of the user, the face angle of the user and/or the sight line angle of the user.
6. An electronic device, comprising:
at least one processor; at least one memory; a bus;
wherein,
the processor and the memory complete mutual communication through the bus;
the processor is configured to invoke program instructions in the memory to perform the voice-controlled method of any of claims 1 to 3.
7. A computer-readable storage medium storing computer instructions for causing a computer to perform the voice-controlled method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811031798.0A CN109032039B (en) | 2018-09-05 | 2018-09-05 | Voice control method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811031798.0A CN109032039B (en) | 2018-09-05 | 2018-09-05 | Voice control method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109032039A CN109032039A (en) | 2018-12-18 |
CN109032039B true CN109032039B (en) | 2021-05-11 |
Family
ID=64623565
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811031798.0A Active CN109032039B (en) | 2018-09-05 | 2018-09-05 | Voice control method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109032039B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047494B (en) * | 2019-04-15 | 2022-06-03 | 北京小米智能科技有限公司 | Device response method, device and storage medium |
CN110136714A (en) * | 2019-05-14 | 2019-08-16 | 北京探境科技有限公司 | Natural interaction sound control method and device |
CN112083795A (en) * | 2019-06-12 | 2020-12-15 | 北京迈格威科技有限公司 | Object control method and device, storage medium and electronic equipment |
CN112207812B (en) * | 2019-07-12 | 2024-07-16 | 阿里巴巴集团控股有限公司 | Device control method, device, system and storage medium |
CN110556115A (en) * | 2019-09-10 | 2019-12-10 | 深圳创维-Rgb电子有限公司 | IOT equipment control method based on multiple control terminals, control terminal and storage medium |
CN110730115B (en) * | 2019-09-11 | 2021-11-09 | 北京小米移动软件有限公司 | Voice control method and device, terminal and storage medium |
CN112908321A (en) * | 2020-12-02 | 2021-06-04 | 青岛海尔科技有限公司 | Device control method, device, storage medium, and electronic apparatus |
CN112417712A (en) * | 2021-01-21 | 2021-02-26 | 深圳市友杰智新科技有限公司 | Target device determination method and device, computer device and storage medium |
CN115086095A (en) * | 2021-03-10 | 2022-09-20 | Oppo广东移动通信有限公司 | Equipment control method and related device |
CN115086096A (en) * | 2021-03-15 | 2022-09-20 | Oppo广东移动通信有限公司 | Method, apparatus, device and storage medium for responding control voice |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103730120A (en) * | 2013-12-27 | 2014-04-16 | 深圳市亚略特生物识别科技有限公司 | Voice control method and system for electronic device |
CN105785782A (en) * | 2016-03-29 | 2016-07-20 | 北京小米移动软件有限公司 | Intelligent household equipment control method and device |
CN107728482A (en) * | 2016-08-11 | 2018-02-23 | 阿里巴巴集团控股有限公司 | Control system, control process method and device |
CN108398906A (en) * | 2018-03-27 | 2018-08-14 | 百度在线网络技术(北京)有限公司 | Apparatus control method, device, electric appliance, total control equipment and storage medium |
CN108490832A (en) * | 2018-03-27 | 2018-09-04 | 百度在线网络技术(北京)有限公司 | Method and apparatus for sending information |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160162039A1 (en) * | 2013-07-21 | 2016-06-09 | Pointgrab Ltd. | Method and system for touchless activation of a device |
CN108089695B (en) * | 2016-11-23 | 2021-05-18 | 纳恩博(北京)科技有限公司 | Method and device for controlling movable equipment |
- 2018-09-05 CN CN201811031798.0A patent/CN109032039B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103730120A (en) * | 2013-12-27 | 2014-04-16 | 深圳市亚略特生物识别科技有限公司 | Voice control method and system for electronic device |
CN105785782A (en) * | 2016-03-29 | 2016-07-20 | 北京小米移动软件有限公司 | Intelligent household equipment control method and device |
CN107728482A (en) * | 2016-08-11 | 2018-02-23 | 阿里巴巴集团控股有限公司 | Control system, control process method and device |
CN108398906A (en) * | 2018-03-27 | 2018-08-14 | 百度在线网络技术(北京)有限公司 | Apparatus control method, device, electric appliance, total control equipment and storage medium |
CN108490832A (en) * | 2018-03-27 | 2018-09-04 | 百度在线网络技术(北京)有限公司 | Method and apparatus for sending information |
Also Published As
Publication number | Publication date |
---|---|
CN109032039A (en) | 2018-12-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109032039B (en) | Voice control method and device | |
US10453457B2 (en) | Method for performing voice control on device with microphone array, and device thereof | |
EP3872689A1 (en) | Liveness detection method and device, electronic apparatus, storage medium and related system using the liveness detection method | |
US20230343071A1 (en) | Liveness detection | |
CN105956518A (en) | Face identification method, device and system | |
CN111062234A (en) | Monitoring method, intelligent terminal and computer readable storage medium | |
US10038834B2 (en) | Video call method and device | |
CN105338238B (en) | A kind of photographic method and electronic equipment | |
CN109981964B (en) | Robot-based shooting method and shooting device and robot | |
CN110705356B (en) | Function control method and related equipment | |
CN104424073A (en) | Information processing method and electronic equipment | |
CN109961781B (en) | Robot-based voice information receiving method and system and terminal equipment | |
CN103685906A (en) | Control method, control device and control equipment | |
TW202009761A (en) | Identification method and apparatus and computer-readable storage medium | |
CN109388238A (en) | The control method and device of a kind of electronic equipment | |
US20140176430A1 (en) | Intelligent switching system, electronic device thereof, and method thereof | |
CN107491101A (en) | A kind of adjusting method, device and the electronic equipment of microphone array pickup angle | |
CN107404721A (en) | Internet of things equipment matches somebody with somebody network method, image-pickup method and equipment | |
WO2023098287A1 (en) | Message pushing method and apparatus, storage medium and electronic apparatus | |
CN115937478B (en) | Calibration information determining method and device, electronic equipment and storage medium | |
CN114760417B (en) | Image shooting method and device, electronic equipment and storage medium | |
CN111881740A (en) | Face recognition method, face recognition device, electronic equipment and medium | |
WO2018121730A1 (en) | Video monitoring and facial recognition method, device and system | |
CN112533070B (en) | Video sound and picture adjusting method, terminal and computer readable storage medium | |
EP2888716B1 (en) | Target object angle determination using multiple cameras |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2021-04-02
Address after: 210038, 8th floor, Building D11, Hongfeng Science and Technology Park, Nanjing Economic and Technological Development Zone, Jiangsu Province
Applicant after: New Technology Co.,Ltd.
Address before: 100080, Room 1602, 16th floor, Suzhou Street, Haidian District
Applicant before: Beijing Yufanzhi Information Technology Co.,Ltd.
GR01 | Patent grant | ||
GR01 | Patent grant |