WO2022141225A1

WO2022141225A1 - Methods, apparatus, and systems for operating device based on speech command

Info

Publication number: WO2022141225A1
Application number: PCT/CN2020/141518
Authority: WO
Inventors: Junfeng Wu; Shicheng Zhou; Yunfeng Bian
Original assignee: SZ DJI Technology Co., Ltd.
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-07
Also published as: CN116710889A

Abstract

Method, apparatus, and non-transitory computer-readable medium for operating a device are provided, the method including receiving a speech command associated with operating the device. The method also includes determining an operation mode in which the device currently operates. The operation mode is associated with a speaker's authorization to control at least one function of the device. The method further includes causing the device to operate in accordance with the determined operation mode.

Description

[Title established by the ISA under Rule 37.2] METHODS, APPARATUS, AND SYSTEMS FOR OPERATING DEVICE BASED ON SPEECH COMMAND

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

TECHNICAL FIELD

The present disclosure relates generally to operation of devices and, more particularly, to methods, systems, and apparatus for operating devices based on sensory data, such as speech recognition.

BACKGROUND

Operation of a device or a system including multiple devices may be accessible to multiple users. For example, devices like camera, movable objects, gimbal, smart wearing device, assistant Robert have variety scenarios manipulating by users. Movable objects, such as unmanned aerial vehicles ( “UAVs” ) , sometimes also referred to as “drones, ” include pilotless aircraft of various sizes and configurations that can be remotely operated by a user and/or programmed for automated flight. UAVs can be equipped with one or more sensors (e.g., cameras, radar, audio sensors, etc. ) to gather information for various purposes including, but not limited to, recreation, surveillance, sports, aerial photography, navigation, positioning, and user interactions. Recent technological developments provide improved user experience in user interaction with the UAV, but may also present additional challenges, such as receiving false information or unauthorized command, and causing safety and security concerns.

Therefore, there exists a need for a system, an apparatus, and a method for operating a device based on sensory data with improved user interaction, improved user experience, and enhanced safety and security.

SUMMARY

Consistent with embodiments of the present disclosure, a method is provided for operating a device. The method includes receiving a speech command associated with operating the device. The method also includes determining an operation mode in which the device currently operates. The operation mode is associated with a speaker’s authorization to control at least one function of the device. The method further includes causing the device to operate in accordance with the determined operation mode.

There is also provided an apparatus configured to operate a device. The apparatus includes one or more processors, and memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including receiving a speech command associated with operating the device. The apparatus is also caused to perform operations including determining an operation mode in which the device currently operates. The operation mode is associated with a speaker’s authorization to control at least one function of the device. The apparatus is also caused to perform operations including causing the device to operate in accordance with the determined operation mode.

There is further provided a non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising receiving a speech command associated with operating the device. The operations further include determining an operation mode in which the device currently operates. The operation mode is associated with a speaker’s authorization to control at least one function of the device. The operations further include causing the device to operate in accordance with the determined operation mode.

There is also provided a method for operating a device, the method including determining whether the device is in a first or a second operation mode associated with a speaker’s authorization to operate the device. The first operation mode permits control of at least one function associated with the device only by an authorized user. The second operation mode permits control of any function associated with the device by any user. The method further includes causing the device to operate in accordance with the determined first or second operation mode. Upon determining the device is in the first operation mode, the method includes identifying the authorized user; and operating the device in accordance with a first instruction spoken by the identified authorized user. Upon determining that the device is in the second operation mode, the method includes receiving a second instruction; and operating the device in accordance with the received second instruction.

There is also provided an apparatus configured to operate a device. The apparatus includes one or more processors, and memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including determining whether the device is in a first or a second operation mode associated with a speaker’s authorization to operate the device. The first operation mode permits control of at least one function associated with the device only by an authorized user. The second operation mode permits control of any function associated with the device by any user. The apparatus is also caused to perform operations including causing the device to operate in accordance with the determined first or second operation mode. Upon determining the device is in the first operation mode, the apparatus is caused to perform operations including identifying the authorized user; and operating the device in accordance with a first instruction spoken by the identified authorized user. Upon determining that the device is in the second operation mode, the apparatus is caused to perform operations including receiving a second instruction; and operating the device in accordance with the received second instruction.

There is further provided a non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising determining whether the device is in a first or a second operation mode associated with a speaker’s authorization to operate the device. The first operation mode permits control of at least one function associated with the device only by an authorized user. The second operation mode permits control of any function associated with the device by any user. The operations further include causing the device to operate in accordance with the determined first or second operation mode. Upon determining the device is in the first operation mode, the operations include identifying the authorized user; and operating the device in accordance with a first instruction spoken by the identified authorized user. Upon determining that the device is in the second operation mode, the operations include receiving a second instruction; and operating the device in accordance with the received second instruction.

There is also provided a method for switching between specific and non-specific speech recognition, the method including receiving a speech command associated with a first person; receiving auxiliary information associated with a second person; determining whether the first and second person are the same person based on the received speech and auxiliary information; and deciding whether to accept the speech command based on the determination whether the first and second person are the same person.

There is also provided an apparatus configured to switch between specific and non-specific speech recognition. The apparatus includes one or more processors, and memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including receiving a speech command associated with a first person; receiving auxiliary information associated with a second person; determining whether the first and second person are the same person based on the received speech and auxiliary information; and deciding whether to accept the speech command based on the determination whether the first and second person are the same person.

There is further provided a non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising receiving a speech command associated with a first person; receiving auxiliary information associated with a second person; determining whether the first and second person are the same person based on the received speech and auxiliary information; and deciding whether to accept the speech command based on the determination whether the first and second person are the same person.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed. Other features and advantages of the present invention will become apparent by a review of the specification, claims, and appended figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example environment for operating a device, such as a movable object, in accordance with embodiments of the present disclosure.

FIG. 2 shows an example block diagram of an apparatus configured in accordance with embodiments of the present disclosure.

FIG. 3 shows a flow diagram of example processes of training and using speech recognition models for processing audio signals to operate a device in accordance with embodiments of the present disclosure.

FIG. 4 shows a flow diagram of an example process of performing speaker recognition in accordance with embodiments of the present disclosure.

FIG. 5 shows a flow diagram of an example process of operating a device based on speech commands in accordance with embodiments of the present disclosure.

FIGs. 6A-6B show examples of controlling a device via speech commands alone or in combination with image recognition based on one or more images captured by an image sensor of the device in accordance with embodiments of the present disclosure.

FIGs. 7A-7B show examples of controlling a device via speech commands and image recognition based on one or more images captured by an image sensor of the device in accordance with embodiments of the present disclosure.

FIG. 8 shows a flow diagram of an example process of operating a device based on a speech command in accordance with embodiments of the present disclosure.

FIG. 9 shows a flow diagram of an example process of operating a device in accordance with embodiments of the present disclosure.

DETAILED DESCRIPTION

The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers refer to the same or similar parts. While several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components illustrated in the drawings. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.

Operation of a device at the presence of multiple people, or operation of a system including multiple devices, may provide access to multiple people, including authorized or intended users as well as unauthorized or unintended users. In some embodiments, voice interaction can be used for human-computer interaction. Non-specific person speech recognition (e.g., accepting speech commands from users without considering whether a user is permitted to give a command) can be used to operate a device, such as a UAV, gimbal, camera, smart wearing device, assistant Robert. For example, as disclosed herein, the device may include or be communicatively coupled with hardware that can be controlled by anyone using speech commands. However, the use of non-specific person speech recognition may be disturbed by multiple intentional voice commands (e.g., detecting speech commands from multiple people who may or may not have permission to control) or unintentional voice commands (e.g., people chatting in the background, etc. ) . For example, when an owner of a device operates the device with speech commands, someone else may give a voice command or speak in the background, including command words such as “stop. ” The device may stop in response to detecting the “stop” command, which may interrupt the owner′scontrol process, and may thereby cause safety issues.

As discussed herein, devices that are equipped with or communicatively coupled to hardware and software with speech recognition functions may additionally or alternatively be controlled using specific person speech recognition (e.g., an operation mode in which only speech commands from user (s) who have permission to operate can be accepted) , which may be safe to use as the device recognizes only the voice commands (may be also referred to as speech commands, voice instructions, commands, instructions) of a user with permission to operate, such as the owner. However, such a specific person control mode may not be preferred because it exclusively and unnecessarily limits operation of the device to the owner (s) . For example, during a group activity, it may be desirable for some non-critical functions or group functions associated with the device to be controllable by all participants (e.g., in a non-specific person control mode) .

In consideration of the different operation modes discussed above, the present disclosure provides methods, apparatus, and systems for operating a device based on speech recognition, and can further switch between different operation modes, such as between a specific person recognition mode and a non-specific person recognition mode. As disclosed herein, the systems and methods may take into account the possibility and convenience of having different users operate the device. Meanwhile, the present disclosure also provides an efficient and effective way to control and manage operation authority and thus enables improved safety when the device is operated under some scenarios.

Consistent with embodiments of the present disclosure, there are provided methods, apparatus, and systems for operating a device in accordance with sensory data, such as audio signals that may be detected by an audio sensor system onboard the device. The audio signals may include speech command (s) . The audio signals may be detected and collected by one or more sensors onboard the device. The collected audio signals may be analyzed to identify the speech command (s) associated with operating the device. The speech command (s) may also be analyzed to identify the speaker (e.g., also referred to as user or operator herein) of the speech command (s) . The methods and systems as discussed herein may also determine whether the identified speaker is authorized (e.g., owns the device, is pre-registered to operate the device, has been given the authority to operate the device by the owner, etc. ) to operate the device or at least one or more functions associated with components of the device (e.g., camera functions, motion functions, etc. ) . Operating the device based on speech recognition can provide improved user experience. Monitoring and managing (e.g., switching between, automatically controlling, etc. ) operation modes associated with speaker authority of giving speech commands to operate the device can also improve safety and security and avoid false operation of the device. For example, the method, apparatus, and system disclosed herein can recognize voice commands (e.g., speech commands) sent by any person when operating in a non-specific person recognition mode, thereby providing convenience and improved user experience. Upon switching to a specific person recognition mode, only the speech command (s) from an authorized user, such as an owner, can be recognized, thereby improving the safety and security of operation the device.

FIG. 1 shows an example environment 100 for operating a device, provided as an unmanned aerial vehicle ( “UAV” ) 102, in accordance with embodiments of the present disclosure. It is appreciated that UAV is provided as an example for illustration purpose throughout the disclosure and figures, and not intended to be limiting. Any other suitable devices, movable objects, and/or systems can be used and are included within the scope of the present disclosure.

In some embodiments, environment 100 includes UAV 102 that is capable of communicatively connecting to one or more electronic devices including a remote control 130 (also referred to herein as a terminal 130) , a mobile device 140, and a server 110 (e.g., cloud-based server) via a network 120 in order to exchange information with one another and/or other additional devices and systems. In some embodiments, network 120 may be any combination of wired and wireless local area network (LAN) and/or wide area network (WAN) , such as an intranet, an extranet, and the internet. In some embodiments, network 120 is capable of providing communications between one or more electronic devices as discussed in the present disclosure. In some embodiments, UAV 102 is capable of transmitting data (e.g., image data, audio data, and/or motion data) detected by one or more sensors onboard UAV 102 (e.g., an image sensor 107, an audio sensor 174, and/or inertial measurement unit (IMU) sensors included in a sensing system 172) in real-time during movement of UAV 102 and via network 120 to remote control 130, mobile device 140, and/or server 110 that are configured to process the data. For example, audio sensor 174 onboard UAV 102 may detect audio data containing speech commands spoken by one or more people in the surrounding environment. The detected audio data may be processed by UAV 102. The detected audio data may also be transmitted from UAV 102 in real-time to remote control 130, mobile device 140, and/or server 110 for processing. In some embodiments, operation instructions for controlling UAV 102 can be generated in accordance with the speech commands contained in the detected audio data. In some embodiments, audio data containing speech commands from the environment may also be detected by device (s) other than UAV 102, such as audio sensor (s) of remote control 130 or mobile device 140 (e.g., which may be closer to the speaker (s) of the speech commands) . The detected audio data may be processed by the receiving device (e.g., remote control 130 or mobile device 140) , or transmitted to a different device for processing. For example, the audio data may be detected by mobile device 140, and transmitted to related modules onboard UAV 102 for processing. In some embodiments, the processed data and/or operation instructions can be communicated in real-time with each other among UAV 102, remote control 130, mobile device 140, and/or cloud-based server 110 via network 120. For example, operation instructions (e.g., generated based on speech commands) can be transmitted from remote control 130, mobile device 140, and/or cloud-based server 110 to UAV 102 in real-time to control the flight of UAV 102 and components thereof. In some embodiments, any suitable communication techniques can be implemented by network 120, such as local area network (LAN) , wide area network (WAN) (e.g., the Internet) , cloud environment, telecommunications network (e.g., 3 G, 4G, 5G) , WiFi, Bluetooth, radiofrequency (RF) , infrared (IR) , or any other communications technique.

While environment 100 is configured for operating a movable object provided as UAV 102, the movable object could instead be provided as any other suitable object, device, mechanism, system, or machine configured to travel on or within a suitable medium (e.g., surface, air, water, rails, space, underground, etc. ) . The movable object may also be other types of movable object (e.g., wheeled objects, nautical objects, locomotive objects, other aerial objects, etc. ) . For illustrative purpose, in the present disclosure, UAV 102 refers to an aerial device configured to be operated and/or controlled automatically or autonomously based on commands detected by one or more sensors (e.g., image sensor 107, an audio sensor 174, an ultrasonic sensor, and/or a motion sensor of sensing system 172, etc. ) onboard UAV 102 or via an electronic control system (e.g., with pre-programed instructions for controlling UAV 102) . Alternatively or additionally, UAV 102 may be configured to be operated and/or controlled manually by an off-board operator (e.g., via remote control 130 or mobile device 140 as shown in FIG. 1) .

UAV 102 includes one or more propulsion devices 104 and may be configured to carry a payload 108 (e.g., an image sensor) . Payload 108 may be connected or attached to UAV 102 by a carrier 106, which may allow for one or more degrees of relative movement between payload 108 and UAV 102. Payload 108 may also be mounted directly to UAV 102 without carrier 106. In some embodiments, UAV 102 may also include sensing system 172, a communication system 178, and an onboard controller 176 in communication with the other components.

UAV 102 may include one or more (e.g., 1, 2, 3, 3, 4, 5, 10, 15, 20, etc. ) propulsion devices 104 positioned at various locations (for example, top, sides, front, rear, and/or bottom of UAV 102) for propelling and steering UAV 102. Propulsion devices 104 are devices or systems operable to generate forces for sustaining controlled flight. Propulsion devices 104 may share or may each separately include or be operatively connected to a power source, such as a motor (e.g., an electric motor, hydraulic motor, pneumatic motor, etc. ) , an engine (e.g., an internal combustion engine, a turbine engine, etc. ) , a battery bank, etc., or a combination thereof. Each propulsion device 104 may also include one or more rotary components drivably connected to a power source (not shown) and configured to participate in the generation of forces for sustaining controlled flight. For instance, rotary components may include rotors, propellers, blades, nozzles, etc., which may be driven on or by a shaft, axle, wheel, hydraulic system, pneumatic system, or other component or system configured to transfer power from the power source. Propulsion devices 104 and/or rotary components may be adjustable (e.g., tiltable) with respect to each other and/or with respect to UAV 102. Alternatively, propulsion devices 104 and rotary components may have a fixed orientation with respect to each other and/or UAV 102. In some embodiments, each propulsion device 104 may be of the same type. In other embodiments, propulsion devices 104 may be of multiple different types. In some embodiments, all propulsion devices 104 may be controlled in concert (e.g., all at the same speed and/or angle) . In other embodiments, one or more propulsion devices may be independently controlled with respect to, e.g., speed and/or angle.

Propulsion devices 104 may be configured to propel UAV 102 in one or more vertical and horizontal directions and to allow UAV 102 to rotate about one or more axes. That is, propulsion devices 104 may be configured to provide lift and/or thrust for creating and maintaining translational and rotational movements of UAV 102. For instance, propulsion devices 104 may be configured to enable UAV 102 to achieve and maintain desired altitudes, provide thrust for movement in all directions, and provide for steering of UAV 102. In some embodiments, propulsion devices 104 may enable UAV 102 to perform vertical takeoffs and landings (i.e., takeoff and landing without horizontal thrust) . Propulsion devices 104 may be configured to enable movement of UAV 102 along and/or about multiple axes.

In some embodiments, payload 108 includes one or more sensory devices. The sensory devices may include devices for collecting or generating data or information, such as surveying, tracking, operation command, and capturing images or video of targets (e.g., objects, landscapes, subjects of photo or video shoots, etc. ) . The sensory device may include image sensor 107 configured to gather data that may be used to generate images. As disclosed herein, image data obtained from image sensor 107 may be processed and analyzed to obtain commands and instructions from one or more users to operate UAV 102 and/or image sensor 107. In some embodiments, image sensor 107 may include photographic cameras, video cameras, infrared imaging devices, ultraviolet imaging devices, x-ray devices, ultrasonic imaging devices, radar devices, etc. The sensory devices may also include devices, such as audio sensor 174, for capturing audio data (e.g., including speech data 152 as shown in FIG. 1) , such as microphones or ultrasound detectors. Audio sensor 174 may be included or integrated in image sensor 107. Audio sensor 174 may also be held by payload 108, but separate and independent from image sensor 107. The sensory devices may also or alternatively include other suitable sensors for capturing visual, audio, and/or electromagnetic signals.

Carrier 106 may include one or more devices configured to hold payload 108 and/or allow payload 108 to be adjusted (e.g., rotated) with respect to UAV 102. For example, carrier 106 may be a gimbal. Carrier 106 may be configured to allow payload 108 to be rotated about one or more axes, as described below. In some embodiments, carrier 106 may be configured to allow payload 108 to rotate about each axis by 360° to allow for greater control of the perspective of payload 108. In other embodiments, carrier 106 may limit the range of rotation of payload 108 to less than 360° (e.g., ≤ 270°, ≤ 210°, ≤ 180, ≤ 120°, ≤ 90°, ≤ 45°, ≤ 30°, ≤ 15°, etc. ) about one or more of its axes.

Carrier 106 may include a frame assembly, one or more actuator members, and one or more carrier sensors. The frame assembly may be configured to couple payload 108 to UAV 102 and, in some embodiments, to allow payload 108 to move with respect to UAV 102. In some embodiments, the frame assembly may include one or more sub-frames or components movable with respect to each other. The actuator members (not shown) are configured to drive components of the frame assembly relative to each other to provide translational and/or rotational motion of payload 108 with respect to UAV 102. In other embodiments, actuator members may be configured to directly act on payload 108 to cause motion of payload 108 with respect to the frame assembly and UAV 102. Actuator members may be or may include suitable actuators and/or force transmission components. For example, actuator members may include electric motors configured to provide linear and/or rotational motion to components of the frame assembly and/or payload 108 in conjunction with axles, shafts, rails, belts, chains, gears, and/or other components.

The carrier sensors (not shown) may include devices configured to measure, sense, detect, or determine state information of carrier 106 and/or payload 108. State information may include positional information (e.g., relative location, orientation, attitude, linear displacement, angular displacement, etc. ) , velocity information (e.g., linear velocity, angular velocity, etc. ) , acceleration information (e.g., linear acceleration, angular acceleration, etc. ) , and/or other information relating to movement control of carrier 106 or payload 108, either independently or with respect to UAV 102. The carrier sensors may include one or more types of suitable sensors, such as potentiometers, optical sensors, vision sensors, magnetic sensors, motion or rotation sensors (e.g., gyroscopes, accelerometers, inertial sensors, etc. ) . The carrier sensors may be associated with or attached to various components of carrier 106, such as components of the frame assembly or the actuator members, or to UAV 102. The carrier sensors may be configured to communicate data and information with onboard controller 176 of UAV 102 via a wired or wireless connection (e.g., RFID, Bluetooth, Wi-Fi, radio, cellular, etc. ) . Data and information generated by the carrier sensors and communicated to onboard controller 176 may be used by onboard controller 176 for further processing, such as for determining state information of UAV 102 and/or targets.

Carrier 106 may be coupled to UAV 102 via one or more damping elements (not shown) configured to reduce or eliminate undesired shock or other force transmissions to payload 108 from UAV 102. The damping elements may be active, passive, or hybrid (i.e., having active and passive characteristics) . The damping elements may be formed of any suitable material or combinations of materials, including solids, liquids, and gases. Compressible or deformable materials, such as rubber, springs, gels, foams, and/or other materials may be used as the damping elements. The damping elements may function to isolate payload 108 from UAV 102 and/or dissipate force propagations from UAV 102 to payload 108. The damping elements may also include mechanisms or devices configured to provide damping effects, such as pistons, springs, hydraulics, pneumatics, dashpots, shock absorbers, and/or other devices or combinations thereof.

Sensing system 172 of UAV 102 may include one or more onboard sensors (not shown) associated with one or more components or other systems. For instance, sensing system 172 may include sensors for determining positional information, velocity information, and acceleration information relating to UAV 102 and/or targets. In some embodiments, sensing system 172 may also include the above-described carrier sensors. Components of sensing system 172 may be configured to generate data and information for use (e.g., processed by the onboard controller or another device) in determining additional information about UAV 102, its components, and/or its targets. Sensing system 172 may include one or more sensors for sensing one or more aspects of movement of UAV 102. For example, sensing system 172 may include sensory devices associated with payload 108 as discussed above and/or additional sensory devices, such as a positioning sensor for a positioning system (e.g., GPS, GLONASS, Galileo, Beidou, GAGAN, RTK, etc. ) , motion sensors, inertial sensors (e.g., IMU sensors, MIMU sensors, etc. ) , proximity sensors, imaging device 107, etc. Sensing system 172 may also include sensors configured to provide data or information relating to the surrounding environment, such as weather information (e.g., temperature, pressure, humidity, etc. ) , lighting conditions (e.g., light-source frequencies) , air constituents, or nearby obstacles (e.g., objects, structures, people, other vehicles, etc. ) .

Communication system 178 of UAV 102 may be configured to enable communication of data, information, commands, and/or other types of signals between the onboard controller and one or more off-board devices, such as remote control 130, mobile device 140 (e.g., a mobile phone) , server 110 (e.g., a cloud-based server) , or another suitable entity. Communication system 178 may include one or more onboard components configured to send and/or receive signals, such as receivers, transmitter, or transceivers, that are configured for one-way or two-way communication. The onboard components of communication system 178 may be configured to communicate with off-board devices via one or more communication networks, such as radio, cellular, Bluetooth, Wi-Fi, RFID, and/or other types of communication networks usable to transmit signals indicative of data, information, commands, and/or other signals. For example, communication system 178 may be configured to enable communication with off-board devices for providing input for controlling UAV 102 during flight, such as remote control 130 and/or mobile device 140.

Onboard controller 176 of UAV 102 may be configured to communicate with various devices onboard UAV 102, such as communication system 178 and sensing system 172. Controller 176 may also communicate with a positioning system (e.g., a global navigation satellite system, or GNSS) to receive data indicating the location of UAV 102. Onboard controller 176 may communicate with various other types of devices, including a barometer, an inertial measurement unit (IMU) , a transponder, or the like, to obtain positioning information and velocity information of UAV 102. Onboard controller 176 may also provide control signals (e.g., in the form of pulsing or pulse width modulation signals) to one or more electronic speed controllers (ESCs) , which may be configured to control one or more of propulsion devices 104. Onboard controller 176 may thus control the movement of UAV 102 by controlling one or more electronic speed controllers. As disclosed herein, onboard controller 176 may further include circuits and modules configured to process speech recognition, image recognition, speaker identification, and/or other functions discussed herein.

The one or more off-board devices, such as remote control 130 and/or mobile device 140, may be configured to receive input, such as input from a user (e.g., user manual input, user speech input, user gestures captured by image sensor 107 and/or audio sensor 174 onboard UAV 102) , and communicate signals indicative of the input to controller 176. Based on the input from the user, the off-board device (s) may be configured to generate corresponding signals indicative of one or more types of information, such as control data (e.g., signals) for moving or manipulating UAV 102 (e.g., via propulsion devices 104) , payload 108, and/or carrier 106. The off-board device (s) may also be configured to receive data and information from UAV 102, such as data collected by or associated with payload 108 and operational data relating to, for example, positional data, velocity data, acceleration data, sensory data, and other data and information relating to UAV 102, its components, and/or its surrounding environment. As disclosed herein, the off-board device (s) may include remote control 130 with physical sticks, levers, switches, wearable apparatus, touchable display, and/or buttons configured to control flight parameters, and a display device configured to display image information captured by image sensor 107. The off-board device (s) may also include mobile device 140 including a display screen or a touch screen, such as a smartphone or a tablet, with virtual controls for the same purposes, and may employ an application on a smartphone or a tablet, or a combination thereof. Further, the off-board device (s) may include server system 110 communicatively coupled to a network 120 for communicating information with remote control 130, mobile device 140, and/or UAV 102. Server system 110 may be configured to perform one or more functionalities or sub- functionalities in addition to or in combination with remote control 130 and/or mobile device 140. The off-board device (s) may include one or more communication devices, such as antennas or other devices configured to send and/or receive signals. The off-board device (s) may also include one or more input devices configured to receive input (e.g., audio data containing speech commands, user input on a touch screen, etc. ) from a user, and generate an input signal communicable to onboard controller 176 of UAV 102 for processing to operate UAV 102. The off-board device (s) can also process the speech commands in the audio data locally to generate operation instructions, and then transmit the generated operation instructions to UAV 102 for controlling UAV 102. In addition to flight control inputs, the off-board device may be used to receive user inputs of other information, such as manual control settings, automated control settings, control assistance settings, and/or aerial photography settings. It is understood that different combinations or layouts of input devices for an off-board device are possible and within the scope of this disclosure.

The off-board device (s) may also include a display device configured to display information, such as signals indicative of information or data relating to movements of UAV 102 and/or data (e.g., imaging data) captured by UAV 102 (e.g., in conjunction with payload 106) . In some embodiments, the display device may be a multifunctional display device configured to display information as well as receive user input. In some embodiments, one of the off-board devices may include an interactive graphical interface (GUI) for receiving one or more user inputs. In some embodiments, the off-board device (s) , e.g., mobile device 140, may be configured to work in conjunction with a computer application (e.g., an “app” ) to provide an interactive interface on the display device or multifunctional screen of any suitable electronic device (e.g., a cellular phone, a tablet, etc. ) for displaying information received from UAV 102 and for receiving user inputs.

In some embodiments, the display device of remote control 130 or mobile device 140 may display one or more images received from UAV 102 (e.g., captured by image sensor 107 onboard UAV 102) . In some embodiments, UAV 102 may also include a display device configured to display images captured by image sensor 107. The display device on remote control 130, mobile device 140, and/or onboard UAV 102, may also include interactive means, e.g., a touchscreen, for the user to identify or select a portion of the image of interest to the user. In some embodiments, the display device may be an integral component, e.g., attached or fixed, to the corresponding device. In other embodiments, display device may be electronically connectable to (and dis-connectable from) the corresponding device (e.g., via a connection port or a wireless communication link) and/or otherwise connectable to the corresponding device via a mounting device, such as by a clamping, clipping, clasping, hooking, adhering, or other type of mounting device. In some embodiments, the display device may be a display component of an electronic device, such as remote control 130, mobile device 140 (e.g., a cellular phone, a tablet, or a personal digital assistant) , server system 110, a laptop computer, or other device.

In some embodiments, one or more electronic devices (e.g., UAV 102, server 110, remote control 130, or mobile device 140) as discussed with reference to FIG. 1 may have a memory and at least one processor and can be used to process image data obtained from one or more images captured by image sensor 107 onboard UAV 102 to identify a body indication of an operator, including one or more stationary bodily pose, attitude, or position identified in one image, or body movements determined based on a plurality of images. In some embodiments, the memory and the processor (s) of the multiple electronic devices as discussed herein may work independently or collaboratively with each other to process audio data (e.g., speech data 152) detected by audio sensor 174 onboard UAV 102, using speech recognition and/or speaker identification as discussed herein. In some embodiments, the memory and the processor (s) of the electronic device (s) are also configured to determine operation instructions corresponding to the recognized speech command from one or more operators according to the operation mode to control UAV 102 and/or image sensor 107. The electronic device (s) are further configured to transmit (e.g., substantially in real time with the flight of UAV 102) the determined operation instructions to related controlling and propelling components of UAV 102 and/or carrier 106, audio sensor 174, and/or image sensor 107 for corresponding control and operations.

FIG. 2 shows an example block diagram of an apparatus 200 configured in accordance with embodiments of the present disclosure. In some embodiments, apparatus 200 can be any one of the electronic devices as discussed in FIG. 1, such as UAV 102, remote control 130, mobile device 140, or server 110. Apparatus 200 includes one or more processors 202 for executing modules, programs and/or instructions stored in a memory 212 and thereby performing predefined operations, one or more network or other communications interfaces 208, and one or more communication buses 210 for interconnecting these components. Apparatus 200 may also include a user interface 203 comprising one or more input devices 204 (e.g., a keyboard, mouse, touchscreen) and one or more output devices 206 (e.g., a display or speaker) .

Processors 202 may be any suitable hardware processor, such as an image processor, an image processing engine, an image-processing chip, a graphics-processor (GPU) , a microprocessor, a micro-controller, a central processing unit (CPU) , a network processor (NP) , a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field- programmable gate array (FPGA) , or another programmable logic device, discrete gate or transistor logic device, discrete hardware component.

Memory 212 may include high-speed random access memory, such as DRAM, SRAM, or other random access solid state memory devices. In some implementations, memory 212 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In some implementations, memory 212 includes one or more storage devices remotely located from processor (s) 202. Memory 212, or alternatively one or more storage devices (e.g., one or more nonvolatile storage devices) within memory 212, includes a non-transitory computer readable storage medium. In some implementations, memory 212 or the computer readable storage medium of memory 212 stores one or more computer program instructions (e.g., modules 220) , and a database 240, or a subset thereof that are configured to perform one or more steps of processes as discussed below with reference to FIGs. 3, 4, 5, 8, and 9. Memory 212 may also store audio signal or speech data obtained by audio sensor 174 and/or images captured by image sensor 107, for processing by processor 202, operations instructions for controlling UAV 102, audio sensor 174, image sensor 107, and/or the like.

In some embodiments, memory 212 of apparatus 200 may include an operating system 214 that includes procedures for handling various basic system services and for performing hardware dependent tasks. Apparatus 200 may further include a network communication module 216 that is used for connecting apparatus 200 to other electronic devices via communication network interface 208 and one or more communication networks 120 (wired or wireless) , such as the Internet, other wide area networks, local area networks, metropolitan area networks, etc. as discussed with reference to FIG. 1.

In some embodiments, modules 220 include an image obtaining and processing module 222 configured to receive and process image data captured by image sensor 107 onboard UAV 102. For example, image obtaining and processing module 222 can be configured to perform facial recognition, gesture detection, human detection, or other suitable functions based on the image data captured by image sensor 107. In some embodiments, modules 220 include an audio obtaining and processing module 224 configured to receive and process audio data detected by audio sensor 174 onboard UAV 102. For example, audio obtaining and processing module 224 can be configured to receive and pre-processing the audio data. In some embodiments, modules 220 may be included in other device (s) communicatively coupled to UAV 102, such as remote control 130, mobile device 140, and/or server 110. As such, audio obtaining and processing module 224 on the corresponding device may receive and process audio data detected by audio sensor 174 onboard UAV 102. Audio data may also be detected by remote control 130 or mobile device 140. Accordingly, audio obtaining and processing module 224 on remote control 130 or mobile device 140 can obtain and process the detected audio data. On the other hand, audio obtaining and processing module 224 onboard UAV 102 can also obtain the audio data detected by remote control 130 or mobile device 140 (e.g., via network 120) for processing. In some embodiments, modules 220 further include a speech recognition module 225 configured to apply speech recognition models and algorithms to the audio data to obtain speech information, such as speech command for operating UAV 102. In some embodiments, modules 220 also include a speaker recognition module 226 configured to apply speaker recognition models and algorithms to the audio data to identify speaker (s) who spoke the audio data. In some embodiments, modules 220 further include an authorized user verification module 228 configured to verify whether an identified user, e.g., the identified speaker (s) who spoke audio data detected by audio sensor 174, or speaker (s) identified based on facial recognition or gesture recognition, are authorized to operate UAV 102. In some embodiments, modules 220 include an operation mode control module 230 configured to control various operation modes associated with operating UAV 102, including but not limited to, a first operation mode permitting control of at least one function associated with UAV 102 only by an authorized operator, and a second operation mode permitting control of any function associated with UAV 102 by any person. Operation mode control module 230 may be configured to determine an operation mode under which UAV 102 currently operates. Operation mode control module 230 may be further configured to initiate a certain operation mode or switch between multiple operation modes in accordance with determining that one or more predetermined criteria are satisfied. In some embodiments, modules 220 also include an operation instruction generation module 232 configured to generate instructions for controlling one or more functions associated with operating UAV 102.

In some embodiments, database 240 stores speech recognition model (s) 242 including instructions for applying speech recognition algorithms to the audio data detected by audio sensor 174 onboard UAV 102, or audio sensor (s) of remote control 130 or mobile device 140 to obtain speech information including speech command for operating UAV 102. In some embodiments, database 240 further stores speaker recognition model (s) 244 including instructions for applying speaker recognition algorithms to the audio data to identify speaker (s) who spoke the audio data including speech command to control UAV 102. In some embodiments, database 240 stores facial recognition model 246 including instructions for applying facial recognition algorithms or templates to image data for recognizing user identities based on facial features. In some embodiments, database 240 stores gesture recognition model (s) 248 including instructions for applying gesture recognition algorithms or templates to body gesture or motion data detected by image sensor 107 for recognizing user body gestures or motions. In some embodiments, database 240 also stores authorized user data 250 including information associated with one or more users who are authorized to control one or more functions associated with UAV 102. For example, authorized user data 250 may include user account information, user activity data, user preference settings, and/or user biometric authentication information used for user authentications, such as audio fingerprint features for speaker recognition and facial features for facial recognition.

Details associated with modules 220 and database 240 are further described with reference to example processes shown in FIGs. 3, 4, 5, 8, and 9 of the present disclosure. It is appreciated that modules 220 and/or database 240 are not limited to the scope of the example processes discussed herein. Modules 220 may further be configured to perform other suitable functions, and database 240 may store information needed to perform such other suitable functions.

FIG. 3 shows a flow diagram of an example process 300 of using speech recognition models for processing audio signals to operate a device, e.g., UAV 102, or a system including one or more devices, in accordance with embodiments of the present disclosure. FIG. 3 further includes a process 320 for training speech recognition model (s) 242 that can be used in process 300. For purposes of explanation and without limitation, process 300 may be performed by one or more modules 220, such as audio obtaining and processing module 224, speech recognition module 225, and operation instruction generation module 232. Process 300 may be performed based on models or data stored in database 240, such as speech recognition model (s) 242. One or more steps of process 300 may be performed by hardware and software executing in a device or a system, such as UAV 102, remote control 130, mobile device 140, server 110, or combinations thereof.

In step 302, audio signals are obtained for processing, e.g., obtained by audio obtaining and processing module 224 of apparatus 200 shown in FIG. 2. In some embodiments, the audio signals can be detected by one or more sensors, such as audio sensor 174 onboard UAV 102 as shown in FIG. 1. Audio sensor 174 may detect audio signals within an ambient environment, for example voice 152 of one or more people 150, as shown in FIG. 1. Audio sensor 174 may also detect audio signals originated from other sources, such as dogs barking, vehicle moving, etc. In some embodiments, when audio obtaining and processing module 224 is onboard UAV 102, the detected audio signals may be transmitted from audio sensor 174 to audio obtaining and processing module 224 for processing to obtain audio data. In some embodiments, the detected audio signals may also be transmitted from audio sensor 174 on UAV 102 to audio obtaining and processing module 224 in remote control 130, movable device 140, or server 110 via network 120 or other suitable communication technique as discussed in the present disclosure. In some embodiments, the audio signals may be detected by the off-board device (s) as disclosed herein, such as remote control 130 or mobile device 140. The detected audio signals may be processed locally by audio obtaining and processing module 224 of remote control 130 or mobile 140, or transmitted to audio obtaining and processing module 224 onboard UAV 102 for processing to obtain the audio data.

In some embodiments, the audio signals may be encoded at different sampling rates (e.g., samples per second, such as 8, 16, 32, 44.1, 48, or 96 kHz) , and different bits per sample (e.g., 8-bits, 16-bits, 24-bits or 32-bits per sample) to obtain the audio data. In some embodiments, audio obtaining and processing module 224 may pre-process the detected audio signals using any suitable signal processing technique to obtain the signal data. For example, the obtained audio signals may be pre-processed into frames (e.g., fragments, segments) at a certain time duration (e.g., 25 ms per frame, or 10 ms per frame) . In some embodiments, the obtained audio signals may be pre-processed in accordance with characteristics of the speech recognition models, such as the sampling rate or bits per sample may be pre-processed to use the sampling rate and/or bits per sample of the training data for training the speech recognition models. In some embodiments, a voice activity detection algorithm may be used to extract audio or speech fragments from the real-time audio data stream of the audio signals obtained from UAV 102. In some embodiments, the obtained audio signals may be pre-processed to exclude audio data with low quality or too short (e.g., insufficient signal-to-noise ratio (SNR) for effectively performing speech recognition) , or with high likelihood of including irrelevant audio information (e.g., ambient noise, background noise, traffic noise, etc. )

In step 304, apparatus 200, e.g., audio obtaining and processing module 224, may extract audio features from the obtained audio data. In some embodiments, the audio data from each frame processed in step 302 may be transformed by applying a conventional Mel-frequency cepstrum (MFC) method. Coefficients from this transformation, e.g., Mel-frequency cepstral coefficients (MFCC) s, and/or other features can be used as an input to the speech recognition models, including an acoustic model and a language model, as discussed below. In some embodiments, other audio features, such as linear predictive coding (LPC) features, filter-bank features, or bottleneck features may be extracted from the audio data.

In step 306, apparatus 200, e.g., speech recognition module 225, may process the audio features extracted from the audio data using speech recognition models (e.g., speech recognition model (s) 242) that have been trained. In some embodiments, the speech recognition models may include an acoustic model and a language model. The speech recognition models, e.g., the acoustic model, may be used to separate speech (e.g. voice 152 in FIG. 1) data from other types of audio data (e.g., dog barking, vehicle moving, etc. ) . In some embodiments, the acoustic model may be used to represent relationships between linguistic features, such as phonemes, included in speech and other types of audio signals. The acoustic model may be trained using training data including audio recordings of various types of audio signals and their corresponding labels. The acoustic model may include a suitable model, such as a statistical model associated with statistical properties of speech.

In some embodiments, a language model may be used for inferring likelihood of word sequences. For example, the language model may include a statistical model that predicts a next word or feature based on previous word (s) or features. In some embodiments, the language model may provide context that helps to improve a probability of arranging words and phrases with similar sounds in a proper and meaningful sequence. The acoustic model and the language model may be combined to search for the text sequence with the maximum likelihood.

In some embodiments, the speech recognition models may include a conventional Gaussian Mixed Model -Hidden Markov Model (GMM-HMM model) for performing the speech recognition process in step 306. The GMM-HMM model may be trained in advance (e.g., in a process 320 as described below) to perform Viterbi decoding to find a speech command with highest probability. In some embodiments, a distribution of features may be modeled with the Gaussian Mixture Model (GMM) that is trained with training data. The transition between features phonemes and corresponding observable features can be modeled with the Hidden Markov Model (HMM) . In some embodiments, the GMM-HMM speech recognition model may use Deep Neural Networks (DNNs) , Long Short Term Memory networks (LSTM) , Convolutional Neural Networks (CNNs) , and/or other suitable means known in the art.

In some embodiments, the speech recognition model may be trained using process 320 as shown in FIG. 3. Process 320 and associated or similar processes may be performed by apparatus 200 and stored in database 240, such as speech recognition model (s) 242 in FIG. 2. Process 320 and associated or similar processes may also be performed by another apparatus or system, and then the trained models can be transmitted to apparatus 200 for use as described herein.

In step 322 of training process 320, training data including speech data is obtained from a plurality of users. Training data may be obtained from authorized users, who are permitted to send speech commands to operate UAV 102. Training data may also be collected from any user, authorized (e.g., permitted, preregistered, etc. ) or unauthorized (e.g., without permission or preregistration, etc. ) , to operate UAV 102. In some embodiments, the collected training speech data include speech commands associated with controlling various functions of UAV 102, carrier 106 of UAV 102, one or more sensors of UAV 102, and any controllable component of UAV 102. For example, the training speech data may include speech commands such as “landing, ” “taking off, ” “snapshots, ” “short videos, ” “recording, ” and “hovering, ” etc. In some embodiments, the training speech data may be collected from diverse users speaking various languages, and/or with accents, both sexes, various ages, etc. In some embodiments, the training speech data may be collected at any sampling rate and pre-processed to certain frames with certain duration (s) prior to the training process. In some embodiments, training speech data may also include false instructions or false commands that are not associated with operation instructions of UAV 102. In some embodiments, each piece of training speech data may be labeled with the corresponding text prior to the training process.

In step 324 of training process 320, audio features, such as MFCC features, LPC features, filter-bank features, or bottleneck features can be extracted from the sampled speech data obtained in step 322. In step 326, the speech recognition model, such as the GMM-HMM model, can be trained. In some embodiments, during the training process, the parameters for the HMM model can be estimated using a Baum-Welch algorithm. The GMM model may be trained using a conventional Expectation Maximization (EM) method, and may be trained one or more times to achieve a proper GMM-HMM model.

Referring back to process 300, in step 308, the speech recognition models (e.g., speech recognition model (s) 242) are applied to the speech data to obtain the corresponding speech information. In some embodiments, the obtained speech information is further processed to recognize speech commands that are associated with operating UAV 102. For example, speech commands for controlling one or more function of UAV 102 can be identified, and other speech text, such as people chatting, conversation on a television, or other irrelevant speech may be ignored. In some embodiments, the speech irrelevant to controlling any function of UAV 102 may be excluded in other suitable step (s) . In some embodiments, one or more pre-defined words or phrases associated with operating UAV 102, such as landing, taking off, photo, video, hover, etc., may be used to search and match words or phrases from the speech text transformed from the audio data in step 306.

In step 310, after obtaining the speech commands associated with operating UAV 102, the corresponding operation instructions may be generated, e.g., by operation instruction generation module 232. In some embodiments, operation instructions generated based on the speech commands may be associated with operating or controlling functions of UAV 102, image sensor 107 onboard UAV 102, and/or audio sensor 174 onboard UAV 102. In some embodiments, controlling instructions may include instructions for controlling one or more parameters of UAV 102, image sensor 107, and/or audio sensor 174, including but not limited to, flight direction, flying speed, flying distance, magnitude, flight mode, UAV positions, positions of image sensor 107, positions of audio sensor 174, focal length, shutter speed, start recording video and/or audio data, aerial photography modes, etc.

The operation instructions generated in step 310 may be transmitted to onboard controller 176 of UAV 102 via any suitable communication networks, as described herein. According to the operation instructions, onboard controller 176 can control various actions of UAV 102 (e.g., taking off or landing, ascending or descending, etc. ) , adjust the flight path of UAV 102 (e.g., hovering above a user) , control image sensor 107 (e.g., changing an aerial photography mode, zooming in or out, taking a snapshot, shooting a video, etc. ) , and/or control audio sensor 174 (e.g., starting listening to the environment, repositioning to listen to an identified user, e.g., an authorized user, etc. ) . The operation instructions may cause onboard controller to generate controlling commands to adjust parameters of propulsion devices 104, carrier 106, image sensor 107, and audio sensor 174, separately or in combination, so as to perform operations corresponding to the speech commands. In some embodiments, operation instructions generated based on the speech commands may first be examined by onboard controller 176 of UAV 102 to determine whether it is safe (e.g., not at risk of colliding with an object in the surrounding environment, functions to be performed consuming energy /power supported by the battery of UAV 102, etc. ) to perform the corresponding operations.

FIG. 4 shows a flow diagram of an example process 400 of performing speaker recognition (e.g., using speaker recognition model (s) 244) in accordance with embodiments of the present disclosure. For purposes of explanation and without limitation, process 400 may be performed by one or more modules 220, such as audio obtaining and processing module 224, speaker recognition module 226, authorized user verification module 228, and operation instruction generation 232. Process 400 may be performed based on data and models stored in database 240, such as speaker recognition model (s) 244 and authorized user data 250. One or more steps of process 400 may be performed by hardware and software executing in a device or a system, such as UAV 102, remote control 130, mobile device 140, server 110, or combinations thereof.

In step 402, audio signals are obtained for processing, e.g., by audio obtaining and processing module 224 as shown in FIG. 2. In some embodiments, the audio signals can be detected by one or more sensors, such as audio sensor 174 onboard UAV 102, or sensor (s) of remote control 130 or mobile device 140 as shown in FIG. 1. The audio signals may include human speech (e.g., voice 152) and other audio signals within the ambient environment (e.g., dogs barking, vehicle moving, etc. ) . In some embodiments, the detected audio signals may be transmitted from audio sensor 174 to audio obtaining and processing module 224 onboard UAV 102 for processing. In some embodiments, the detected audio signals may be transmitted from UAV 102 to audio obtaining and processing module 224 in remote control 130, movable device 140, or server 110 via network 120 or other suitable communication technique as discussed in the present disclosure. In some embodiments, the audio signals detected by sensor (s) of remote control 130 or mobile device 140 may be processed locally at the receiving device or transmitted to UAV 102 for processing. In some embodiments, audio obtaining and processing module 224 may pre-process the audio signals substantially similarly as discussed with reference to step 302 to obtain audio data. For example, the audio signals can be pre-processed into frames. The audio signals can also be pre-processed to exclude irrelevant audio information and preserve audio information that can be used for processing in the following steps.

In step 404, apparatus 200, e.g., audio obtaining and processing module 224, may extract features (e.g., acoustic features) from the obtained audio data that are related to recognizing speaker identity, such as i-vectors, GMM supervectors, or cepstral features, etc. In some embodiments, the i-vectors include a set of low-dimensional factors (e.g., compressed from supervectors) to represent a low-dimension subspace (e.g., total variability space) , which contains speaker and session variability. The i-vectors may be represented by eigenvectors with certain eigenvalues.

In some embodiments, other types of features associated with recognizing the speaker identity may include Perceptual Linear Prediction (PLP) features, Linear Prediction Coefficient (LPC) features, Linear Prediction Cepstrum Coefficient (LPCC) features, Meier Frequency Cepstrum System (MFCC) Number characteristics, or other suitable features. The features may be extracted from respective frames of the audio data.

In step 406, apparatus 200, speaker recognition module 226, may process the identity features extracted from the audio data using speaker recognition models (e.g., speaker recognition models 244 in FIG. 2) that have been trained to identify the speaker identity. In some embodiments, the speaker recognition models may include a Gaussian Mixing Model-Universal Background Model (GMM-UBM) . Other types of models or processes can also be used for speaker recognition, such as (JFA) Joint Factor Analysis, machine learning models, or neural network algorithms, for analyzing audio fingerprint from the audio data. In some embodiments, the speaker recognition models may be trained by apparatus 200 and stored in database 240. In some embodiments, the speaker recognition models may be trained by another device or system, and the trained models may then be sent to apparatus 200 for performing speaker recognition.

In some embodiments, a speaker recognition model may include a front-end component and a back-end component. In some embodiments, the front-end component may be used to transform acoustic waveforms into compact and less redundant acoustic features (e.g., Cepstral features) . The front-end component can also be used for speech activity detection (e.g., distinguish speech data from other audio data, such as ambient noise) . For example, the front-end component can retain portions of the waveforms with high signal-to-noise (SNR) ratio. The front-end component can also perform other types of processing, such as normalization, etc.

In some embodiments, the back-end component may be used to identify and verify the speaker identity using the pre-trained models (e.g., speaker recognition models 244) . In some embodiments, models associated with respective speakers (e.g., speaker-specific models) can represent the acoustic (e.g., phonetic) space of each speaker. In some embodiments, the speaker recognition models may be trained based on speech data spoken by a plurality of speakers. The training data may include speech data spoken by one or more authorized users of the movable object. The training data may also include speech data related to speech commands used for controlling one or more functions of the movable object. In some embodiments, identity vectors, such as i-vectors can be extracted from the speech data used for training. The extracted vectors can be used for training the speaker recognition models (e.g., GMM-UBM models) .

In some embodiments, a Universal Background Model (UBM) may be formed from a plurality of speaker-specific models that are obtained based on the training data (e.g., speech data) from a plurality of speakers. For example, the UBM can be obtained using a Gaussian Mixture Model (GMM) with an Expectation Maximization (EM) method. The speaker-specific models may be adapted from the UBM using a maximum aposteriori (MAP) estimation. The UBM model may represent common acoustic characteristics of different speakers. When evaluating speaker recognition models 244, each test segment can be scored against the speaker-specific models to recognize the speaker identity, or against the background model (e.g., the UBM) and a given speaker model to verify whether the speaker identity matches the given speaker model.

In some embodiments, the i-vectors (e.g., obtained in step 404) can be normalized and modeled with a generative factor analysis approach, such as probabilistic LDA (PLDA) . Log-likelihood ratios (LLRs) between speakers can be used for verifying speaker identity.

In some embodiments, the speaker recognition models, e.g., speaker recognition model (s) 244, may be further trained (e.g., registered or customized) after establishing ownership (s) of a particular movable object using speech data spoken by one or more pre-registered users (or authorized users) of the particular movable object (e.g., UAV 102) . For example, after purchasing UAV 102, one or more authorized users may be instructed to read a paragraph of pre-determined text (e.g., prompted on a display device or printed on the manual or packaging box) for collecting the speech data. The identity vectors can be extracted from the speech data, and the GMM-UBM models can be further modified according to the maximum posterior criterion. Accordingly, speaker recognition models 244 used for different movable objects may be different from each other, as each speaker recognition model can be customized (e.g., fine tuned) to have an optimized performance when working with the authorized user (s) of the corresponding movable object.

In step 408, after identifying the speaker identity associated with audio data in step 406, apparatus 200, e.g., authorized user verification module 228, can determine whether the identified speaker is an authorized user of UAV 102. For example, authorized user verification module 228 can compare the speaker identity identified in step 406 against a list of authorized user (s) (e.g., stored in authorized user data 250) who are permitted to control one or more functions associated with at least a part of UAV 102. Authorized user verification module 228 can also use other methods, such as comparing audio fingerprint data extracted from the audio data obtained in

step

402 or 404 with audio fingerprint data stored in authorized user data 250 to determine whether the audio data detected by audio sensor 174 is spoken by an authorized user. In some embodiments, an instruction can be generated by operation instruction generation module 232 to indicate whether the audio data detected by audio sensor 174 is spoken by an authorized user. In some embodiments, instructions can also be generated by speaker recognition module 226 to indicate an identity of a speaker who has spoken the audio data detected by audio sensor 174. In some embodiments, the generated instruction may be transmitted to onboard controller 176 of UAV 102 via any suitable communication network.

FIG. 5 shows a flow diagram of an example process 500 of operating a device, such as a movable object (e.g., UAV 102) , or a system, based on a speech command in accordance with embodiments of the present disclosure. The speech command may be obtained from audio data detected by audio sensor 174 of UAV 102. For purposes of explanation and without limitation, process 500 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2. For example, one or more steps of process 500 may be performed by software executing in a device or a system, such as UAV 102, remote control 130, mobile device 140, server 110, or combinations thereof.

In step 502, audio signals, including speech commands, are received. In some embodiments, the audio signals may be detected by audio sensor 174 onboard UAV 102 or sensor (s) of remote control 130 or mobile device 140. The detected audio signals may be obtained by apparatus 200, such as audio obtaining and processing module 224. In some embodiments, the audio signals may include speech commands (e.g., speech command 152 in FIG. 1) spoken by a user within a certain range of UAV 102, such as a detectable range of audio sensor 174, or within detectable range (s) of sensor (s) of remote control 130 or mobile device 140. In some embodiments, the audio signals may further include other ambient sound or environment noise. In some embodiments, the speech commands are associated with operating the movable object, such as UAV 102. For example, the speech commands may include an instruction to control UAV 102, such as landing, taking off, hovering, changing positions, etc. The speech commands may also include an instruction to control image sensor 107 onboard UAV 102, such as adjusting the position of carrier 106 and/or one or more parameters of image sensor 107. The speech commands may further include an instruction to control audio sensor 174, such as adjusting the position and/or one or more audio parameters of audio sensor 174.

In step 504, an operation mode in which the movable object (e.g., UAV 102) currently operates is determined, e.g., by operation mode control module 230. In some embodiments, the operation mode is associated with a speaker’s authorization to control at least one or more functions of the movable object (e.g., UAV 102) .

In some embodiments, the speaker’s authorization may be associated with a permission, a right, or an eligibility to control UAV 102. In some embodiments, a user who has been granted the speaker’s authorization (e.g., also referred to herein as an authorized user, an authorized operator, or an authorized person) can use speech command (s) to control one or more functions of UAV 102 or components (e.g., image sensor 107 or audio sensor 174) associated with UAV 102 as disclosed. In some embodiments, an authorized user can also control one or more functions associated with UAV 102 or components associated with UAV 102 using instructions in other formats, such as gestures detected by image sensor 107, or user input received via input device (s) 204 (e.g., a touchscreen) . In some embodiments, the speaker’s authorization may be predetermined, preselected, or pre-registered. In some embodiments, the speaker’s authorization may be associated with ownership of UAV 102 (e.g., established through purchase and registration) . For example, only owner of UAV 102 can be granted the speaker’s authorization. In some embodiments, the speaker’s authorization may be associated with an administrative power. For example, one or more users may be granted the administrative power, including speaker’s authorization, to operate UAV 102.

In some embodiments, the movable object, such as UAV 102, may be operable in a plurality of operation modes. In some examples, UAV 1 02 may be able to operate in a first operation mode which permits control of at least one function associated with UAV 102 only by an authorized user. In some examples, a second operation mode permits control of any function associated with UAV 102 by any user, regardless whether the user is authorized or not authorized to control UAV 102 or components of UAV 102.

In some embodiments, in step 506, when operation mode control module 230 determines that the movable object (e.g., UAV 102) currently operates in the first operation mode, it is determined that only an authorized user can be permitted to use speech commands to control at least one function associated with UAV 102.

In some embodiments, the first operation mode may be pre-set to be associated with permitting only an authorized user to control any function associated with UAV 102 and any components associated with UAV 102, such as image sensor 107 and/or audio sensor 174. In some embodiments, out of safety concerns, the first operation mode may be pre-set to be associated with allowing any user to use speech commands to control certain functions (e.g., relatively non-essential functions, such as entertainment related functions) , while permitting only an authorized user to use speech commands to control certain functions, such as important and essential functions, associated with UAV 102 or a component associated with UAV 102, such as image sensor 107 or audio sensor 174. For example, positions and motion of UAV 102 can only be controlled by an authorized user using speech commands to ensure safety. Meanwhile, other functions, such as taking photos or videos using image sensor 107, or repositioning audio sensor 174 to listen to a particular speaker, can be controlled by any user using speech commands.

In some embodiments, in the first operation mode, any user may be able to use speech commands to select certain automatic functions, such as pre-set programs with pre-programmed functions, settings, or parameters. Meanwhile only an authorized user can adjust the parameters or settings or combinations thereof associated with certain programs. For example, when using image sensor 107 onboard UAV 102 for aerial photography, any user may use speech commands to take photos, record videos, record audio, or adjust photography modes for automatic photography functions. For example, when zooming in the camera lens, other parameters, e.g., focal length, ISO, may be automatically adjusted for optimized effect. Meanwhile, only an authorized user, e.g., the owner, may use speech command to adjust the specific photography parameters associated with one or more predetermined programs or modes.

In some embodiments, the first operation mode may be implemented or activated (e.g., by operation mode control module 230) to operate UAV 102 in accordance with determining that at least one predetermined criterion, described below, is satisfied. In some embodiments, the activation of the first operation mode may take place prior to determining an operation mode in step 504. In some embodiments, the first operation mode may be activated in response to a user’s instruction received on input device (s) 204 of user interface 203 to start the first operation mode, such as a speech command detected by audio sensor 174, or a gesture detected by image sensor 107. In some embodiments, the first operation mode may be automatically activated in accordance with detecting that an authorized user is included in a field of view (FOV) of image sensor 107. For example, at least one function of UAV 102 can be controlled by instructions (e.g., speech commands, gesture commands, etc. ) from the authorized user detected in the FOV. In some embodiments, the first operation mode may be automatically activated when UAV 102 is operating in a predetermined scenario, such as a scenario with safety requirements, a scenario associated with at least one essential function of UAV 102, or a scenario that may cause safety concerns for operating UAV 102 without regulating the speaker’s authorization. In some embodiments, in the first operation mode, some examples of applying the first operation mode in a movable object, such as UAV 102, a robot, or an artificial intelligent device or system, are described below.

In some examples, when a plurality of people are talking near UAV 102 at the same time, in order to avoid confusion caused by triggering audio sensor 174 to respond to any audio data from any source and to ensure safety and accuracy for operating UAV 102, apparatus 200, e.g., operation mode control module 230, may automatically start the first operation mode, such that UAV 102 can only be operated by instructions, e.g., speech commands, from an authorized user, e.g., the owner of UAV 102.

In some examples, when UAV 102 is used in an agricultural scenario to help with spraying pesticide, any user may be able to use speech commands to control non-essential features or select pre-programed functions, such as setting boundaries of farmlands, positioning UAV 102 or the spraying equipment onboard UAV 102 relative to the farmlands, or selecting a pre-set program with predetermined parameters. However, only an authorized user, e.g., the owner, can control the action of starting to spray the pesticide onto farmland, selecting or changing a type of pesticide for spraying, or changing specific parameters associated with pre-set programs.

In some examples, apparatus 200 may be used to control a movable object, such as a robot (e.g., an educational robot) or an artificial intelligence module, device, or system integrated or communicatively coupled to the robot, for publishing comments overlaid on a video that is being viewed by the user, such as bullet comments or Danmaku comments. In some embodiments, any user may be able to control non-essential features or to select a program from pre-set programs for publishing the comments, such as adjusting a path for displaying the comments on a display, including parameters such as a direction, a speed, a font size, a font color, etc. However, only an authorized user, such as an owner of the movable object, can instruct the movable object (e.g., via speech commands or other types of commands) to publish the comments.

In some examples, any user can use speech commands to launch a pre-programmed control program of UAV 102, such as automatically adjusting flight movement, gimbal position, flight direction, audio broadcast, light settings, photography mode, or other automatic programs. Once a control program is selected, associated parameters (e.g., height, pitch, yaw, role, speed, volume, brightness, lens parameters, etc. ) can be automatically set to pre-determined values in accordance with the pre-programed settings of the selected control program. However, only an authorized user, such as the owner of UAV 102, can use speech commands to adjust essential parameters for controlling UAV 102, such as the specific parameters (e.g., height, pitch, yaw, role, speed, etc. ) associated with flight movement, flight direction, or gimble attitude.

In some examples, apparatus 200 may be used to remotely control a movable object, such as a robot. For example, any user can user speech commands to select between pre-set programs for radar scanning, or sample collecting, etc. using pre-programed parameters. However, only an authorized user can adjust the specific parameters associated with each pre-set program.

In step 508, in some embodiments, after determining that UAV 102 currently operates in the first operation mode in step 506, it is further verified whether the audio signals received in step 502 includes speech commands that are spoken by a user authorized to operate UAV 102. Various methods or processes can be used for verifying the user’s authorization to operate UAV 102.

FIGs. 6A-6B show examples of controlling a device, such as UAV 102, via speech commands alone or in combination with image recognition based on one or more images captured by image sensor 107 of UAV 102 in accordance with embodiments of the present disclosure. In some embodiments, audio sensor 174 of UAV 102 may detect audio signals, including speech commands 604, spoken by a user 602. In some embodiments, apparatus 200 may perform speaker recognition on the audio signals including speech commands 604 (e.g., received in step 502) in accordance with the steps of process 400. In some embodiments, speaker recognition module 226 may identify an identity of user 602 using speaker recognition model (s) 244 as disclosed herein. In some embodiments, authorized user verification module 228 may determine (e.g., based on authorized user data 250) whether the identified speaker (e.g., user 602) is an authorized user to operate UAV 102. In some embodiments, authorized user verification module 228 may compare audio features extracted from the audio data including speech commands 604 with pre-stored authorized user data 250 to determine whether the speech commands 604 are spoken by an authorized user.

In some embodiments, as shown in FIGs. 6A-6B, apparatus 200 may verify whether speech commands 604 are spoken by an authorized user based on one or more images (e.g., an image 650 in FIG. 6B) captured by image sensor 107 onboard UAV 102. In some embodiments, the audio fingerprint features extracted from speech commands 604 may be insufficient to effectively perform speaker recognition process 400. For example, UAV 102 may be too far away from user 602, ambient noise from the environment may be too loud, user 602 may not speak loudly enough, or illness may change or affect the voice of user 602 and interfere with recognition. In some embodiments, UAV 102 may be working in a sensitive scenario with higher safety or security requirements, and thus an additional modality of speaker authentication may be required (e.g., in addition to speaker recognition based on voice) . Accordingly, speaker authorization verification may be further processed based on the captured image (s) , such as image 650.

In some embodiments, the position and/or parameters of image sensor 107 may be adjusted to capture the one or more images, e.g., image 650, including at least a portion of user 602 (e.g., face and/or hand gesture) . Image 650 may be captured by image sensor 107 and received by apparatus 200, e.g., image obtaining and processing module 222. As shown in FIG. 6B, image 650 includes user 602 associated with speaking speech commands 604. For example, based on time stamps associated with image 650 and speech commands 604, or based on a motion detected on the face of user 602, it can be determined that user 602 is the speaker of speech commands 604. Image 650 may be processed, e.g., by image obtaining and processing module 222, to determine whether user 602 is an authorized user. As discussed herein, image 650 may be processed for verifying speaker authorization in addition to speaker recognition/authorization based on audio features extracted from speech commands 604, for example, when at least two modalities for verifying speaker authorization are required. Image 650 may also be processed for verifying speaker authorization separately and independently from audio feature recognition based on speech commands 604, for example, when audio data is not sufficient for performing speaker recognition using audio data or simply as an alternative way of speaker recognition.

In some embodiments, image obtaining and processing module 222 may recognize one or more gestures (or poses, movements, motions, body indications, etc. ) , such as a gesture 652 or a mouth movement 656 from image 650. In some embodiments, in order to determine a motion or a moving gesture associated with a portion of user 602, more than one image may be acquired for analysis. In some embodiments, locations of a portion of the body of user 602, such as a hand, can be identified in image 650. Then one or more feature points (or key physical points) of the hand may be determined in image 650. In some embodiments, pixel values associated with the detected hand may be converted into feature vectors. In some embodiments, predetermined templates or pretrained models may be used to determine hand gestures 652 or poses based on locations and other characteristics of the one or more key physical points. In some embodiments, in accordance with determining that the detected hand gesture 652 satisfies at least one predetermined criterion, it is determined that the associated user, e.g., user 602, is an authorized user. For example, when it is determined that hand gesture 652 of user 602 (who spoke speech commands 604) is pointing at image sensor 107, user 602 is verified to be an authorized user. In some other examples, when it is determined hand gesture 652 of user 602 is a gesture held up, pointing left, pointing right, pointing down, making a circle in the air, etc., user 602 can be verified to be an authorized user to control UAV 102.

In some embodiments, image obtaining and processing module 222 may perform facial recognition 654 based on image 650. In some embodiments, the face of user 602 may be identified in image 650. Then one or more feature points (or key physical points) of the face may be determined in image 650. In some embodiments, pixel values associated with the detected face or feature points may be converted into feature vectors. In some embodiments, predetermined templates or pretrained models may be used to perform facial recognition 654. In some embodiments, facial recognition 654 may generate a result indicating an identity of user 602. Further, authorized user verification module 228 may determine, based on authorized user data 250, whether user 602 has speaker authorization or another type of authorization to operate UAV 102.

FIGs. 7A-7B show examples of controlling UAV 102 via speech commands and image recognition based on one or more images captured by image sensor 107 of UAV 102 in accordance with embodiments of the present disclosure. In some embodiments, audio sensor 174 of UAV 102 may detect audio data, including speech commands 704. In some embodiments in step 508, whether speech commands 704 are spoken by an authorized user is verified based on one or more images, including image 750, captured by image sensor 107. In some embodiments, image 750 as shown in FIG. 7B includes a plurality of people 700 shown in FIG. 7A. Image 750 may be captured by image sensor 107 and received by apparatus 200, e.g., image obtaining and processing module 222.

In some embodiments as shown in FIG. 7B, when it is unknown toward which person to point image sensor 107 for tracking, which person’s voice is to be collected, or who spoke speech commands 704 among the plurality of people 700, image 750 can be processed (e.g., by image obtaining and processing module 222) to recognize the person who spoke the speech commands 704, e.g., via gestures or poses (e.g., a hand gesture 752) detected in the field of view of image sensor 107 (e.g., when a person is talking while making hand gesture 752) , or movement of a portion of the person’s body associated with speaking speech commands 704, such as person’s mouth is moving or the movement of which person’s mouth is associated with speaking (e.g., a mouth movement 756) . After identifying which person from a plurality of people 700 (e.g., a user 702) moves his mouth in one or more images, including image 750, apparatus 200 (e.g., operation instruction generation module 232) may generate instructions to adjust positions of UAV 102 and audio sensor 174 to “listen to” (e.g., effectively receive) speech commands spoken by user 702. Apparatus 200 may also generate instructions to control UAV 102 and audio sensor 174 to automatically track user 702 and listen to speech commands 704 from user 702. In some embodiments, apparatus 200 may further verify the identity of user 702 who moves his mouth in the view of image sensor 107 and determine whether user 702 is an authorized user. In some embodiments, apparatus 200 can further verify that speech commands 704 are spoken by the identified authorized user, e.g., user 702, using speaker recognition process as discussed in process 400.

In some embodiments, as shown in FIG. 7B, when more than one person are captured in image 750, and when more than one person are talking, such as user 702 speaking speech commands 704 and user 706 speaking speech content 708, apparatus 200, e.g., image obtaining and processing module 222, may process image 750 using facial recognition 754, e.g., by facial recognition module 246, to identify an authorized user, such as the owner of UAV 102. After identifying the authorized user (e.g., user 702) , apparatus 200 (e.g., operation instruction generation module 232) may generate instructions to adjust positions of UAV 102 and audio sensor 174 to “listen to” (e.g., effectively receive) speech commands spoken by user 702. Apparatus 200 may also generate instructions to control UAV 102 and audio sensor 174 to automatically track user 702 and listen to speech commands 704 from user 702.

In some embodiments, when audio sensor 174 detects a plurality of speech commands spoken by a plurality of authorized speakers, such as speech commands 704 spoken by authorized user 702, and speech content 708 spoken by authorized user 706, apparatus 200 may select the speech command from the plurality of the received speech commands to operate UAV 102 based on a time of receipt of the speech command. For example, if speech commands 704 are received prior to speech content 708, apparatus 200 may generate instructions to operate UAV 102 in accordance with speech commands 704. Apparatus 200 may proceed to the next received speech commands after completing the execution of instructions associated with speech commands 704. Apparatus 200 may also select the speech command based on a predetermined priority associated with a speaker of the speech command. For example, if user 702 is preassigned a higher priority level or authorization level than user 706, apparatus 200 may generate instructions to operate UAV 102 in accordance with speech commands 704, rather than speech content 708.

Referring again to FIG. 5, in some embodiments, when it is determined in step 508 that the audio data is not spoken by an authorized user (step 508 -NO) , apparatus 200 foregoes operating UAV 102 in response to the audio data or speech commands contained therein. For example, UAV 102 ignores the audio data detected by audio sensor 174 without taking any action in response. In some embodiments, apparatus 200 may generate a notification to the user to be displayed, broadcasted, or sent in any form by remote control 130, mobile device 140, and/or server 110, to inform or alert the user of receiving audio information from an unauthorized user.

In some embodiments, when it is determined in step 508 that the audio data is spoken by an authorized user (step 508 -YES) , method 500 proceeds to step 520 to perform speech recognition (e.g., by speech recognition module 225) on the audio data to recognize speech commands spoken by the authorized user to control UAV 102.

In step 520, in some embodiments, speech recognition module 225 can perform speech recognition on speech commands spoken by the authorized user, such as speech commands 604 by user 602 or speech commands 704 by user 702. Speech recognition may be performed according to process 300 in FIG. 3.

In step 522, instructions may be generated, e.g., by operation instruction generation module 232, based on speech commands obtained from speech recognition performed in step 520. The movable object (e.g., UAV 102) may be caused to be operated in accordance with the instructions associated with speech commands spoken by the authorized user (e.g., determined in step 520) .

In step 516, when operation mode control module 230 determines in step 504 that the movable object (e.g., UAV 102) currently operates in the second operation mode, it is determined that any user can use speech commands to control any function associated with UAV 102.

Accordingly, method 500 proceeds to step 520 to perform speech recognition on the audio data (e.g., received and processed in step 502) . In some embodiments, apparatus 200, e.g., speech recognition module 225, can perform speech recognition in accordance with process 300 in FIG. 3 to obtain speech commands contained in the audio data to control UAV 102.

In step 522, instructions may be generated, e.g., by operation instruction generation module 232, based on speech commands obtained from speech recognition performed in step 520. The movable object (e.g., UAV 102) may be caused to be operated in accordance with the instructions associated with speech commands spoken by any user.

FIG. 8 shows a flow diagram of an example process 800 of operating a device, such as a movable object (e.g., UAV 102) based on a speech command in accordance with embodiments of the present disclosure. The speech command may be obtained from audio signals detected by audio sensor 174 of UAV 102. For purposes of explanation and without limitation, process 800 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2. For example, one or more steps of process 800 may be performed by software executing in a device or a system, such as UAV 102, remote control 130, mobile device 140, server 110, or combinations thereof.

In step 802, it is determined, e.g., by operation mode control module 230, what operation mode UAV 102 currently operates in. For example, as disclosed herein, operation mode control module 230 determines whether UAV 102 operates in the first or the second operation mode associated with a speaker’s authorization to control at least one function of UAV 102 or a component (e.g., image sensor 107 or audio sensor 174) associated with UAV 102. As described above, the first operation mode permits control of at least one function associated with UAV 102 only by an authorized user, and the second operation mode permits control of any function associated with UAV 102 by any user. Based on the result of step 802, the movable object is caused to operate in accordance with the determined operation mode.

In some embodiments, in step 804, when operation mode control module 230 determines in step 802 that UAV 102 is in the first operation mode, it is determined that only an authorized user can be permitted to use speech commands to control at least one function associated with UAV 102. Various embodiments associated with the first operation mode are described with reference to FIG. 5.

In some embodiments, UAV 102 may automatically initiate the first operation mode in accordance with determining that at least one predetermined criterion is satisfied. The predetermined criteria may include a scenario with higher security or safety requirements, operating UAV 102 in a manner that requires changing parameters associated with one or more essential functions, ensuring safety and security of UAV 102, or any other criteria described herein. The first operation mode may also be activated in response to an instruction from an authorized user, such as a manual selection, a speech command, or a gesture. The first operation mode may also be activated in response to detecting that an authorized user appears in the field of view of image sensor 107.

In step 806, an authorized user may be identified. In some embodiments, the authorized user may be identified based on information detected by one or more sensors, including image sensor 107 and/or audio sensor 174, onboard UAV 102. In some embodiments as described with reference to in FIG. 6B, one or more images (e.g., including image 650) may be captured by image sensor 107, and image obtaining and processing module 222 may perform facial recognition 654 to identify an identity of user 602 included in image 650. In some embodiments, authorized user verification module 228 may further determine, based on authorized user data 250, whether user 602 has speaker authorization or another type of authorization to operate UAV 102. In some embodiments as described with reference to FIGs. 7A-7B, an image (e.g., image 750) including a plurality of people may be captured, and facial recognition 754, may be used to identify an authorized user, e.g., user 702, from the plurality of people 700.

In some embodiments as described with reference to FIG. 6B, hand gesture 652 or other body gestures or poses (e.g., mouth movement 656) may be detected from analyzing image 650. In some embodiments, in accordance with determining that the detected hand gesture 652 satisfies at least one predetermined criterion as described herein, it is determined that user 602 is an authorized user. In some embodiments as described with reference to FIGs. 7A-7B, an image (e.g., image 750) including a plurality of people may be captured, and gesture recognition 752 or mouth movement 756 may be used to identify an authorized user, e.g., user 702. For example, user 702 may be identified in accordance with determining that the mouth of user 702 is moving. User 702 may be further verified to be an authorized user.

In some embodiments, speech content 604 spoken by user 602 and detected by audio sensor 174 or sensor (s) of remote control 130 or mobile device 140 may be analyzed by audio obtaining and processing module 224, speaker recognition module 226, and authorized user verification module 228 to recognize identity and verify speaker authentication of user 602. In some embodiments as described with reference to FIGs. 7A-7B, an image (e.g., image 750) including a plurality of people may be captured, and speaker recognition may be performed to speech commands 704 to identify an authorized user, e.g., user 702.

Other suitable methods can also be used to identify an authorized user, such as a user logging into a previously registered account via user input device (s) 204 to confirm the user’s speaker authentication. In some embodiments, information captured by more than one type of sensor may be required for identifying or verifying an authorized user, such as image (s) captured by image sensor 107 and speech detected by audio sensor 174.

In step 808, a first instruction may be received from the authorized user identified in step 806. In some embodiments, the first instruction may be received by one or more sensors onboard UAV 102. In some embodiments, the first instruction may include speech commands spoken by the identified authorized user, e.g.,

user

602 or 702, and can be detected by audio sensor 174. In some embodiments, the first instruction may be detected by one or more off-board devices communicatively coupled to UAV 102, such as remote control 130 or mobile device 140. The speech commands may be processed using a speech recognition process, such as process 300 in FIG. 3, to identify the commands spoken by the authorized user to control UAV 102. In some embodiments, the first instruction may include a hand or body gesture (e.g., a movement of at least a portion of the user’s body, such as mouth movement 656) associated with the identified authorized user and can be captured in one or more images by image sensor 107. The captured images may be processed to understand the hand or body gesture associated with operating UAV 102. In some embodiments, the first instruction may also be user input from the authorized user and received from input device (s) 204 to control UAV 102.

In some embodiments, after identifying an authorized user in step 806, a position of audio sensor 174 onboard UAV 102 may be adjusted to receive instructions, such as speech commands, from the identified authorized user. For example, UAV 102 and audio sensor 174 may be adjusted for tracking and listening to the authorized user.

In step 810, operation instructions may be generated (e.g., by operation instruction generation module 232) based on the first instruction received in step 808, and UAV 102 can be caused to operate in accordance with the first instruction.

In some embodiments, in step 812, when operation mode control module 230 determines in step 802 that UAV 102 is in the second operation mode, it is determined that any user can be permitted to use speech commands to control any function associated with UAV 102. Various embodiments associated with the second operation mode as described above with reference to FIG. 5.

In step 814, a second instruction may be received from any user. In some embodiments, the second instruction may be received by one or more sensors onboard UAV 102. In some embodiments, the second instruction may include speech commands spoken by any user and can be detected by audio sensor 174. The speech commands may be processed using a speech recognition process, such as process 300 in FIG. 3, to identify the commands spoken by the user to control UAV 102. In some embodiments, the second instruction may include a hand or body gesture from any user included in one or more images captured by image sensor 107. The captured images may be processed to understand the hand or body gesture associated with operating UAV 102. In some embodiments, the second instruction may also be a user input received from input device (s) 204 to control UAV 102.

In step 816, operation instructions may be generated (e.g., by operation instruction generation module 232) based on the second instruction received in step 814, and UAV 102 can be caused to operate in accordance with the second instruction.

In some embodiments, even when UAV 102 operates in the second operation mode in step 818, apparatus 200 determines whether the second instruction is spoken by an authorized user. For example, the second instruction received in step 814 may be processed using speaker recognition process 400 in FIG. 4, and processed by authorized user verification module 228 to determine whether the speech commands are spoken by authorized user. As described herein, other methods such as facial recognition or gesture detection, can also be used for determining whether the second instruction is issued by an authorized user. In accordance with determining that the second instruction is spoken by the authorized user (step 818 -YES) , in step 820, UAV 102 may be operated in a first manner in accordance with the second instruction. For example, a first set of parameters that have been customized by the authorized user may be used to control UAV 102. In accordance with determining that the instruction is not spoken by an authorized user (step 818 -NO) , in step 822, UAV 102 may be operated in a second manner different from the first manner in accordance with the second instruction. For example, a second set of parameters that have been predetermined to be applicable to any unauthorized user may be used to control UAV 102. For example, when audio sensor 174 detects a speech command “rise, ” if it is determined that the speech command is not spoken by an authorized user, a default operation may be performed, such as UAV 102 elevating 10 meters substantially vertically in the air. When it is determined that the “rise” command is spoken by an authorized user, a customized action can be performed, such as UAV 102 elevating upward with a 45-degree oblique projection for 10 meters. The customized action may be specially customized by the particular user who spoke the command, or may be the same for all authorized users.

In some embodiments, when UAV 102 operates in the second operation mode, apparatus 200, e.g., operation mode control module 230, may cause UAV 102 to switch from the second operation mode to the first operation mode in accordance with determining that at least one predetermined criterion is satisfied. The predetermined criteria may be similar to the predetermined criteria for automatically activating the first operation mode as described herein. For example, operation mode control module 230 may cause UAV 102 to switch to the first operation mode when UAV 102 operates in a scenario with higher safety or security requirements, requires changing parameters associated with one or more essential functions, to ensure safety and security of UAV 102, or any other criteria as described herein. The operation mode may also be switched in response to an instruction from an authorized user, such as a manual selection, a speech command, or a gesture. The operation mode may also be switched in response to detecting that an authorized user appears in the field of view of image sensor 107.

FIG. 9 shows a flow diagram of an example process 900 of operating a device, such as a movable object (e.g., UAV 102) , or a system, in accordance with embodiments of the present disclosure. In some embodiments, process 900 is associated with causing UAV 102 to switch between different operation modes, such as the first operation mode (also referred to as “specific speech recognition” and the second operation mode (also referred to as “non-specific speech recognition” ) . As discussed herein, the specific speech recognition mode may permit control of at least one function associated with UAV 102 only by an authorized user, while the non-specific speech recognition mode may permit control of any function associated with UAV 102 by any user. For purposes of explanation and without limitation, process 900 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2.

In step 902, a speech command (e.g., speech command 604) associated with a first person (e.g., user 602) may be received (e.g., by audio obtaining and processing module 224) . The speech command may be detected by audio sensor 174 onboard UAV 102.

In step 904, auxiliary information associated with a second person may be received. In some embodiments, the auxiliary information comprises a user profile associated with the second person. In some embodiments, the user profile comprises speech information (e.g., other speech different from the speech command received in step 902) associated with the second person. For example, the speech information may be detected by audio sensor 174. In some embodiments, the user profile comprises gesture information associated with the second person. For example, the gesture information may be included in one or more images captured by image sensor 107 and analyzed by image obtaining and processing module 222. In some embodiments, the user profile comprises facial information associated with the second person. For example, the facial information may be included in one or more images captured by image sensor 107 and analyzed by image obtaining and processing module 222.

In some embodiments, instructions may be generated by operation instruction generator module 232 to reposition UAV 102 or one or more sensors of UAV 102 to receive the auxiliary information based on the received speech command. For example, after receiving speech command from user 602, image sensor 107 may be repositioned to track user 602 and/or body gestures or poses of user 602, or audio sensor 174 may be repositioned to point to user 602 to receive other speech spoken by user 602.

In step 906, it is determined whether the first person and the second person are the same person based on the received speech and auxiliary information. In some embodiments, the first person may be identified based on an audio fingerprint from the speech command, for example, by applying speaker recognition process 400 in FIG. 4. In some embodiments, the first person may be identified based on image processing, for example, by facial recognition or gesture detection as discussed herein. In some embodiments, the second person associated with the auxiliary information may be determined in accordance with the type of the auxiliary information. When the auxiliary information includes speech information, speaker recognition 400 can be performed on the speech information to identify the speaker. When the auxiliary information includes gesture information or facial information, image processing may be performed on the associated images to identify the second person. It is then decided whether the first person and the second person are the same person. In some embodiments, whether the first person and second person are the same person is further determined based on a machine learning algorithm.

In step 908, it is decided whether to accept the speech command based on the determination of whether the first and second person are the same person. In some embodiments, only when the first person and the second person are the same, is the speech command received in step 902 accepted. In some embodiments, accepting the speech command comprises switching to the specific speech recognition mode.

It is to be understood that the disclosed embodiments are not necessarily limited in their application to the details of construction and the arrangement of the components set forth in the following description and/or illustrated in the drawings and/or the examples. The disclosed embodiments are capable of variations, or of being practiced or carried out in various ways. The types of user control as discussed in the present disclosure can be equally applied to any type of devices or systems, such as any suitable object, device, mechanism, system, machine, or movable object configured to travel on or within a suitable medium, such as a surface, air, water, rails, space, underground, etc. It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed devices and systems. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed devices and systems. It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims

A method for operating a device, comprising:

receiving a speech command associated with operating the device;

determining an operation mode in which the device currently operates, the operation mode associated with a speaker’s authorization to control at least one function of the device; and

causing the device to operate in accordance with the determined operation mode.
The method of claim 1, wherein the operation mode is determined to be a first operation mode permitting control of at least one function associated with the device by an authorized user, the method further comprising:

verifying whether the speech command is spoken by a user authorized to operate the device; and

causing the device to operate in accordance with the speech command spoken by an authorized user that has been verified.
The method of claim 2, wherein verifying whether the speech command is spoken by an authorized user comprises:

receiving one or more images captured by an image sensor onboard the device, the one or more images including a user associated with speaking the speech command; and

processing the one or more images to determine whether the user is an authorized user.
The method of claim 3, wherein processing the one or more images comprises:

recognizing one or more gestures from the one or more images; and

in accordance with determining that the one or more gestures satisfy predetermined criteria, determining that the user is an authorized user.
The method of claim 3, wherein the one or more images are processed using facial recognition to authenticate that the user is an authorized user.
The method of claim 2, wherein verifying whether the speech command is spoken by an authorized user comprises:

receiving an image captured by an image sensor onboard the device, the image including a plurality of people;

processing the image of the plurality of people to identify an authorized user from the plurality of people; and

verifying that the speech command is spoken by the identified authorized user.
The method of claim 2, further comprising:

receiving a plurality of speech commands spoken by a plurality of speakers, respectively; and

selecting, from the plurality of the received speech commands, the speech command to operate the device based on a time of receipt of the speech command or a predetermined priority associated with a speaker of the speech command.
The method of claim 2, wherein verifying whether the speech command is spoken by an authorized user comprises:

analyzing the speech command using a model associated with analyzing an audio fingerprint of the speech command.
The method of claim 2, wherein causing the device to operate in accordance with the speech command comprises:

analyzing the speech command using a model associated with recognizing content of the speech command; and

causing the device to perform an operation in accordance with a result of analyzing the speech command.
The method of claim 2, upon verifying that the speech command is not spoken by an authorized user, the method further comprising:

forgoing operating the device in response to the speech command.
The method of claim 2, further comprising:

activating, prior to determining the operation mode, the first operation mode in accordance with determining that at least one predetermined criterion is satisfied.
The method of claim 1, wherein the operation mode is determined to be a second operation mode permitting control of any function associated with the device by any user, the method further comprising:

analyzing the speech command using a machine learning model associated with recognizing content of the speech command; and

causing the device to operate in accordance with an analysis result of the speech command.
The method of claim 1, wherein the speech command is detected by an audio sensor onboard the device.
The method of claim 1, wherein the speech command is detected by an off-board device communicatively coupled to the device.
An apparatus for operating a device, comprising:

one or more processors; and

memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including:

receiving a speech command associated with operating the device;

determining an operation mode in which the device currently operates, the operation mode associated with a speaker’s authorization to control at least one function of the device; and

causing the device to operate in accordance with the determined operation mode.
The apparatus of claim 15, wherein the operation mode is determined to be a first operation mode permitting control of at least one function associated with the device by an authorized user, the memory further storing instructions for:

verifying whether the speech command is spoken by a user authorized to operate the device; and

causing the device to operate in accordance with the speech command spoken by an authorized user that has been verified.
The apparatus of claim 16, wherein verifying whether the speech command is spoken by an authorized user comprises:

receiving one or more images captured by an image sensor onboard the device, the one or more images including a user associated with speaking the speech command; and

processing the one or more images to determine whether the user is an authorized user.
The apparatus of claim 17, wherein processing the one or more images comprises:

recognizing one or more gestures from the one or more images; and

in accordance with determining that the one or more gestures satisfy predetermined criteria, determining that the user is an authorized user.
The apparatus of claim 17, wherein the one or more images are processed using facial recognition to authenticate that the user is an authorized user.
The apparatus of claim 16, wherein verifying whether the speech command is spoken by an authorized user comprises:

receiving an image captured by an image sensor onboard the device, the image including a plurality of people;

processing the image of the plurality of people to identify an authorized user from the plurality of people; and

verifying that the speech command is spoken by the identified authorized user.
The apparatus of claim 16, wherein the memory further stores instructions for:

receiving a plurality of speech commands spoken by a plurality of speakers, respectively; and

selecting, from the plurality of the received speech commands, the speech command to operate the device based on a time of receipt of the speech command or a predetermined priority associated with a speaker of the speech command.
The apparatus of claim 16, wherein verifying whether the speech command is spoken by an authorized user comprises:

analyzing the speech command using a model associated with analyzing an audio fingerprint of the speech command.
The apparatus of claim 16, wherein causing the device to operate in accordance with the speech command comprises:

analyzing the speech command using a model associated with recognizing content of the speech command; and

causing the device to perform an operation in accordance with a result of analyzing the speech command.
The apparatus of claim 16, upon verifying that the speech command is not spoken by an authorized user, the memory further storing instructions for:

forgoing operating the device in response to the speech command.
The apparatus of claim 16, wherein the memory further stores instructions for:

activating, prior to determining the operation mode, the first operation mode in accordance with determining that at least one predetermined criterion is satisfied.
The apparatus of claim 15, wherein the operation mode is determined to be a second operation mode permitting control of any function associated with the device by any user, the memory further storing instructions for:

analyzing the speech command using a machine learning model associated with recognizing content of the speech command; and

causing the device to operate in accordance with an analysis result of the speech command.
The apparatus of claim 15, wherein the speech command is detected by an audio sensor onboard the device.
The apparatus of claim 15, wherein the speech command is detected by an off-board device communicatively coupled to the device.
A non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising:

receiving a speech command associated with operating the device;

determining an operation mode in which the device currently operates, the operation mode associated with a speaker’s authorization to control at least one function of the device; and

causing the device to operate in accordance with the determined operation mode.
The non-transitory computer-readable medium of claim 29, wherein the operation mode is determined to be a first operation mode permitting control of at least one function associated with the device by an authorized user, the non-transitory computer-readable medium further storing instructions for:

verifying whether the speech command is spoken by a user authorized to operate the device; and

causing the device to operate in accordance with the speech command spoken by an authorized user that has been verified.
The non-transitory computer-readable medium of claim 30, wherein verifying whether the speech command is spoken by an authorized user comprises:

receiving one or more images captured by an image sensor onboard the device, the one or more images including a user associated with speaking the speech command; and

processing the one or more images to determine whether the user is an authorized user.
The non-transitory computer-readable medium of claim 31, wherein processing the one or more images comprises:

recognizing one or more gestures from the one or more images; and

in accordance with determining that the one or more gestures satisfy predetermined criteria, determining that the user is an authorized user.
The non-transitory computer-readable medium of claim 31, wherein the one or more images are processed using facial recognition to authenticate that the user is an authorized user.
The non-transitory computer-readable medium of claim 30, wherein verifying whether the speech command is spoken by an authorized user comprises:

receiving an image captured by an image sensor onboard the device, the image including a plurality of people;

processing the image of the plurality of people to identify an authorized user from the plurality of people; and

verifying that the speech command is spoken by the identified authorized user.
The non-transitory computer-readable medium of claim 30, further storing instructions for:

receiving a plurality of speech commands spoken by a plurality of speakers, respectively; and

selecting, from the plurality of the received speech commands, the speech command to operate the device based on a time of receipt of the speech command or a predetermined priority associated with a speaker of the speech command.
The non-transitory computer-readable medium of claim 30, wherein verifying whether the speech command is spoken by an authorized user comprises:

analyzing the speech command using a model associated with analyzing an audio fingerprint of the speech command.
The non-transitory computer-readable medium of claim 30, wherein causing the device to operate in accordance with the speech command comprises:

analyzing the speech command using a model associated with recognizing content of the speech command; and

causing the device to perform an operation in accordance with a result of analyzing the speech command.
The non-transitory computer-readable medium of claim 30, upon verifying that the speech command is not spoken by an authorized user, the non-transitory computer-readable medium further storing instructions for:

forgoing operating the device in response to the speech command.
The non-transitory computer-readable medium of claim 30, further comprising:

activating, prior to determining the operation mode, the first operation mode in accordance with determining that at least one predetermined criterion is satisfied.
The non-transitory computer-readable medium of claim 29, wherein the operation mode is determined to be a second operation mode permitting control of any function associated with the device by any user, the non-transitory computer-readable medium further storing instructions for:

analyzing the speech command using a machine learning model associated with recognizing content of the speech command; and

causing the device to operate in accordance with an analysis result of the speech command.
The non-transitory computer-readable medium of claim 29, wherein the speech command is detected by an audio sensor onboard the device.
The non-transitory computer-readable medium of claim 29, wherein the speech command is detected by an off-board device communicatively coupled to the device.
A method for operating a device, comprising:

determining whether the device is in a first or a second operation mode associated with a speaker’s authorization to operate the device, the first operation mode permitting control of at least one function associated with the device only by an authorized user, the second operation mode permitting control of any function associated with the device by any user; and

causing the device to operate in accordance with the determined first or second operation mode;

upon determining the device is in the first operation mode, the method further comprising:

identifying the authorized user; and

operating the device in accordance with a first instruction spoken by the identified authorized user; and

upon determining that the device is in the second operation mode, the method further comprising:

receiving a second instruction; and

operating the device in accordance with the received second instruction.
The method of claim 43, wherein identifying the authorized user comprises:

processing one or more images captured by an image sensor onboard the device to recognize the authorized user.
The method of claim 44, wherein processing the one or more images comprises:

recognizing one or more gestures from the one or more images; and

in accordance with determining that the one or more gestures satisfy predetermined criteria, recognizing that the user is the authorized user.
The method of claim 44, wherein the one or more images are processed using facial recognition to recognize that the user is the authorized user.
The method of claim 43, wherein identifying the authorized user further comprises:

processing an audio signal detected by an audio sensor onboard the device to recognize the authorized user.
The method of claim 43, wherein identifying the authorized user further comprises:

processing an audio signal detected by an off-board device communicatively coupled to the device to recognize the authorized user.
The method of claim 43, after identifying the authorized user, the method further comprises:

receiving the first instruction by one or more sensors onboard the device.
The method of claim 43, after identifying the authorized user, the method further comprises:

receiving the first instruction by an off-board device communicatively coupled to the device.
The method of claim 43, after identifying the authorized user, the method further comprises:

adjusting a position of an audio sensor onboard the device to receive the first instruction from the identified authorized user.
The method of claim 43, further comprising:

causing the device to automatically initiate the first operation mode in accordance with determining that at least one predetermined criterion is satisfied.
The method of claim 43, wherein the device is determined to be in the second operation mode, the method further comprises:

causing the device to switch from the second operation mode to the first operation mode in accordance with determining at least one predetermined criterion is satisfied.
The method of claim 43, wherein the device is determined to be in the second operation mode, the method further comprising:

determining whether the second instruction is spoken by an authorized user;

in accordance with determining that the second instruction is spoken by the authorized user, operating the device in a first manner in accordance with the second instruction; and

in accordance with determining that the second instruction is not spoken by the authorized user, operating the device in a second manner in accordance with the second instruction, the second manner being distinct from the first manner.
The method of claim 43, wherein the second instruction is detected by one or more sensors onboard the device.
The method of claim 43, wherein the second instruction is detected by an off-board device communicatively coupled to the device.
An apparatus for operating a device, comprising:

one or more processors; and

memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including:

determining whether the device is in a first or a second operation mode associated with a speaker’s authorization to operate the device, the first operation mode permitting control of at least one function associated with the device only by an authorized user, the second operation mode permitting control of any function associated with the device by any user; and

causing the device to operate in accordance with the determined first or second operation mode;

upon determining the device is in the first operation mode, the method further comprising:

identifying the authorized user; and

operating the device in accordance with a first instruction spoken by the identified authorized user; and

upon determining that the device is in the second operation mode, the method further comprising:

receiving a second instruction; and

operating the device in accordance with the received second instruction.
The apparatus of claim 57, wherein identifying the authorized user comprises:

processing one or more images captured by an image sensor onboard the device to recognize the authorized user.
The apparatus of claim 58, wherein processing the one or more images comprises:

recognizing one or more gestures from the one or more images; and

in accordance with determining that the one or more gestures satisfy predetermined criteria, recognizing that the user is the authorized user.
The apparatus of claim 58, wherein the one or more images are processed using facial recognition to recognize that the user is the authorized user.
The apparatus of claim 57, wherein identifying the authorized user further comprises:

processing an audio signal detected by an audio sensor onboard the device to recognize the authorized user.
The apparatus of claim 57, wherein identifying the authorized user further comprises:

processing an audio signal detected by an off-board device communicatively coupled to the device to recognize the authorized user.
The apparatus of claim 57, after identifying the authorized user, the memory further storing instructions for:

receiving the first instruction by one or more sensors onboard the device.
The apparatus of claim 43, after identifying the authorized user, the method further comprises:

receiving the first instruction by an off-board device communicatively coupled to the device.
The apparatus of claim 57, after identifying the authorized user, the memory further storing instructions for:

adjusting a position of an audio sensor onboard the device to receive the first instruction from the identified authorized user.
The apparatus of claim 57, the memory further storing instructions for:

causing the device to automatically initiate the first operation mode in accordance with determining that at least one predetermined criterion is satisfied.
The apparatus of claim 57, wherein the device is determined to be in the second operation mode, the memory further storing instructions for:

causing the device to switch from the second operation mode to the first operation mode in accordance with determining at least one predetermined criterion is satisfied.
The apparatus of claim 57, wherein the device is determined to be in the second operation mode, the memory further storing instructions for:

determining whether the second instruction is spoken by an authorized user;

in accordance with determining that the second instruction is spoken by the authorized user, operating the device in a first manner in accordance with the second instruction; and

in accordance with determining that the second instruction is not spoken by the authorized user, operating the device in a second manner in accordance with the second instruction, the second manner being distinct from the first manner.
The apparatus of claim 57, wherein the second instruction is detected by one or more sensors onboard the device.
The apparatus of claim 57, wherein the second instruction is detected by an off-board device communicatively coupled to the device.
A non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising, comprising:

determining whether the device is in a first or a second operation mode associated with a speaker’s authorization to operate the device, the first operation mode permitting control of at least one function associated with the device only by an authorized user, the second operation mode permitting control of any function associated with the device by any user; and

causing the device to operate in accordance with the determined first or second operation mode;

upon determining the device is in the first operation mode, the method further comprising:

identifying the authorized user; and

operating the device in accordance with a first instruction spoken by the identified authorized user; and

upon determining that the device is in the second operation mode, the method further comprising:

receiving a second instruction; and

operating the device in accordance with the received second instruction.
The non-transitory computer-readable medium of claim 71, wherein identifying the authorized user comprises:

processing one or more images captured by an image sensor onboard the device to recognize the authorized user.
The non-transitory computer-readable medium of claim 72, wherein processing the one or more images comprises:

recognizing one or more gestures from the one or more images; and

in accordance with determining that the one or more gestures satisfy predetermined criteria, recognizing that the user is the authorized user.
The non-transitory computer-readable medium of claim 72, wherein the one or more images are processed using facial recognition to recognize that the user is the authorized user.
The non-transitory computer-readable medium of claim 71, wherein identifying the authorized user further comprises:

processing an audio signal detected by an audio sensor onboard the device to recognize the authorized user.
The non-transitory computer-readable medium of claim 71, wherein identifying the authorized user further comprises:

processing an audio signal detected by an off-board device communicatively coupled to the device to recognize the authorized user.
The non-transitory computer-readable medium of claim 71, after identifying the authorized user, the non-transitory computer-readable medium further storing instructions for:

receiving the first instruction by one or more sensors onboard the device.
The non-transitory computer-readable medium of claim 71, after identifying the authorized user, the method further comprises:

receiving the first instruction by an off-board device communicatively coupled to the device.
The non-transitory computer-readable medium of claim 71, after identifying the authorized user, the non-transitory computer-readable medium further storing instructions for:

adjusting a position of an audio sensor onboard the device to receive the first instruction from the identified authorized user.
The non-transitory computer-readable medium of claim 71, the non-transitory computer-readable medium further storing instructions for:

causing the device to automatically initiate the first operation mode in accordance with determining that at least one predetermined criterion is satisfied.
The non-transitory computer-readable medium of claim 71, wherein the device is determined to be in the second operation mode, the non-transitory computer-readable medium further storing instructions for:

causing the device to switch from the second operation mode to the first operation mode in accordance with determining at least one predetermined criterion is satisfied.
The non-transitory computer-readable medium of claim 71, wherein the device is determined to be in the second operation mode, the non-transitory computer-readable medium further storing instructions for:

determining whether the second instruction is spoken by an authorized user;

in accordance with determining that the second instruction is spoken by the authorized user, operating the device in a first manner in accordance with the second instruction; and

in accordance with determining that the second instruction is not spoken by the authorized user, operating the device in a second manner in accordance with the second instruction, the second manner being distinct from the first manner.
The non-transitory computer-readable medium of claim 71, wherein the second instruction is detected by one or more sensors onboard the device.
The non-transitory computer-readable medium of claim 71, wherein the second instruction is detected by an off-board device communicatively coupled to the device.
A computer-implemented method for switching between specific and non-specific speech recognition, comprising:

receiving a speech command associated with a first person;

receiving auxiliary information associated with a second person;

determining whether the first and second person are the same person based on the received speech and auxiliary information; and

deciding whether to accept the speech command based on the determination whether the first and second person are the same person.
The computer-implemented method of claim 67, wherein accepting the speech command comprises switching to a specific speech recognition mode.
The computer-implemented method of claim 67, further comprising deciding to accept the speech command; wherein the speech command is accepted only ifthe first and second person are the same person.
The computer-implemented method of claim 67, further comprising sending instructions to reposition to receive the auxiliary information based on the received speech command.
The computer-implemented method of claim 67, wherein the auxiliary information comprises a user profile associated with the second person.
The computer-implemented method of claim 71, wherein the user profile comprises speech information associated with the second person.
The computer-implemented method of claim 71, wherein the user profile comprises gesture information associated with the second person.
The computer-implemented method of claim 71, wherein the user profile comprises facial information associated with the second person.
The computer-implemented method of claim 71, wherein determining whether the first and second person are the same person is further based on a machine learning algorithm.
A device for switching between specific and non-specific speech recognition, comprising:

one or more audio input devices configured to receive a speech command associated with a first person;

one or more sensors configured to receive auxiliary information associated with a second person;

a memory configured to store a set of instructions; and

a processor configured to execute the instructions to cause the device to:

receive the speech command via the one or more audio input devices;

receive the auxiliary information via the one or more sensors;

determine whether the first and second person are the same person based on the received speech and the auxiliary information; and

decide whether to accept the speech command based on the determination of whether the first and second person are the same person.
The device of claim 76, wherein deciding whether to accept the speech command comprises switching to a specific speech recognition mode.
The device of claim 76, further comprising deciding to accept the speech command; wherein the speech command is accepted only ifthe first and second person are the same person.
The device of claim 76, wherein the processor is further configured to send instructions to reposition to receive the auxiliary information based on the received speech command.
The device of claim 76, wherein the auxiliary information comprises a user profile associated with the second person.
The device of claim 80, wherein the user profile comprises speech information associated with the second person.
The device of claim 80, wherein the user profile comprises gesture information associated with the second person.
The device of claim 80, wherein the user profile comprises facial information associated with a second person.
The device of claim 80, wherein the user profile is stored in the memory.
The device of claim 84, wherein determining whether the first and second person are the same person is further based on a machine learning algorithm.
The device of claims 84, wherein the one or more sensors comprise one or more cameras.
A non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising, comprising:

receiving a speech command associated with a first person;

receiving auxiliary information associated with a second person;

determining whether the first and second person are the same person based on the received speech and auxiliary information; and

deciding whether to accept the speech command based on the determination whether the first and second person are the same person.
The non-transitory computer-readable medium of claim 87, wherein accepting the speech command comprises switching to a specific speech recognition mode.
The non-transitory computer-readable medium of claim 87, further comprising deciding to accept the speech command; wherein the speech command is accepted only if the first and second person are the same person.
The non-transitory computer-readable medium of claim 87, further comprising sending instructions to reposition to receive the auxiliary information based on the received speech command.
The non-transitory computer-readable medium of claim 87, wherein the auxiliary information comprises a user profile associated with the second person.
The non-transitory computer-readable medium of claim 91, wherein the user profile comprises speech information associated with the second person.
The non-transitory computer-readable medium of claim 91, wherein the user profile comprises gesture information associated with the second person.
The non-transitory computer-readable medium of claim 91, wherein the user profile comprises facial information associated with the second person.
The non-transitory computer-readable medium of claim 91, wherein determining whether the first and second person are the same person is further based on a machine learning algorithm.