WO2022111579A1 - Voice wakeup method and electronic device - Google Patents

Voice wakeup method and electronic device

Info

Publication number
WO2022111579A1
WO2022111579A1 (PCT/CN2021/133119, CN2021133119W)
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
user
microphone
wake
relative
Prior art date
Application number
PCT/CN2021/133119
Other languages
French (fr)
Chinese (zh)
Inventor
Yucheng Jiang (江昱成)
An Zhao (赵安)
Long Lin (林龙)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2022111579A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L17/24 Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase

Definitions

  • the present application relates to the technical field of terminals, and in particular, to a voice wake-up method and an electronic device.
  • the user can wake up the electronic device by speaking a wake-up word, thereby realizing the interaction between the user and the electronic device.
  • the wake-up word is preset in the electronic device by the user, or the wake-up word is set before the electronic device leaves the factory.
  • the user may set the same wake-up word for multiple devices in order to facilitate memory. For example, the user sets the wake-up word for the smart screen, the smart speaker, and the smart switch to be "Xiaoyi Xiaoyi".
  • the present application provides a voice wake-up method and an electronic device, which help to improve the accuracy of voice wake-up of an electronic device in a multi-device scenario, thereby improving user experience.
  • an embodiment of the present application provides a voice wake-up method, which can be applied to a first electronic device, and relates to the field of terminal artificial intelligence (artificial intelligence, AI).
  • the first electronic device receives the user's voice wake-up instruction; the first electronic device also acquires a user image and detects the user's face orientation; then, according to the relative positions of the first electronic device and at least one second electronic device, the user's position, and the user's face orientation, the first electronic device determines, from among the first electronic device and the at least one second electronic device, the target device toward which the user's face is oriented; finally, the first electronic device instructs the target device to wake up in response to the voice instruction.
  • the first electronic device may have an image acquisition function, in which case it acquires the user image from its image acquisition module; alternatively, the first electronic device may not have an image acquisition function, in which case it acquires the user image from a second electronic device.
  • the first electronic device can determine the device that the user wants to wake up by using the relative positions of the first electronic device and the at least one second electronic device together with the user's face orientation collected by the device. This method helps to improve the accuracy of device wake-up in multi-device scenarios and achieves a relatively good application effect.
  • in a possible design, when the first electronic device determines that the number of candidate devices toward which the user's face is oriented is greater than or equal to two, the first electronic device needs to determine the relative distances between the user and the at least two candidate devices; it then determines the priorities of the candidate devices according to the relative distances, where the smaller the relative distance, the higher the priority of the candidate device; finally, the candidate device with the highest priority is determined as the target device.
  • in other words, when the number of candidate devices is greater than or equal to two, the first electronic device needs to determine the relative distances between the user and the at least two candidate devices; the candidate device corresponding to the minimum relative distance is finally determined as the target device.
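The distance-based selection described above can be sketched as follows. This is a minimal illustration only; the function and device names are hypothetical, and it assumes the location map supplies 2-D coordinates for the user and each candidate device:

```python
import math

def pick_target(user_pos, candidates):
    """Pick the wake-up target among the candidate devices the user is facing.

    user_pos: (x, y) position of the user in the location map.
    candidates: dict mapping device name -> (x, y) device position.
    Returns the candidate with the smallest user-device distance,
    i.e. the candidate with the highest wake-up priority.
    """
    def dist(p):
        return math.hypot(p[0] - user_pos[0], p[1] - user_pos[1])
    return min(candidates, key=lambda name: dist(candidates[name]))
```

For example, with the user at the origin, a device one meter away is selected over a device five meters away.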
  • the first electronic device may acquire information of the first audio of the first electronic device and acquire information of the second audio from the at least one second electronic device; it then determines the user's position according to the information of the first audio and the information of the second audio.
  • the sound collected by the multi-microphone arrays of the electronic devices can be used to effectively determine the user's position, ensuring the accuracy of the positioning result.
  • the first electronic device includes a first microphone and a second microphone; the information of the first audio includes: a first arrival time at which the voice wake-up command reaches the first microphone, and a second arrival time at which the voice wake-up command reaches the second microphone.
  • the first electronic device may determine the user's location according to the information of the first audio and the information of the second audio, which specifically includes the following steps:
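The two-microphone arrival-time geometry above can be illustrated with a minimal sketch. Assuming a far-field source and a known microphone spacing (the function name and parameters are illustrative assumptions, not from the source), the arrival-time difference at the two microphones yields the bearing of the user relative to the microphone axis; bearings from two devices can then be intersected to locate the user:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def bearing_from_tdoa(t1, t2, mic_spacing):
    """Estimate the source angle relative to a two-microphone array.

    t1, t2: arrival times (s) of the wake-up command at mic 1 and mic 2.
    mic_spacing: distance (m) between the two microphones.
    Returns the angle (rad) between the source direction and the mic axis,
    using the far-field approximation delta_d = c * (t1 - t2).
    """
    delta_d = SPEED_OF_SOUND * (t1 - t2)
    # Clamp to the physically valid range before taking acos.
    cos_theta = max(-1.0, min(1.0, delta_d / mic_spacing))
    return math.acos(cos_theta)
```

Equal arrival times indicate a broadside source (90 degrees); the maximum possible time difference indicates an endfire source (0 degrees).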
  • the method further includes: the first electronic device acquiring historical audio information from the first electronic device and the at least one second electronic device;
  • the first electronic device obtains the arrival times and phases, at the different electronic devices, of the voice wake-up commands issued by the user N times, where N is a positive integer; it then determines the relative azimuth angle and distance difference corresponding to each of the N voice wake-up commands issued by the user;
  • the first electronic device uses the relative azimuth angles and the distance differences corresponding to the voice wake-up commands issued by the user N times as the observed values, and establishes an objective function; the first electronic device solves the objective function by an exhaustive search method to obtain the relative positions of the first electronic device and the at least one second electronic device.
  • the first electronic device may locate the relative positions of multiple devices in space according to the above method, and construct a location map including the relative positions of the devices.
  • even in an interference environment, the device can also infer the relative positions of multiple sound pickup devices in reverse through multiple recognitions of the user's voice. As the number of voice wake-up messages increases, the positioning between devices becomes more accurate.
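The exhaustive-search solution of the objective function can be sketched roughly as follows. This is a simplified illustration, not the patent's actual formulation: it assumes the per-command user positions are already known, and it searches a 2-D grid for the position of one device (relative to another fixed at the origin) that minimizes the squared residual against the observed distance differences:

```python
import math

def locate_device(user_positions, observed_diffs, extent=5.0, step=0.1):
    """Exhaustively search a grid for the position of device B relative to
    device A (placed at the origin).

    user_positions: list of (x, y) user positions for N wake-up commands
                    (a simplifying assumption for this sketch).
    observed_diffs: list of N observed distance differences
                    d(user, A) - d(user, B) for the same commands.
    Returns the grid point minimizing the squared residual.
    """
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    best, best_err = None, float("inf")
    steps = int(extent / step)
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            b = (i * step, j * step)
            err = sum(
                (dist(u, (0.0, 0.0)) - dist(u, b) - obs) ** 2
                for u, obs in zip(user_positions, observed_diffs)
            )
            if err < best_err:
                best, best_err = b, err
    return best
```

Each observed distance difference constrains device B to a hyperbola; several wake-up commands from different user positions intersect these constraints and pin down the device position, which is why more wake-up messages yield more accurate positioning.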
  • the first electronic device and the at least one second electronic device are connected to the same local area network, or the first electronic device and the at least one second electronic device are pre-bound with the same user account, or the first electronic device and the at least one second electronic device are bound to different user accounts between which a binding relationship is established.
  • an embodiment of the present application provides a voice wake-up method, the method can be applied to a second electronic device, and the method includes:
  • the second electronic device collects the sound of the surrounding environment and converts it into second audio; the second electronic device then sends the second audio to the first electronic device, and when the second electronic device detects the wake-up word in the second audio, it also sends a wake-up message to the first electronic device.
  • the first electronic device can determine the user's position according to the information of its own first audio and the information of the second audio, and, when it determines from the relative positions of the first electronic device and the at least one second electronic device, the user's position, and the user's face orientation that the target device is the second electronic device, it sends a wake-up response to the second electronic device. After the second electronic device receives the wake-up response from the first electronic device, it responds to the user's voice wake-up command.
  • if the first electronic device determines, according to the relative positions of the first electronic device and the at least one second electronic device, the user's position, and the user's face orientation, that the target device is not the second electronic device, the first electronic device does not send a wake-up response to the second electronic device, or sends a wake-up prohibition response to the second electronic device, and the second electronic device does not respond to the user's voice wake-up instruction.
  • the present application provides a voice wake-up system, which includes a first electronic device and at least one second electronic device.
  • the first electronic device can implement the method of any possible implementation manner of the first aspect, and the at least one second electronic device can implement the method of any possible implementation manner of the second aspect.
  • an electronic device provided by an embodiment of the present application includes: one or more processors and a memory, wherein program instructions are stored in the memory, and when the program instructions are executed by the device, the device implements the methods of the above aspects of the embodiments of the present application and any possible designs involved in those aspects.
  • an embodiment of the present application provides a chip system, wherein the chip system is coupled with a memory in an electronic device, so that when running, the chip system invokes the program instructions stored in the memory to implement the methods of the above aspects of the embodiments of the present application.
  • a computer-readable storage medium stores program instructions, and when the program instructions are run on an electronic device, they enable the device to perform the methods of the above aspects of the embodiments of the present application and any possible designs involved in those aspects.
  • a computer program product of an embodiment of the present application, when run on an electronic device, enables the electronic device to perform the methods of the above aspects of the embodiments of the present application and any possible designs involved in those aspects.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a mobile phone according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • FIG. 4 is an interactive schematic diagram of a voice wake-up method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a wake-up method provided by an embodiment of the present application.
  • FIG. 6A is a schematic diagram of a user location positioning method according to an embodiment of the present application.
  • FIGS. 6B to 6D are schematic diagrams of another application scenario provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of another application scenario provided by an embodiment of the present application.
  • FIG. 8A is a schematic diagram of a device location positioning method according to an embodiment of the present application.
  • FIG. 8B is a schematic diagram of a wake-up speech analysis method provided by an embodiment of the present application.
  • FIG. 8C is a schematic diagram of a device map provided by an embodiment of the present application.
  • FIG. 9 is an interactive schematic diagram of a device location positioning method provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a group of perception capability layers according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a device according to an embodiment of the application.
  • FIG. 12 is a schematic structural diagram of another device according to an embodiment of the present application.
  • the electronic device in the embodiment of the present application is an electronic device with a voice wake-up function, that is, a user can wake up the electronic device by voice. Specifically, the user wakes up the electronic device by speaking the wake-up word.
  • the wake-up word may be preset in the electronic device by the user according to his own needs, or may be set by the electronic device before leaving the factory, and the setting method of the wake-up word is not limited in this embodiment of the present application.
  • the user who wakes up the electronic device may be an arbitrary user or a specific user.
  • the specific user may be a user whose voice speaking the wake-up word is pre-stored in the electronic device, such as the owner of the device.
  • electronic devices trigger device wake-up by detecting whether a wake word is included in the audio. Specifically, when the wake-up word is included in the audio, the electronic device is awakened, otherwise the electronic device is not awakened. After the electronic device is awakened, the user can interact with the electronic device through voice. For example, the wake-up word is "Xiaoyi Xiaoyi", and when the electronic device detects that "Xiaoyi Xiaoyi" is included in the audio, the electronic device is woken up. The electronic device acquires audio by collecting or receiving ambient sound through a multi-microphone array on the device.
  • the voice including the wake-up word spoken by the user may be received or captured by multiple electronic devices, causing two or more electronic devices to wake up, which confuses the user's voice interaction process and degrades the user experience.
  • the priority of each device is usually specified manually. Assuming that the priority of the smart screen in Figure 1 is higher than the priority of the smart speaker, when both the smart screen and the smart speaker collect the "Xiaoyi Xiaoyi" spoken by the user, only the smart screen is awakened.
  • although this method can restrict the wake-up of multiple devices by setting rules, it is not intelligent enough, because the rules need to be set manually in advance and the user can only adjust the wake-up priority of the devices by actively modifying the rules according to actual needs; it therefore suffers from poor flexibility.
  • the embodiment of the present application provides a voice wake-up method.
  • the relative positions of the user and multiple devices in space can be located, and a location map can be constructed; in this way, by combining the location map with the user's face orientation collected by the main device, the device that the user wants to wake up can be determined.
  • this method helps to improve the accuracy of device wake-up in multi-device scenarios and achieves a relatively good application effect.
  • FIG. 2 shows a schematic structural diagram of the electronic device 200 .
  • the electronic device may be a portable terminal including functions such as a personal digital assistant and/or a music player, such as a mobile phone, a tablet computer, a wearable device with wireless communication capabilities (such as a smart watch), or a vehicle-mounted device.
  • exemplary embodiments of the portable terminal include, but are not limited to, portable terminals running the Harmony operating system or other operating systems.
  • the aforementioned portable terminal may also be, for example, a laptop computer (Laptop) having a touch-sensitive surface (eg, a touch panel). It should also be understood that, in some other embodiments, the above-mentioned terminal may also be a desktop computer having a touch-sensitive surface (eg, a touch panel).
  • the electronic device 200 may include a processor 210, an external memory interface 220, an internal memory 221, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2 , mobile communication module 250, wireless communication module 260, audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, sensor module 280, buttons 290, motor 291, indicator 292, camera 293, display screen 294, and Subscriber identification module (subscriber identification module, SIM) card interface 295 and so on.
  • the sensor module 280 may include a pressure sensor 280A, a gyroscope sensor 280B, an air pressure sensor 280C, a magnetic sensor 280D, an acceleration sensor 280E, a distance sensor 280F, a proximity light sensor 280G, a fingerprint sensor 280H, a temperature sensor 280J, a touch sensor 280K, an ambient light sensor, and the like.
  • the structures illustrated in the embodiments of the present application do not constitute a specific limitation on the electronic device 200 .
  • the electronic device 200 may include more or fewer components than shown, or combine some components, or split some components, or have a different component arrangement.
  • the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 210 may include one or more processing units, for example, the processor 210 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural-network processing unit (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the electronic device 200 implements a display function through a GPU, a display screen 294, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 294 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 210 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the electronic device 200 may implement a shooting function through an ISP, a camera 293, a video codec, a GPU, a display screen 294, an application processor, and the like.
  • the SIM card interface 295 is used to connect a SIM card.
  • the SIM card can be brought into contact with or separated from the electronic device 200 by inserting it into the SIM card interface 295 or pulling it out of the SIM card interface 295.
  • the electronic device 200 may support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • the SIM card interface 295 can support Nano SIM cards, Micro SIM cards, SIM cards, and the like.
  • the same SIM card interface 295 can insert multiple cards at the same time.
  • the types of the plurality of cards may be the same or different.
  • the SIM card interface 295 can also be compatible with different types of SIM cards.
  • the SIM card interface 295 is also compatible with external memory cards.
  • the electronic device 200 interacts with the network through the SIM card to realize functions such as call and data communication.
  • the electronic device 200 employs an eSIM, ie: an embedded SIM card.
  • the wireless communication function of the electronic device 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modulation and demodulation processor, the baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • each antenna in the electronic device 200 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
  • the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 250 may provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the electronic device 200 .
  • the mobile communication module 250 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), and the like.
  • the mobile communication module 250 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 250 can also amplify the signal modulated by the modulation and demodulation processor, and then convert it into electromagnetic waves for radiation through the antenna 1 .
  • at least part of the functional modules of the mobile communication module 250 may be provided in the processor 210 .
  • at least part of the functional modules of the mobile communication module 250 may be provided in the same device as at least part of the modules of the processor 210 .
  • the wireless communication module 260 can provide wireless communication solutions applied on the electronic device 200, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
  • the wireless communication module 260 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 260 receives electromagnetic waves via the antenna 2 , modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 210 .
  • the wireless communication module 260 can also receive the signal to be sent from the processor 210 , perform frequency modulation on the signal, amplify the signal, and then convert it into an electromagnetic wave for radiation through the antenna 2 .
  • the antenna 1 of the electronic device 200 is coupled with the mobile communication module 250, and the antenna 2 is coupled with the wireless communication module 260, so that the electronic device 200 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), broadband Code Division Multiple Access (WCDMA), Time Division Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), BT, GNSS, WLAN, NFC , FM, and/or IR technology, etc.
  • the structures illustrated in FIG. 2 do not constitute a specific limitation on the electronic device 200, and the electronic device 200 may also include more or fewer components than those shown in the figure, or combine some components, or split some components, or have a different arrangement of components.
  • the combination/connection relationship between the components in FIG. 2 can also be adjusted and modified.
  • the software system of the electronic device may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiments of the present application take a layered architecture as an example, where the layered architecture may include a Harmony operating system or other operating systems.
  • the voice wake-up method provided by the embodiment of the present application may be applicable to a terminal integrated with the foregoing operating system.
  • FIG. 2 shows the hardware structure of an electronic device to which the embodiments of the present application are applicable.
  • the embodiment of the present application provides a voice wake-up method, which can utilize a location map including the user's location and the devices' locations, together with the user's face orientation collected by a device, to accurately determine the device the user wants to wake up, improving the accuracy of directional device wake-up results in multi-device scenarios.
  • FIG. 3 is a schematic diagram of a multi-device scenario to which this embodiment of the present application is applied.
  • the electronic device 10, the electronic device 20, and the electronic device 30 are all sound pickup devices with multi-microphone arrays, and the same wake-up word, for example "Xiaoyi Xiaoyi", is preset on all of them.
  • when the user speaks "Xiaoyi Xiaoyi", the electronic device 10, the electronic device 20, and the electronic device 30 can all collect or receive the voice.
  • the electronic device 30 can use the wake-up voice to determine the user's position in the device map obtained by pre-training, and then use the user's face orientation together with a location map including the user's position to determine which of the electronic device 10, the electronic device 20, and the electronic device 30 is the target device toward which the user's face is oriented.
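The face-orientation check can be sketched minimally as follows. The function name, field-of-view threshold, and coordinate conventions are illustrative assumptions, not from the source: a device is treated as a wake-up candidate if the bearing from the user's position to the device falls within an angular window around the estimated face orientation:

```python
import math

def facing_targets(user_pos, face_angle, devices, fov_deg=30.0):
    """Return the devices the user's face is oriented toward.

    user_pos: (x, y) user position from the location map.
    face_angle: face orientation in radians (0 = +x axis), e.g. estimated
                from the captured user image.
    devices: dict mapping device name -> (x, y) position.
    A device is a candidate if the bearing from the user to it deviates
    from the face orientation by at most half the assumed field of view.
    """
    half_fov = math.radians(fov_deg) / 2
    candidates = []
    for name, (x, y) in devices.items():
        bearing = math.atan2(y - user_pos[1], x - user_pos[0])
        # Wrap the angular difference into [-pi, pi] before comparing.
        diff = abs((bearing - face_angle + math.pi) % (2 * math.pi) - math.pi)
        if diff <= half_fov:
            candidates.append(name)
    return candidates
```

If this returns more than one candidate, the distance-based priority rule described earlier can break the tie.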
  • FIG. 3 is only an example of a multi-device scenario, and the embodiment of the present application does not limit the number of electronic devices in the multi-device scenario, nor does it limit the pre-set wake-up words in the electronic devices.
  • the electronic device 30 may not collect the face image, but obtain the face image collection result from other devices (such as the electronic device 10 or the electronic device 20 ).
  • the electronic device 30 may be a central device with strong data processing capabilities, such as a smart speaker or a smart screen in a smart home scenario.
  • for ease of description, it is assumed below that the electronic device 30 has a face image acquisition function and serves as the central device.
  • a voice wake-up method according to an embodiment of the present application will now be specifically described.
  • the method flow specifically includes the following steps.
  • in steps 401a to 401c, the electronic device 10, the electronic device 20, and the electronic device 30 all collect ambient sound in real time and convert the collected ambient sound into audio.
  • the multi-microphone array of the electronic device 10 collects ambient sounds and converts the collected ambient sounds into audio
  • the multi-microphone array of the electronic device 20 collects ambient sounds and converts the collected ambient sounds into audio
  • the multi-microphone array of the electronic device 30 collects ambient sound and converts the collected ambient sound into audio.
  • for example, the user issues a voice wake-up command of "Xiaoyi Xiaoyi" while facing the smart screen; the smart screen, the smart speaker, and the smart switch all collect the ambient sound and convert it into audio, so the sound collected by the smart screen, the smart speaker, and the smart switch will include the user's wake-up voice.
  • in steps 402a to 402c, the electronic device 10, the electronic device 20, and the electronic device 30 all perform wake-up word detection on the generated audio.
  • the electronic device 10, the electronic device 20, and the electronic device 30 can perform one-dimensional convolution on the data in each sliding window of the audio collected by their own devices and extract the features of different frequency bands in the data; when an audio segment consistent with the user's preset voice characteristics is recognized, the audio segment includes the wake-up word; otherwise, it does not include the wake-up word.
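A toy stand-in for this front end might look as follows. This is only a sketch: the learned one-dimensional convolutions of a real detector are replaced by a fixed smoothing kernel and log-energy features, and all names and thresholds are assumptions:

```python
import math

def frame_energy_features(samples, window=400, hop=160, kernel=(0.25, 0.5, 0.25)):
    """Toy wake-word front end: slide a window over the audio samples,
    smooth each window with a small 1-D convolution kernel, and reduce it
    to a log-energy feature per frame.
    """
    feats = []
    for start in range(0, len(samples) - window + 1, hop):
        frame = samples[start:start + window]
        # 1-D convolution ("valid" mode) with the smoothing kernel.
        smoothed = [
            sum(frame[i + k] * kernel[k] for k in range(len(kernel)))
            for i in range(len(frame) - len(kernel) + 1)
        ]
        energy = sum(v * v for v in smoothed)
        feats.append(math.log(energy + 1e-9))
    return feats

def matches_template(feats, template, threshold=1.0):
    """Declare a wake-word hit if the mean absolute difference between the
    extracted features and a stored template is below a threshold."""
    if len(feats) != len(template):
        return False
    mad = sum(abs(a - b) for a, b in zip(feats, template)) / len(feats)
    return mad < threshold
```

In a real detector the features would come from a trained network and the template comparison would be a classifier score, but the sliding-window structure is the same.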
  • in steps 403a to 403b, when the electronic device 10 and the electronic device 20 detect the wake-up word, both send a wake-up message to the electronic device 30, together with the audio information generated by their own devices.
  • when the electronic device 10 detects the wake-up word, it sends a first wake-up message and audio information to the electronic device 30, where the first wake-up message is used to request confirmation of whether to wake up the electronic device 10; when the electronic device 20 detects the wake-up word, it sends a second wake-up message and audio information to the electronic device 30, where the second wake-up message is used to request confirmation of whether to wake up the electronic device 20.
  • the audio information may include all data of the audio, or the audio information may include information such as arrival time and phase related to the voice wake-up command.
  • the information of the first audio generated by the electronic device 30 includes: the first arrival time at which the voice wake-up command reaches the first microphone, the second arrival time at which the voice wake-up command reaches the second microphone, the first phase at which the voice wake-up command reaches the first microphone, and the second phase at which the voice wake-up command reaches the second microphone.
  • the first arrival time refers to the earliest time at which the first microphone picks up the voice wake-up command
  • the second arrival time refers to the earliest time at which the second microphone picks up the voice wake-up command.
  • the electronic device 20 includes a third microphone and a fourth microphone; the information of the second audio generated by the electronic device 20 includes: the third arrival time at which the voice wake-up command reaches the third microphone, the fourth arrival time at which the voice wake-up command reaches the fourth microphone, the third phase at which the voice wake-up command reaches the third microphone, and the fourth phase at which the voice wake-up command reaches the fourth microphone.
  • the third arrival time refers to the time when the third microphone picks up the voice at the earliest to the voice wake-up command
  • the fourth arrival time refers to the first time when the fourth microphone picks up the voice wake-up command.
  • the electronic device 10 , the electronic device 20 , and the electronic device 30 may further include other microphones, and the number of microphones is not limited in the embodiment of the present application, and other microphones may also collect sound according to the above method.
  • the electronic device 10 or the electronic device 20 does not detect the wake-up word, there is no need to send a wake-up message to the electronic device 30 , and the electronic device 10 or the electronic device 20 only needs to send audio information to the electronic device 30 .
  • the electronic device 20 does not detect a wake-up word, it does not need to send a wake-up message to the electronic device 30 , and the electronic device 20 only needs to send audio information to the electronic device 30 , which is not shown one by one in this embodiment.
  • Step 403c: when the electronic device 30 also detects the wake-up word, it also generates a third wake-up message; otherwise, the third wake-up message is not generated.
  • the third wake-up message is used to request confirmation whether to wake up the electronic device 30 .
  • Step 404: the electronic device 30 determines the relative position of the user in the device map according to the pre-trained device map and the audios collected by any two of the electronic device 10, the electronic device 20, and the electronic device 30, thereby generating a location map that includes the user's location.
  • the information between the electronic devices is synchronized based on a multi-device interconnection technology (such as HiLink (a multi-device interconnection technology)).
  • the electronic device 10, the electronic device 20, and the electronic device 30 can be connected to the same local area network.
  • the electronic device 10, the electronic device 20, and the electronic device 30 can be pre-bound with the same user account (such as a HUAWEI ID); alternatively, the electronic device 10, the electronic device 20, and the electronic device 30 can be bound with different user accounts that have a binding relationship with one another (such as pre-binding a family member's user account, that is, authorizing one's own device to connect with the family member's device) to ensure secure communication between devices.
  • the first microphone and the second microphone of the electronic device 30 both collect sound, and record the information of the first audio.
  • the electronic device 30 may determine the first direction angle between the electronic device 30 and the user according to the phase difference between the first phase with which the voice wake-up command reaches the first microphone of the electronic device 30 and the second phase with which it reaches the second microphone of the electronic device 30; the electronic device 30 obtains the information of the second audio from the electronic device 20, so the electronic device 30 can determine the second direction angle between the electronic device 20 and the user according to the phase difference between the third phase and the fourth phase of the electronic device 20.
  • In addition, the electronic device 30 can determine the first relative distance between the electronic device 30 and the user according to the time difference between the first arrival time and the second arrival time of the electronic device 30, and can likewise determine the second relative distance between the electronic device 20 and the user according to the time difference between the third arrival time and the fourth arrival time of the electronic device 20. In this way, the user's position can be determined by combining the first azimuth angle, the second azimuth angle, the first relative distance, and the second relative distance.
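The time-difference and phase-difference computations above can be sketched as follows. This is a minimal illustration only, not the patented implementation: it assumes a far-field source, a known microphone spacing, and a narrowband signal for the phase-based variant; all function and variable names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value at roughly 20 degrees C

def doa_from_time_difference(t1, t2, mic_spacing):
    """Estimate the direction angle of a sound source relative to a
    two-microphone array from the difference of its arrival times.
    Returns the angle (radians) from the array broadside; 0 means the
    source is directly in front of the microphone pair."""
    path_diff = SPEED_OF_SOUND * (t2 - t1)  # extra distance travelled to mic 2
    # clamp to the physically valid range before arcsin
    ratio = max(-1.0, min(1.0, path_diff / mic_spacing))
    return math.asin(ratio)

def doa_from_phase_difference(phase1, phase2, mic_spacing, frequency):
    """Same estimate from the phase difference of a narrowband component."""
    wavelength = SPEED_OF_SOUND / frequency
    path_diff = (phase2 - phase1) * wavelength / (2 * math.pi)
    ratio = max(-1.0, min(1.0, path_diff / mic_spacing))
    return math.asin(ratio)
```

For a 0.1 m microphone spacing, a zero time difference places the source broadside to the array, while a path difference equal to the spacing corresponds to an angle of ±90°.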
  • point A refers to the position of the electronic device 20 in the pre-trained device map
  • point B refers to the position of the electronic device 30 in the pre-trained device map
  • θA is the first azimuth angle of the electronic device 20 relative to the user
  • θB is the second azimuth angle of the electronic device 30 relative to the user
  • PA is the second relative distance of the electronic device 20 relative to the user
  • PB is the first relative distance of the electronic device 30 relative to the user.
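The geometry of FIG. 6A — two known device positions plus one bearing from each device toward the user — determines the user's position as the intersection of two rays. The following is a sketch under an assumed map coordinate frame; the function name and frame convention are illustrative, not the patent's actual algorithm:

```python
import math

def locate_user(ax, ay, theta_a, bx, by, theta_b):
    """Intersect the two bearing rays from devices A and B to find the
    user's position P. theta_a / theta_b are the directions (radians,
    in the map's coordinate frame) from each device toward the user.
    Returns (px, py), or None when the bearings are parallel."""
    # Ray A: (ax + t*cos(theta_a), ay + t*sin(theta_a))
    dax, day = math.cos(theta_a), math.sin(theta_a)
    dbx, dby = math.cos(theta_b), math.sin(theta_b)
    denom = dax * dby - day * dbx  # 2-D cross product of the two directions
    if abs(denom) < 1e-9:
        return None
    t = ((bx - ax) * dby - (by - ay) * dbx) / denom
    return (ax + t * dax, ay + t * day)
```

For example, devices at (0, 0) and (2, 0) with bearings of 45° and 135° place the user at (1, 1).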
  • Alternatively, the electronic device 30 can also determine the direction angle between the electronic device 10 and the user according to the phase differences with which the voice wake-up command reaches the different microphones of the electronic device 10. The localization algorithm may be, for example, TDoA (a localization method based on time differences of arrival) or MUSIC (a sound source localization method), which is not limited in the embodiments of the present application.
  • the user location determined in the embodiment of the present application refers to a relative location. For example, the user to be located is due south of the electronic device 30, and the distance between the user and the electronic device 30 is 1 meter.
  • step 405 the electronic device 20 collects an image of the user and detects the orientation of the face.
  • the user is facing the smart screen in Figure 1 and sends out a wake-up voice of "Xiaoyi Xiaoyi", and the camera on the smart screen takes pictures or videos of the user.
  • the smart screen analyzes the face image to determine that the user's face is facing the smart screen.
  • the user is facing the smart speaker next to the smart screen in Figure 1 and sends out a wake-up voice of "Xiaoyi Xiaoyi", and the camera on the smart screen takes pictures or videos of the user.
  • the smart screen determines that the user's face is facing the first azimuth (for example, the first azimuth is the front left of the user).
  • this embodiment takes the smart screen as the main control device (or the central device) as an example for description.
  • the smart screen has an image acquisition function, so the user image collected by the smart screen is preferentially used for face orientation detection. If the main control device (or central device) does not have an image acquisition function, the user image may also be acquired from another electronic device with a face acquisition function and then analyzed; such cases are not illustrated one by one here.
  • Step 406 the electronic device 30 determines that the target device to which the user's face faces is the electronic device 10 according to the face orientation and the location map including the user's location.
  • FIG. 6B the location map determined by the electronic device 30 and corresponding to the multi-device scenario shown in FIG. 3 is shown in FIG. 6B .
  • the field of view corresponding to the user's face orientation is the range indicated by the angle shown in the figure, and the electronic device 10 falls within this range.
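The check in step 406 — whether a device lies within the field of view centred on the user's face orientation — can be sketched as a simple angular comparison. Angles are in radians and the names are illustrative:

```python
import math

def device_in_view(px, py, face_angle, fov, dx, dy):
    """Return True if the device at (dx, dy) falls inside the field of
    view of width `fov` centred on the user's face orientation
    `face_angle`, with the user standing at (px, py)."""
    bearing = math.atan2(dy - py, dx - px)
    # wrap the angular difference into (-pi, pi] before comparing
    diff = (bearing - face_angle + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= fov / 2
```

A device directly ahead of the user is in view; one at 90° to the face orientation falls outside a 90°-wide field of view.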
  • Step 407a: when the electronic device 30 determines that the target device is the electronic device 10, it sends a wake-up permission instruction to the electronic device 10 to instruct the electronic device 10 to respond to the user's wake-up voice.
  • Optionally, this embodiment may further include the following steps: step 407b, when it is determined that the target device is the electronic device 10, the electronic device 30 may further indicate that the electronic device 30 itself is not to be awakened; and step 407c, the electronic device 30 sends a wake-up prohibition instruction to the electronic device 20, where the wake-up prohibition instruction is used to instruct the electronic device 20 not to respond to the user's wake-up voice.
  • In other words, when it is determined that the target device is the electronic device 10, the electronic device 30 ensures that neither the electronic device 30 itself nor the electronic device 20 responds to the user's wake-up voice.
  • In summary, the first electronic device receives the user's voice wake-up command; the first electronic device obtains the user's image locally or from other devices, and detects the user's face orientation; then, according to the relative positions of the first electronic device and at least one second electronic device (such as the electronic device 10 and the electronic device 20), the user's position (such as the point P), and the user's face orientation, it determines, from among the first electronic device and the at least one second electronic device, the target device that the user's face is oriented toward (e.g., the target device is the electronic device 10), and instructs the target device to respond to the voice wake-up command.
  • The above method can effectively mitigate the problem of "one call, multiple responses" or "multiple calls for one response", so that when the user issues a wake-up command to the electronic device 10, the electronic device 10 makes a voice interaction response while the other devices do not respond.
  • Moreover, not every electronic device in the multi-device scenario needs to have an image capture function: as long as one device in the scenario has an image capture function and can capture a face image, the device the user is currently facing can be determined by combining the location map with the face image, and the targeted device itself may have no image acquisition function at all. To a certain extent, this gives the method wider application scenarios.
  • In addition, if the user moves, the user's current position can still be relocated according to the methods shown in steps 401a to 404 above, so that, combined with the collected face image, the user's position after moving can still be accurately located.
  • the electronic device 30 can locate the user's position from the position P1 to the position P2 according to the above method. It can be seen that this method can locate the current position of the vocal user in real time.
  • the target device can also be determined to be the electronic device 20 shown in FIG. 6C in combination with the currently collected face image and the current position P2 of the user. It can be seen that, according to the above method, the awakened device can always be located in a directional manner, and is not limited by whether the user's position changes.
  • When two or more candidate devices fall within the face orientation range, the relative distances between the candidate devices and the user's position can be further combined to determine the target device; that is, the candidate device with the shortest relative distance is selected from the two or more candidate devices as the target device. In other words, the electronic device 30 calculates the relative distance between each candidate device within the face orientation range and the user's position, and then determines the wake-up priority of each candidate device according to the relative distance: the device with the smaller relative distance has the higher wake-up priority.
  • the electronic device 30 can directionally wake up the candidate device with the highest priority.
  • the electronic device 30 can locate the user's position as it moves from P1 to P2 according to the above method, and then combine the currently collected face image with the user's face orientation.
  • candidate devices within the visual field corresponding to the face orientation are determined to include electronic device 20 (smart speaker as shown in the figure) and electronic device 40 (smart alarm clock as shown in the figure).
  • the electronic device 30 determines that the relative distance between the electronic device 20 and P2 is D2, and that the relative distance between the electronic device 40 and P2 is D1. Since D1 is greater than D2, the wake-up priority of the electronic device 20 is higher, and therefore the electronic device 30 determines that the electronic device 20 is the target device to be awakened.
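The distance-based priority rule above amounts to choosing the nearest in-view candidate. A minimal sketch, with illustrative device names and map positions:

```python
import math

def pick_target(user_pos, candidates):
    """Among the candidate devices inside the face-orientation range,
    pick the one closest to the user (highest wake-up priority).
    `candidates` maps a device name to its (x, y) map position."""
    ux, uy = user_pos
    return min(candidates,
               key=lambda name: math.hypot(candidates[name][0] - ux,
                                           candidates[name][1] - uy))
```

With the user at the origin, a speaker 1.1 m away wins over an alarm clock 2.5 m away.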
  • a pre-trained device map is required, that is, a relative position map of each device in a multi-device scenario needs to be constructed first, and then the relative position of the user in the device map can be determined.
  • an embodiment of the present application provides a method for training a device map, which reversely deduces the relative positions of a plurality of sound pickup devices by performing voice analysis on a user's historical wake-up speech. The construction of the device map is exemplarily given below with reference to FIGS. 7 to 9 .
  • FIG. 7 it is a schematic diagram of a multi-device scenario provided by an embodiment of the present application.
  • a plurality of sound pickup devices are deployed, wherein the main device with the image acquisition module is the smart screen device 71 in the figure, and sound pickup devices 72a to 72f are also deployed in the space where the smart screen device 71 is located.
  • multiple speakers are used to represent the sound pickup devices 72a to 72f.
  • the speakers can also be replaced with devices such as smart alarm clocks, smart cameras, and smart switches, which are not illustrated one by one here; in the figure, only sound boxes are used to represent the sound pickup devices around the smart screen device 71.
  • the user may wake up any sound pickup device at any position in the scene.
  • the user may sit on the sofa and call "Xiaoyi Xiaoyi” to wake up the smart screen device.
  • the user calls "Xiaoyi Xiaoyi” facing the sound pickup device 72a (such as a speaker) to wake up the sound pickup device 72a.
  • each pickup device records the user's historical wake-up voice and synchronizes it to the central device.
  • the central device can be the smart screen device shown in Figure 7, or it can be a smart speaker or router.
  • the sounding position when the user calls "Xiaoyi Xiaoyi" for the first time is the sounding point P1
  • the sounding position when the user calls "Xiaoyi Xiaoyi” for the second time is the sounding point P2.
  • the pickup device 72a and the pickup device 72b are awakened by two wake-up voices as an example.
  • the sound pickup device 72a and the sound pickup device 72b convert the collected sounds into audio and then synchronize to the central device.
  • the central device can perform voiceprint analysis on the sounds collected by the sound pickup device 72a and the sound pickup device 72b respectively.
  • the sound pickup device 72a and the sound pickup device 72b can perform one-dimensional convolution on the data in each sliding window of the audio collected by their own devices, and extract the voiceprint features of different frequency bands in the data; when an audio clip whose voiceprint features match those of the preset wake-up voice is identified in the audio, it means that the audio clip includes the wake-up word issued by the user; otherwise, the wake-up word issued by the user is not included.
  • the central device can further filter the audio, filtering out audio clips that overlap with other audio in the same period, so as to improve the accuracy of the calculation result of the device map.
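The sliding-window voiceprint matching described above might be sketched as follows. This is a deliberately crude stand-in — per-band spectral energies compared by cosine similarity — rather than the convolutional feature extractor of the embodiment; the window size, hop, band count, and threshold are all illustrative.

```python
import numpy as np

def band_features(window, n_bands=8):
    """Crude per-band spectral energies for one analysis window,
    standing in for the voiceprint features described above."""
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window))))
    bands = np.array_split(spectrum, n_bands)
    feats = np.array([b.sum() for b in bands])
    norm = np.linalg.norm(feats)
    return feats / norm if norm > 0 else feats

def contains_wake_word(audio, template, win=4096, hop=2048, threshold=0.9):
    """Slide a window over the audio; report True if any window's
    feature vector is close (cosine similarity) to the template."""
    for start in range(0, max(1, len(audio) - win + 1), hop):
        feats = band_features(audio[start:start + win])
        if float(np.dot(feats, template)) >= threshold:
            return True
    return False
```

A window whose band-energy profile matches the stored template is flagged as containing the wake word; spectrally distant audio is rejected.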
  • a polar coordinate system A is established on the sound pickup device 72a
  • a polar coordinate system B is established on the sound pickup device 72b
  • a rectangular coordinate system C is established with the line through points A and B as the X axis.
  • θA1 is the angle between the line from the sounding point P1 to point A and the x-axis of the rectangular coordinate system C
  • θB1 is the angle between the line from the sounding point P1 to point B and the x-axis of the rectangular coordinate system C
  • θA2 is the angle between the line from the sounding point P2 to point A and the x-axis of the rectangular coordinate system C
  • θB2 is the angle between the line from the sounding point P2 to point B and the x-axis of the rectangular coordinate system C
  • dA2 is the distance from the sounding point P2 to point A
  • dB2 is the distance from the sounding point P2 to point B.
  • the distance difference dA1 − dB1 can be calculated using the time difference τ1 between the arrivals of the wake-up signal, sent by the user at point P1, at the polar coordinate system A and the polar coordinate system B (that is, at the sound pickup device 72a and the sound pickup device 72b), that is: a1 = dA1 − dB1 = v · τ1, where:
  • dA1 is the distance from sound point P1 to point A
  • dB1 is the distance from sound point P1 to point B
  • τ1 is the time difference between the arrivals of the wake-up signal sent by the user at point P1 at the polar coordinate system A and the polar coordinate system B, that is, the difference between the arrival times of the user's wake-up signal at the two different electronic devices
  • v is the speed of sound transmission
  • a1 is the distance difference dA1 − dB1, that is, the difference between the distances from the user's sounding position to the two different electronic devices.
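As a concrete check of the relation a1 = dA1 − dB1 = v · τ1 (with v the speed of sound, assumed here to be 343 m/s at room temperature):

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value at roughly 20 degrees C

def distance_difference(tau):
    """a = dA - dB: difference of the distances from the sounding point
    to the two pickup devices, given the arrival-time difference tau (s)."""
    return SPEED_OF_SOUND * tau
```

An arrival-time difference of 10 ms therefore corresponds to a path difference of about 3.43 m.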
  • θA1 is the angle between the line from the sounding point P1 to point A and the x-axis of the rectangular coordinate system C
  • θB1 is the angle between the line from the sounding point P1 to point B and the x-axis of the rectangular coordinate system C
  • b1 is the angle between the line from P1 to point A and the line from P1 to point B.
  • θA1 and θB1 can be obtained by using the time differences and phases with which the wake-up speech reaches the different microphones, as shown in FIG. 6A; the calculation process is not repeated here.
  • x1 is an unknown number, which refers to dA1.
  • x 2 is an unknown number, which refers to dA2.
  • θAn is the angle between the line from the sounding point Pn to point A and the x-axis of the rectangular coordinate system C
  • θBn is the angle between the line from the sounding point Pn to point B and the x-axis of the rectangular coordinate system C
  • dAn is the distance from the sound point Pn to point A
  • dBn is the distance from the sound point Pn to point B.
  • x n is an unknown number, which refers to dAn.
  • A number of such equations can be set up simultaneously, and by applying solving algorithms (such as numerical solution methods), the optimal inter-device distances and inter-device angles can be obtained, from which the relative positions between the devices can be calculated, as exemplarily shown in FIG. 8C. That is to say, the side lengths calculated by the distance divergence attenuation formula are substituted into the above equations, the inter-device distances and inter-device angles are searched through the grid search method (grid search), and the least squares method is used to minimize the loss function over all the equations.
  • The device spacing and device angles that minimize the loss function are the precise relative device positions obtained by offline learning at night.
  • the optimal relative position between the devices can be obtained.
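A toy version of the grid search described above — here reduced to a single unknown, the spacing L between two devices placed on the x-axis — might look like the sketch below. Each per-utterance observation is (θA, θB, a), and the loss is the squared mismatch between the geometric distance difference and the measured one. The reduction to one unknown and all names are illustrative; the embodiment searches over all inter-device distances and angles.

```python
import math

def bearings_intersection(L, theta_a, theta_b):
    # Place A at the origin and B at (L, 0); intersect the two bearing rays.
    dax, day = math.cos(theta_a), math.sin(theta_a)
    dbx, dby = math.cos(theta_b), math.sin(theta_b)
    denom = dax * dby - day * dbx
    t = (L * dby) / denom
    return (t * dax, t * day)

def fit_baseline(observations, grid):
    """Grid-search the device spacing L that minimises the least-squares
    loss over all wake-up utterances. Each observation is a tuple
    (theta_a, theta_b, a) with a = dA - dB measured from the
    arrival-time difference of the same wake-up signal."""
    def loss(L):
        s = 0.0
        for theta_a, theta_b, a in observations:
            px, py = bearings_intersection(L, theta_a, theta_b)
            dA = math.hypot(px, py)
            dB = math.hypot(px - L, py)
            s += (dA - dB - a) ** 2
        return s
    return min(grid, key=loss)
```

With noiseless synthetic observations generated for L = 2, the search recovers L = 2 from the candidate grid; with real, noisy utterances the minimiser is only approximate and improves as more historical wake-up voices accumulate.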
  • As the accumulated historical wake-up voice of the user increases, the device keeps recording the voice data generated when the user calls "Xiaoyi", and can use the saved data for offline learning and training at night, achieving the effect of becoming smarter the more it is used, with the error calculated by the exhaustive search method growing smaller.
  • a device map training method is used to locate the relative positions of each device in a multi-device scenario.
  • the method flow specifically includes the following steps.
  • In steps 901a to 901c, the electronic device 10, the electronic device 20, and the electronic device 30 all collect ambient sounds in real time and convert the collected ambient sounds into audio.
  • the multi-microphone array of the electronic device 10 collects ambient sounds and converts the collected ambient sounds into audio
  • the multi-microphone array of the electronic device 20 collects ambient sounds and converts the collected ambient sounds into audio
  • the multi-microphone array of the device 30 collects ambient sound and converts the collected ambient sound into audio.
  • For example, the user sends out the wake-up voice "Xiaoyi Xiaoyi" facing the smart screen shown in Figure 1. Because the smart screen, smart speaker, and smart switch all collect the sound of the surrounding environment in real time and convert it into audio, the audio generated by the smart screen, smart speaker, and smart switch will all include the user's wake-up voice.
  • steps 902 a to 902 b the electronic device 10 and the electronic device 20 synchronize the generated audio to the electronic device 30 .
  • the electronic device 10 synchronizes the generated audio to the electronic device 30 ; the electronic device 20 synchronizes the generated audio to the electronic device 30 . It should be noted that the electronic device 10 and the electronic device 20 may synchronize the generated audio to the electronic device 30 at regular intervals, or both may synchronize the generated audio to the electronic device 30 at a fixed time point (eg, one o'clock in the morning). This application does not limit this.
  • this embodiment is described by taking a multi-device scenario including the electronic device 10 , the electronic device 20 , and the electronic device 30 as an example, and the electronic device 30 is the central device. In other possible embodiments, other electronic devices may also be included, or the central device may be other electronic devices, and so on. For other electronic devices, reference may also be made to the above-mentioned electronic devices 10 to 30, which will not be described one by one here. .
  • Step 903 the electronic device 30 analyzes the audio generated by the electronic device 10, the electronic device 20, and the electronic device 30 itself according to the method shown in FIG.
  • the calculation process of the relative positions between the multiple devices relies on the information of the historical wake-up voices.
  • The above method can continuously use the user's historical wake-up voice from the recent period to locate the devices. Even if a device's location changes during use, for example, the smart speaker is moved from the living room to the dining room, the latest relative positions between the devices can still be updated by accumulating the historical wake-up voice information over a period of time after the smart speaker is moved. To the user, the device appears to get smarter the longer it is used.
  • Moreover, this embodiment places no special requirement on the use environment of the sound pickup devices; that is to say, even in an environment with noise interference, the relative positions between multiple sound pickup devices can be reversely deduced through repeated recognition of the user's voice.
  • a multi-layer perception capability architecture of the device is constructed, as shown in FIG. 10, which specifically includes: a basic perception capability layer, a second-layer perception capability layer, and a high-level perception capability layer.
  • the basic perception capability layer refers to the capability after the existing functions of some devices or software are simply encapsulated by the intelligent perception framework.
  • For example, the face orientation capability provided by the chip layer, the sound source localization capability for multi-microphone devices, and the power connection/disconnection status monitoring capability provided by the Android layer.
  • the second-layer perception capability layer refers to the computing result obtained after processing the basic perception capability through calculation, and the capability after being encapsulated by the framework.
  • the device-to-device localization capability and the user's voice localization capability are all calculation results obtained by processing the sound localization data (basic perception capability) reported by different underlying devices.
  • the high-level perception capability layer refers to complex calculation results reported to the upper layer, obtained through complex computations, fusions, models, and rules. For example, the directional wake-up capability is an advanced capability computed by fusing multiple basic perception capabilities and second-layer perception capabilities.
  • a fence mechanism is also provided, in which a virtual fence encloses a virtual boundary and a corresponding event is triggered when the user crosses it.
  • the upper-layer application of the mobile phone can receive automatic notifications and warnings.
  • Fence thinking is the core of intelligent perception, and each ability will be triggered with a corresponding fence.
  • Capabilities and fences together form a platform that provides the upper layer with the ability to perceive user behavior, and upper-layer applications can receive event reports when users enter or exit specific fences.
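The fence mechanism — a virtual boundary whose crossing is reported to subscribed upper-layer applications — can be sketched as a simple publish/subscribe wrapper. The class and field names are illustrative, not the framework's actual API:

```python
class Fence:
    """A virtual boundary paired with a perception capability:
    upper-layer applications subscribe and are notified whenever the
    user enters or exits the fenced region."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate   # state -> bool: is the user "inside"?
        self.inside = False
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update(self, state):
        # fire a single event on each inside/outside transition
        now_inside = self.predicate(state)
        if now_inside != self.inside:
            event = "enter" if now_inside else "exit"
            self.inside = now_inside
            for cb in self.subscribers:
                cb(self.name, event)
```

For example, a "near the smart screen" fence defined by a distance threshold reports one enter event when the user approaches and one exit event when the user leaves.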
  • FIG. 11 which is a schematic diagram of software modules in a multi-device scenario provided by an embodiment of the present application
  • the modules in the slave device 1 to the slave device n and the master device can cooperate to implement the device map training method provided by the embodiment of the present application.
  • the master device refers to the above-mentioned central device for locating the device position, such as the electronic device 30, and the slave device refers to the above-mentioned device for collecting wake-up voice, such as the electronic device 10 and the electronic device 20.
  • each slave device includes an audio collection module 1101
  • the master device includes an audio collection module 1101 , an audio processing module 1102 , and an audio identification module 1103 .
  • the audio collection module 1101 is used to collect the sound in the environment by using a multi-microphone array, and convert the collected sound into audio.
  • the audio collection module 1101 of each device may send the audio corresponding to each sampling period to the audio processing module 1102 of the main device (eg, the electronic device 30 ) for processing.
  • the main device may be the smart screen device shown in FIG. 3 , or may be a device with strong computing power in a smart home scenario, such as a router or a smart speaker.
  • the audio processing module 1102 of the main device is used to preprocess the audio corresponding to each sampling period, such as channel conversion, smoothing, noise reduction, etc., so that the audio recognition module 1103 can perform subsequent wake-up speech detection.
  • the audio identification module 1103 of the main device is used to identify the wake-up signal in the preprocessed audio of each sampling period, and to identify the information of the same wake-up signal across the audio of different sampling periods, such as the arrival time of the audio in which the wake-up signal is located.
  • the audio recognition module 1103 sends the information identifying the same wake-up signal to the device map calculation module 1104 .
  • the device map calculation module 1104 is configured to calculate the relative positions between different devices according to the arrival time of the same wake-up voice to different devices and the arrival time of the wake-up signal to different microphones of the same device.
  • each slave device in the scenario shown in FIG. 12 may further include a user position positioning module 1105 , a face orientation recognition module 1106 and a directional wake-up module 1107 .
  • the user position positioning module 1105 is used to calculate, using the multi-microphone array in a single device, the arrival times of the wake-up voice currently sent by the user at that multi-microphone device, and to calculate the position of the user's voice based on the arrival times of the wake-up voice collected by at least two devices.
  • the face orientation recognition module 1106 is used to calculate the user's face orientation in the location map, within the range of the device's wide-angle camera, by using the face orientation recognition capability of the chip layer.
  • the directional wake-up module 1107 is used to obtain the target device to be woken up by using the device map calculated and processed by the device map calculation module 1104, the user position located by the user position location module 1105, and the face orientation.
  • FIG. 11 is only an example.
  • the electronic device of the embodiments of the present application may have more or less modules than the electronic device shown in the figure, two or more modules may be combined, and so on.
  • the various modules shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
  • the audio processing module 1102, the audio recognition module 1103, the device map calculation module 1104, the user position positioning module 1105, the face orientation recognition module 1106, and the directional wake-up module 1107 shown in FIG. 11 can be integrated into one or more of the processing units in the processor 210 shown in FIG. 2; for example, some or all of these modules may be integrated into one or more processors such as an application processor or a special-purpose processor. It should be noted that the dedicated processor in the embodiment of the present application may be a DSP, an application-specific integrated circuit (ASIC) chip, or the like.
  • FIG. 12 shows a device 1200 provided by the present application.
  • Device 1200 includes at least one processor 1210 , memory 1220 and transceiver 1230 .
  • the processor 1210 is coupled with the memory 1220 and the transceiver 1230.
  • the coupling in this embodiment of the present application is an indirect coupling or communication connection between devices, units, or modules, which may be in electrical, mechanical, or other forms, and is used for information exchange between the devices, units, or modules.
  • the connection medium between the transceiver 1230, the processor 1210, and the memory 1220 is not limited in the embodiments of the present application.
  • the memory 1220 , the processor 1210 , and the transceiver 1230 may be connected through a bus, and the bus may be divided into an address bus, a data bus, a control bus, and the like.
  • the memory 1220 is used to store program instructions.
  • the transceiver 1230 is used to receive or transmit data.
  • the processor 1210 is configured to invoke the program instructions stored in the memory 1220, so that the device 1200 executes the steps performed by the electronic device 30 or the steps performed by the electronic device 10 or the electronic device 20 in the above method.
  • the processor 1210 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can implement or execute each method, step, and logic block diagram disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • the memory 1220 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), etc., or a volatile memory (volatile memory), Such as random-access memory (random-access memory, RAM).
  • Memory is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • The memory in this embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data.
  • The device 1200 can be used to implement the methods shown in the embodiments of the present application; for the relevant features, reference may be made to the description above, which will not be repeated here.
  • Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another.
  • A storage medium can be any available medium that a computer can access.
  • Computer-readable media may include RAM, ROM, electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection can be properly termed a computer-readable medium.
  • As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital video disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Navigation (AREA)

Abstract

A voice wake-up method and an electronic device, relating to the field of terminal artificial intelligence. Using the ambient sound captured by each device, the method can, on one hand, locate the relative positions of a user and multiple devices in a space to construct a position map; on the other hand, the user's face orientation can be acquired by a master device among the multiple devices that has an image acquisition module. The device that the user intends to wake up can then be determined by combining the position map with the face orientation acquired by the master device. The method helps to improve the accuracy of device wake-up in a multi-device environment and performs relatively well in practice.

Description

Voice wake-up method and electronic device
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application No. 202011362525.1, entitled "Voice wake-up method and electronic device", filed with the China Patent Office on November 27, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a voice wake-up method and an electronic device.
Background
Currently, a user can wake up an electronic device by speaking a wake-up word, thereby enabling interaction between the user and the electronic device. Usually, the wake-up word is preset in the electronic device by the user, or is set before the electronic device leaves the factory. In a multi-device scenario (such as a smart home scenario), the user may set the same wake-up word for multiple devices for easier memorization; for example, the user sets the wake-up word of a smart screen, a smart speaker, and a smart switch all to "Xiaoyi Xiaoyi". As shown in FIG. 1, suppose the user only wants to wake up the smart screen; however, when the user says "Xiaoyi Xiaoyi", both the smart screen and the speaker are woken up and both return the voice response "I'm here", which confuses the user and degrades the user experience.
Summary of the Invention
The present application provides a voice wake-up method and an electronic device, which help to improve the accuracy of waking up an electronic device by voice in a multi-device scenario, thereby improving the user experience.
In a first aspect, an embodiment of the present application provides a voice wake-up method, which can be applied to a first electronic device and relates to the field of terminal artificial intelligence (AI). The method includes:
The first electronic device receives a voice wake-up instruction from a user; the first electronic device also acquires an image of the user and detects the orientation of the user's face; then, based on the relative positions of the first electronic device and at least one second electronic device, the user's position, and the orientation of the user's face, the first electronic device determines, from among the first electronic device and the at least one second electronic device, the target device that the user's face is oriented toward; finally, the first electronic device instructs the target device to respond to the voice wake-up instruction.
The first electronic device may have an image acquisition function, in which case it acquires the user image from its image acquisition module; alternatively, the first electronic device may lack an image acquisition function, in which case it acquires the user image from a second electronic device.
In this embodiment of the present application, the first electronic device can determine the device that the user intends to wake up from the relative positions of the first electronic device and the at least one second electronic device, together with the user's face orientation captured by a device. This method helps to improve the accuracy of device wake-up in multi-device scenarios and performs relatively well in practice.
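The decision described above can be sketched in a few lines. This is an illustrative reconstruction, not code from the patent: it assumes a 2D position map with device coordinates, a user position, and a face orientation expressed as an angle in degrees, and it treats a device as a candidate only when it lies within an assumed tolerance cone around the face direction.

```python
import math

def target_device(user_pos, face_dir_deg, devices, max_angle_deg=30.0):
    """Pick the device the user's face points at, or None.

    user_pos      -- (x, y) user position in the shared position map
    face_dir_deg  -- face orientation angle in degrees (0 deg = +x axis)
    devices       -- mapping of device name -> (x, y) position
    max_angle_deg -- tolerance cone half-angle (an assumed threshold)
    """
    best_name, best_off = None, max_angle_deg
    for name, (dx, dy) in devices.items():
        bearing = math.degrees(math.atan2(dy - user_pos[1], dx - user_pos[0]))
        # smallest absolute angle between the device bearing and the face direction
        off = abs((bearing - face_dir_deg + 180.0) % 360.0 - 180.0)
        if off < best_off:
            best_name, best_off = name, off
    return best_name
```

For instance, with `devices = {"tv": (3.0, 0.0), "speaker": (0.0, 3.0)}` and the user at the origin facing along +x, the function returns `"tv"`; facing a direction with no device inside the cone, it returns `None`.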
In a possible design, when it is determined that the number of candidate devices that the user's face is oriented toward is greater than or equal to two, the first electronic device needs to determine the relative distances between the user and the at least two candidate devices; it then determines the priority of each candidate device according to the relative distance, where a candidate device at a smaller relative distance has a smaller priority value; finally, the candidate device corresponding to the highest priority is determined to be the target device.
In this embodiment of the present application, building on directional wake-up, this helps to improve the accuracy of waking up the nearest device in multi-device scenarios.
In another possible design, when it is determined that the number of candidate devices that the user's face is oriented toward is greater than or equal to two, the first electronic device needs to determine the relative distances between the user and the at least two candidate devices; the candidate device corresponding to the smallest relative distance is then determined to be the target device.
In this embodiment of the present application, building on directional wake-up, this helps to improve the accuracy of waking up the nearest device in multi-device scenarios.
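The tie-breaking rule of this design — the nearest of several candidate devices wins — can be sketched as follows; the function name and the data layout are illustrative, not from the patent:

```python
import math

def pick_nearest(user_pos, candidates):
    """Among the candidate devices the user's face points at,
    return the name of the one at the smallest relative distance."""
    return min(candidates, key=lambda name: math.dist(user_pos, candidates[name]))
```

With `candidates = {"tv": (3.0, 0.0), "speaker": (1.0, 0.0)}` and the user at the origin, the speaker is selected as the target device.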
In a possible design, the first electronic device may acquire information of first audio from the first electronic device and acquire information of second audio from the at least one second electronic device, and then determine the user's position according to the information of the first audio and the information of the second audio.
In this embodiment of the present application, the sound captured by the multi-microphone arrays of the electronic devices can be used to effectively determine the user's position, ensuring the accuracy of the user positioning result.
In a possible design, the first electronic device includes a first microphone and a second microphone. The information of the first audio includes: a first arrival time at which the voice wake-up instruction reaches the first microphone, a second arrival time at which it reaches the second microphone, a first phase with which it reaches the first microphone, and a second phase with which it reaches the second microphone. The at least one second electronic device includes a third microphone and a fourth microphone. The information of the second audio includes: a third arrival time at which the voice wake-up instruction reaches the third microphone, a fourth arrival time at which it reaches the fourth microphone, a third phase with which it reaches the third microphone, and a fourth phase with which it reaches the fourth microphone.
In a possible design, the first electronic device may determine the user's position according to the information of the first audio and the information of the second audio, which specifically includes the following steps:
determining the relative distance between the user and the first electronic device according to the time difference between the first arrival time and the second arrival time, and determining the user's azimuth relative to the first electronic device according to the phase difference between the first phase and the second phase; determining the relative distance between the user and the at least one second electronic device according to the time difference between the second arrival time and the third arrival time, and determining the user's azimuth relative to the at least one second electronic device according to the phase difference between the third phase and the fourth phase; and determining the user's position according to the relative distance between the user and the first electronic device, the user's azimuth relative to the first electronic device, the relative distance between the user and the at least one second electronic device, and the user's azimuth relative to the at least one second electronic device.
In this embodiment of the present application, the sound captured by the multi-microphone arrays of the electronic devices can be used to effectively determine the user's position, ensuring the accuracy of the user positioning result.
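As an illustrative sketch of the two kinds of measurement used above, the azimuth of the sound source can be estimated from the inter-microphone phase difference of a narrowband component, and the extra path length from the inter-microphone arrival-time difference. The formulas below are the standard far-field two-microphone model and are an assumption of this sketch, not taken from the patent:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 degrees Celsius

def azimuth_from_phase(delta_phase_rad, freq_hz, mic_spacing_m):
    """Angle of arrival from the phase difference between two microphones,
    using sin(theta) = delta_phi * lambda / (2 * pi * d).
    Valid while the spacing is at most half a wavelength (no spatial aliasing)."""
    wavelength = SPEED_OF_SOUND / freq_hz
    s = delta_phase_rad * wavelength / (2 * math.pi * mic_spacing_m)
    s = max(-1.0, min(1.0, s))          # clamp numerical noise
    return math.degrees(math.asin(s))   # 0 deg = broadside to the mic pair

def path_difference(t1, t2):
    """Extra distance the wavefront travelled to the later microphone,
    from the two arrival times of the same wake-up utterance."""
    return SPEED_OF_SOUND * abs(t2 - t1)
```

For a 1 kHz component and a 5 cm microphone spacing, a zero phase difference yields a broadside source (0 degrees), and larger phase differences map to larger off-axis angles.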
In a possible design, the method further includes: the first electronic device acquires historical audio information from the first electronic device and the at least one second electronic device;
from the historical audio information, the first electronic device obtains the arrival times and phases with which the voice wake-up instructions issued by the user N times reached different electronic devices, where N is a positive integer; the first electronic device then determines, from these arrival times and phases, the relative azimuths and distance differences corresponding to the N voice wake-up instructions;
taking the relative azimuths and distance differences corresponding to the N voice wake-up instructions as observations, the first electronic device establishes an objective function; the first electronic device then solves the objective function by exhaustive search to obtain the relative positions of the first electronic device and the at least one second electronic device.
In this embodiment of the present application, the first electronic device can locate the relative positions of multiple devices in space according to the above method and construct a position map including the relative device positions. Even in an interference environment, the relative positions of multiple sound-pickup devices can thus be derived in reverse from multiple recognitions of the user's voice, and as the number of voice wake-up messages increases, the positioning between devices becomes increasingly accurate.
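The exhaustive search over the objective function can be illustrated with a simplified 2D version. Here the unknown is a single device position, each observation is a (user position, measured distance) pair recovered from one wake-up utterance, and the objective is the sum of squared distance residuals; the grid extent and step are arbitrary choices of this sketch, not parameters from the patent:

```python
import itertools
import math

def locate_device(observations, grid_step=0.25, extent=10.0):
    """Exhaustive grid search for a device position that best explains
    the observations. Each observation is (user_xy, measured_distance_m).
    Returns the (x, y) grid point minimizing the squared-error objective."""
    steps = int(extent / grid_step)
    best, best_cost = None, float("inf")
    for i, j in itertools.product(range(-steps, steps + 1), repeat=2):
        cand = (i * grid_step, j * grid_step)
        cost = sum(
            (math.dist(cand, user_xy) - d) ** 2
            for user_xy, d in observations
        )
        if cost < best_cost:
            best, best_cost = cand, cost
    return best
```

As the number of wake-up utterances N grows, the sum accumulates more residual terms, so the minimizer of the objective becomes better constrained — matching the remark above that positioning improves as wake-up messages accumulate.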
In a possible design, the first electronic device and the at least one second electronic device are connected to the same local area network, or the first electronic device and the at least one second electronic device are pre-bound to the same user account, or the first electronic device and the at least one second electronic device are bound to different user accounts between which a binding relationship has been established.
In a second aspect, an embodiment of the present application provides a voice wake-up method, which can be applied to a second electronic device. The method includes:
The second electronic device captures sound from the surrounding environment and converts it into second audio; the second electronic device then sends the second audio to the first electronic device and, upon detecting the wake-up word in the second audio, sends a wake-up message to the first electronic device. The first electronic device can determine the user's position according to the information of the second audio and the information of the first electronic device's own first audio, and, when it determines from the relative positions of the first electronic device and the at least one second electronic device, the user's position, and the orientation of the user's face that the target device is this second electronic device, it sends a wake-up response to the second electronic device. After receiving the wake-up response from the first electronic device, the second electronic device responds to the user's voice wake-up instruction.
In another possible case, if the first electronic device determines, from the relative positions of the first electronic device and the at least one second electronic device, the user's position, and the orientation of the user's face, that the target device is not this second electronic device, the first electronic device does not send a wake-up response to the second electronic device, or sends a response prohibiting wake-up, and the second electronic device does not respond to the user's voice wake-up instruction.
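The second device's side of this exchange can be sketched as a single step. The message names and the callback-style transport are placeholders, and the substring check stands in for a real wake-word detector:

```python
def second_device_step(audio, wake_word, send_to_master, wait_for_response):
    """Sketch of the second device's role in the wake-up protocol:
    forward the captured audio, report wake-word detection, and respond
    only if the master device elects this device as the target."""
    send_to_master("audio", audio)            # always share second audio
    if wake_word in audio:                    # stand-in for a real detector
        send_to_master("wake", wake_word)     # report the wake-up message
        resp = wait_for_response()
        return resp == "wake-response"        # True -> answer the user
    return False
```

The return value models whether this second device answers the user; a missing or prohibiting response from the first electronic device leaves the device silent.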
In a third aspect, the present application provides a voice wake-up system, including a first electronic device and at least one second electronic device. The first electronic device can implement the method of any possible implementation of the first aspect, and the at least one second electronic device can implement the method of any possible implementation of the second aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including one or more processors and a memory, where the memory stores program instructions that, when executed by the device, implement the above aspects of the embodiments of the present application and any possible design method involved in those aspects.
In a fifth aspect, an embodiment of the present application provides a chip system, where the chip system is coupled to a memory in an electronic device so that, when running, the chip system invokes the program instructions stored in the memory to implement the above aspects of the embodiments of the present application and any possible design method involved in those aspects.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium storing program instructions that, when run on an electronic device, cause the device to perform the above aspects of the embodiments of the present application and any possible design method involved in those aspects.
In a seventh aspect, an embodiment of the present application provides a computer program product that, when run on an electronic device, causes the electronic device to perform the above aspects of the embodiments of the present application and any possible design method involved in those aspects.
In addition, for the technical effects brought by any possible design of the fourth to seventh aspects, reference may be made to the technical effects of the corresponding designs in the method portion above, which will not be repeated here.
Description of Drawings
FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a mobile phone provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of another application scenario provided by an embodiment of the present application;
FIG. 4 is an interaction diagram of a voice wake-up method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a wake-up method provided by an embodiment of the present application;
FIG. 6A is a schematic diagram of a user positioning method provided by an embodiment of the present application;
FIG. 6B to FIG. 6D are schematic diagrams of another application scenario provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of another application scenario provided by an embodiment of the present application;
FIG. 8A is a schematic diagram of a device positioning method provided by an embodiment of the present application;
FIG. 8B is a schematic diagram of a wake-up speech analysis method provided by an embodiment of the present application;
FIG. 8C is a schematic diagram of a device map provided by an embodiment of the present application;
FIG. 9 is an interaction diagram of a device positioning method provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a set of perception capability layers according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of another device according to an embodiment of the present application.
Detailed Description
It should be understood that, unless otherwise specified in this application, "/" means "or"; for example, A/B may mean A or B. "And/or" in this application merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. "At least one" means one or more, and "multiple" means two or more.
In this application, "exemplary", "in some embodiments", "in other embodiments", and the like are used to mean serving as an example, instance, or illustration. Any embodiment or design described as an "example" in this application should not be construed as being preferred over or more advantageous than other embodiments or designs; rather, the word "example" is intended to present a concept in a concrete way.
In addition, the terms "first", "second", and the like in this application are used only for the purpose of distinguishing between descriptions, and should not be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or indicating or implying an order.
The electronic device in the embodiments of the present application is an electronic device with a voice wake-up function; that is, a user can wake up the electronic device by voice. Specifically, the user wakes up the electronic device by speaking a wake-up word. The wake-up word may be preset in the electronic device by the user according to the user's own needs, or may be set before the electronic device leaves the factory; the manner of setting the wake-up word is not limited in the embodiments of the present application. It should be noted that, in the embodiments of the present application, the user who wakes up the electronic device may be arbitrary or may be a specific user. For example, the specific user may be a user whose voice speaking the wake-up word has been stored in the electronic device in advance, such as the owner of the device.
Currently, an electronic device triggers wake-up by detecting whether audio includes the wake-up word. Specifically, when the audio includes the wake-up word, the electronic device is woken up; otherwise, it is not. After the electronic device is woken up, the user can interact with it by voice. For example, if the wake-up word is "Xiaoyi Xiaoyi", the electronic device is woken up when it detects that the audio includes "Xiaoyi Xiaoyi". The electronic device acquires the audio by capturing or receiving ambient sound through a multi-microphone array on the device. However, in a multi-device scenario (for example, a smart home scenario), the user's speech containing the wake-up word may be received or captured by multiple electronic devices, causing two or more electronic devices to be woken up, which confuses the user's voice interaction process and degrades the user experience.
At present, to solve the problem of devices being woken up by mistake in multi-device scenarios, the priority of each device is usually specified manually. Suppose it is specified in advance that the priority of the smart screen in FIG. 1 is higher than that of the smart speaker; then when both the smart screen and the smart speaker capture the user saying "Xiaoyi Xiaoyi", only the smart screen is woken up. Although this method can constrain the situation in which multiple devices are woken up by setting rules, the rules need to be set manually in advance, which is not intelligent enough, and the user can only adjust which device is woken up by manually modifying the rules, so the method is inflexible.
In addition, there is a related technique in which monitoring the orientation of a person's face triggers the wake-up of the device the face is oriented toward. Although this method can improve the flexibility and accuracy of human-computer interaction, most devices do not have an image acquisition function, so the practical effect is poor. For example, devices such as smart speakers and smart voice-controlled switches in smart home scenarios are subject to cost constraints and generally do not integrate an image acquisition module, so the method of directionally waking up a device based on face orientation cannot be applied to such devices.
Therefore, when applied to multi-device scenarios, existing voice wake-up methods still cannot effectively solve the problem of multiple devices being woken up simultaneously. In view of this, the embodiments of the present application provide a voice wake-up method that, on one hand, can locate the relative positions of the user and multiple devices in space and construct a position map, and on the other hand, can capture the user's face orientation by means of a master device among the multiple devices that has an image acquisition module. In this way, by combining the position map with the face orientation captured by the master device, the device that the user intends to wake up can be determined. This method helps to improve the accuracy of device wake-up in multi-device scenarios and performs relatively well in practice.
Hereinafter, an electronic device is taken as an example; FIG. 2 shows a schematic structural diagram of an electronic device 200.
The voice wake-up method provided by the embodiments of the present application can be applied to an electronic device. In some embodiments, the electronic device may be a portable terminal including functions such as a personal digital assistant and/or a music player, such as a mobile phone, a tablet computer, a wearable device with wireless communication capabilities (such as a smart watch), or a vehicle-mounted device. Exemplary embodiments of the portable terminal include, but are not limited to, portable terminals running the Harmony operating system or other operating systems. The portable terminal may also be, for example, a laptop computer with a touch-sensitive surface (such as a touch panel). It should also be understood that, in some other embodiments, the terminal may also be a desktop computer with a touch-sensitive surface (such as a touch panel).
FIG. 2 shows a schematic structural diagram of the electronic device 200.
电子设备200可包括处理器210、外部存储器接口220、内部存储器221、通用串行总线(universal serial bus,USB)接口230、充电管理模块240、电源管理模块241,电池242、天线1、天线2、移动通信模块250、无线通信模块260、音频模块270、扬声器270A、受话器270B、麦克风270C、耳机接口270D、传感器模块280、按键290、马达291、指示器292、摄像头293、显示屏294、以及用户标识模块(subscriber identification module,SIM)卡接口295等。其中传感器模块280可以包括压力传感器280A、陀螺仪传感器280B、气压传感器280C、磁传感器280D、加速度传感器280E、距离传感器280F、接近光传感器280G、指纹传感器280H、温度传感器280J、触摸传感器280K、环境光传感器280L、骨传导传感器280M等。The electronic device 200 may include a processor 210, an external memory interface 220, an internal memory 221, a universal serial bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2 , mobile communication module 250, wireless communication module 260, audio module 270, speaker 270A, receiver 270B, microphone 270C, headphone jack 270D, sensor module 280, buttons 290, motor 291, indicator 292, camera 293, display screen 294, and Subscriber identification module (subscriber identification module, SIM) card interface 295 and so on. The sensor module 280 may include a pressure sensor 280A, a gyroscope sensor 280B, an air pressure sensor 280C, a magnetic sensor 280D, an acceleration sensor 280E, a distance sensor 280F, a proximity light sensor 280G, a fingerprint sensor 280H, a temperature sensor 280J, a touch sensor 280K, and ambient light. Sensor 280L, bone conduction sensor 280M, etc.
It can be understood that the structure illustrated in this embodiment of the present application does not constitute a specific limitation on the electronic device 200. In other embodiments of the present application, the electronic device 200 may include more or fewer components than shown, combine some components, split some components, or have a different component arrangement. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices, or may be integrated in one or more processors.
The electronic device 200 implements a display function through the GPU, the display screen 294, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 294 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 210 may include one or more GPUs that execute program instructions to generate or change display information.
The electronic device 200 may implement a shooting function through the ISP, the camera 293, the video codec, the GPU, the display screen 294, the application processor, and the like.
The SIM card interface 295 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the electronic device 200 by inserting it into, or pulling it out of, the SIM card interface 295. The electronic device 200 may support one or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 295 can support a Nano SIM card, a Micro SIM card, a SIM card, and the like. Multiple cards may be inserted into the same SIM card interface 295 at the same time; the cards may be of the same type or of different types. The SIM card interface 295 is also compatible with different types of SIM cards, as well as with external memory cards. The electronic device 200 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 200 uses an eSIM, i.e., an embedded SIM card.
The wireless communication function of the electronic device 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, the baseband processor, and the like. The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 200 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In some other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 250 may provide wireless communication solutions, including 2G/3G/4G/5G, applied on the electronic device 200. The mobile communication module 250 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 250 may receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit the processed waves to the modem processor for demodulation. The mobile communication module 250 may also amplify a signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna 1. In some embodiments, at least some functional modules of the mobile communication module 250 may be provided in the processor 210. In some embodiments, at least some functional modules of the mobile communication module 250 and at least some modules of the processor 210 may be provided in the same device.
The wireless communication module 260 may provide wireless communication solutions applied on the electronic device 200, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology. The wireless communication module 260 may be one or more devices integrating at least one communication processing module. The wireless communication module 260 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 210. The wireless communication module 260 may also receive a to-be-sent signal from the processor 210, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation through the antenna 2.
In some embodiments, the antenna 1 of the electronic device 200 is coupled to the mobile communication module 250, and the antenna 2 is coupled to the wireless communication module 260, so that the electronic device 200 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology.
It can be understood that the components shown in FIG. 2 do not constitute a specific limitation on the electronic device 200. The electronic device 200 may also include more or fewer components than shown, combine some components, split some components, or have a different component arrangement. In addition, the combination/connection relationships between the components in FIG. 2 may also be adjusted and modified.
The software system of the electronic device may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present application take a layered architecture as an example, where the layered architecture may include the Harmony operating system or other operating systems. The voice wake-up method provided by the embodiments of the present application is applicable to a terminal integrated with any of the foregoing operating systems.
FIG. 2 above shows the hardware structure of an electronic device to which the embodiments of the present application are applicable. To solve the problems raised in the background, an embodiment of the present application provides a voice wake-up method. The method can use a location map that includes both the user's location and the devices' locations, together with the orientation of the user's face collected by a device, to accurately determine the device the user wants to wake up, thereby improving the accuracy of directional device wake-up results in a multi-device scenario.
For example, FIG. 3 is a schematic diagram of a multi-device scenario to which this embodiment of the present application is applicable. Specifically, in the multi-device scenario shown in FIG. 3, it is assumed that the electronic device 10, the electronic device 20, and the electronic device 30 are all sound pickup devices provided with multi-microphone arrays, and are all preset with the same wake-up word, for example, "Xiaoyi Xiaoyi". When the user says "Xiaoyi Xiaoyi", the electronic device 10, the electronic device 20, and the electronic device 30 can all collect or receive the voice. On the one hand, the electronic device 30 can use the wake-up voice to determine the user's position in a pre-trained device map; on the other hand, the electronic device 30 can use face information collected from images to determine the face orientation, and then, based on the face orientation and the location map that includes the user's position, determine which of the electronic device 10, the electronic device 20, and the electronic device 30 is the target device the face is oriented toward.
It should be noted that FIG. 3 is only an example of a multi-device scenario; this embodiment of the present application limits neither the number of electronic devices in the multi-device scenario nor the wake-up word preset in the electronic devices. In addition, it should be noted that, in other possible cases, the electronic device 30 may not collect the face image itself, but may instead obtain the face image collection result from another device (such as the electronic device 10 or the electronic device 20). The electronic device 30 may be a hub device with strong data processing capability, such as a smart speaker or a smart screen in a smart home scenario. For ease of description below, the electronic device 30 is described as having a face image collection function and being the hub device.
With reference to the multi-device scenario shown in FIG. 3, a voice wake-up method according to an embodiment of the present application is described in detail below. As shown in FIG. 4, the method specifically includes the following steps.
Steps 401a to 401c: the electronic device 10, the electronic device 20, and the electronic device 30 all collect ambient sound in real time and convert the collected ambient sound into audio.
Specifically, the multi-microphone array of the electronic device 10 collects ambient sound and converts it into audio, the multi-microphone array of the electronic device 20 collects ambient sound and converts it into audio, and the multi-microphone array of the electronic device 30 collects ambient sound and converts it into audio. At this time, if the user utters a wake-up voice, for example, "Xiaoyi Xiaoyi", the sound collected by the electronic device 10, the electronic device 20, and the electronic device 30 will include the user's wake-up voice.
Exemplarily, referring to FIG. 1, the user faces the smart screen shown in FIG. 1 and issues the voice wake-up command "Xiaoyi Xiaoyi". Since the smart screen, the smart speaker, and the smart switch all collect ambient sound in real time and convert it into audio, the sound collected by the smart screen, the smart speaker, and the smart switch will include the user's wake-up voice.
Steps 402a to 402c: the electronic device 10, the electronic device 20, and the electronic device 30 all perform wake-up word detection on the generated audio.
Specifically, the electronic device 10, the electronic device 20, and the electronic device 30 may each perform a one-dimensional convolution on the data within each sliding window of the audio collected by the device itself, and extract features of different frequency bands from the data. If an audio segment whose features are consistent with the user's preset voice features is recognized from the audio, the audio segment includes the wake-up word; otherwise, it does not include the wake-up word.
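The sliding-window detection described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the window length, hop, filter bank, and cosine-similarity threshold are all assumed values, and a practical detector would use a trained model rather than a fixed feature template.

```python
import numpy as np

def frame_windows(audio, win_len, hop):
    """Slice a 1-D audio signal into overlapping sliding windows."""
    n = 1 + max(0, (len(audio) - win_len) // hop)
    return np.stack([audio[i * hop : i * hop + win_len] for i in range(n)])

def band_features(window, filters):
    """One-dimensional convolution of a window with a bank of band filters;
    the per-filter RMS energy is the feature for that frequency band."""
    convs = [np.convolve(window, f, mode="valid") for f in filters]
    return np.array([np.sqrt(np.mean(c ** 2)) for c in convs])

def detect_wake_word(audio, filters, template, threshold=0.85,
                     win_len=4000, hop=1600):
    """Return True if any sliding window's band features match the enrolled
    wake-word template (cosine similarity above the threshold)."""
    for win in frame_windows(audio, win_len, hop):
        feats = band_features(win, filters)
        denom = np.linalg.norm(feats) * np.linalg.norm(template) + 1e-12
        if feats @ template / denom > threshold:
            return True
    return False
```

A negative result simply means no window's band-energy profile resembled the enrolled template, which is why quiet or unrelated audio does not trigger the wake-up flow.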
Steps 403a to 403b: when the electronic device 10 and the electronic device 20 detect the wake-up word, each sends a wake-up message to the electronic device 30, and carries information about the audio generated by the device itself along with the wake-up message.
Specifically, when the electronic device 10 detects the wake-up word, the electronic device 10 sends a first wake-up message and audio information to the electronic device 30, where the first wake-up message is used to request confirmation of whether to wake up the electronic device 10. When the electronic device 20 detects the wake-up word, the electronic device 20 sends a second wake-up message and audio information to the electronic device 30, where the second wake-up message is used to request confirmation of whether to wake up the electronic device 20. The audio information may include all of the audio data, or it may include information such as the arrival time and phase related to the voice wake-up command.
For example, the information of the first audio generated by the electronic device 30 includes: the first time of arrival at which the voice wake-up command reaches the first microphone, the second time of arrival at which the voice wake-up command reaches the second microphone, the first phase with which the voice wake-up command reaches the first microphone, and the second phase with which the voice wake-up command reaches the second microphone. It should be noted that the first arrival time refers to the earliest time at which the first microphone picks up the voice wake-up command, and the second arrival time refers to the earliest time at which the second microphone picks up the voice wake-up command.
The electronic device 20 includes a third microphone and a fourth microphone. The information of the second audio generated by the electronic device 20 includes: the third time of arrival at which the voice wake-up command reaches the third microphone, the fourth time of arrival at which the voice wake-up command reaches the fourth microphone, the third phase with which the voice wake-up command reaches the third microphone, and the fourth phase with which the voice wake-up command reaches the fourth microphone. It should be noted that the third arrival time refers to the earliest time at which the third microphone picks up the voice wake-up command, and the fourth arrival time refers to the earliest time at which the fourth microphone picks up the voice wake-up command.
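The shape of such a wake-up message can be illustrated with a hypothetical payload. The field and class names below are assumptions for illustration only; the embodiment specifies merely that, for each microphone, the arrival time and phase of the voice wake-up command (or alternatively the full audio data) are carried.

```python
from dataclasses import dataclass

@dataclass
class AudioInfo:
    """Per-device audio information sent along with a wake-up message.

    `toa[i]` is the earliest arrival time (seconds) and `phase[i]` the
    phase (radians) of the wake-up command at microphone i of the device.
    """
    device_id: str
    toa: list
    phase: list

@dataclass
class WakeupMessage:
    """Wake-up confirmation request sent to the hub (electronic device 30)."""
    device_id: str
    audio_info: AudioInfo
```

With two microphones per device, each message thus carries two arrival times and two phases, which is exactly the input the hub needs for the direction-angle and distance computations of step 404.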
It should be noted that the electronic device 10, the electronic device 20, and the electronic device 30 may further include other microphones; this embodiment of the present application does not limit the number of microphones, and the other microphones may also collect sound according to the method described above.
In addition, it should be noted that if the electronic device 10 or the electronic device 20 does not detect the wake-up word, it does not need to send a wake-up message to the electronic device 30; it only needs to send the audio information to the electronic device 30. These cases are not illustrated one by one in this embodiment.
Step 403c: when the electronic device 30 also detects the wake-up word, it also generates a third wake-up message; otherwise, it does not generate the third wake-up message. The third wake-up message is used to request confirmation of whether to wake up the electronic device 30.
Step 404: the electronic device 30 determines the user's relative position in the device map according to the pre-trained device map and the audio collected by any two of the electronic device 10, the electronic device 20, and the electronic device 30, thereby generating a location map that includes the user's position.
Information is synchronized between the electronic devices based on a multi-device interconnection technology (such as HiLink, a multi-device interconnection technology). Specifically, the electronic device 10, the electronic device 20, and the electronic device 30 may be connected to the same local area network to communicate with each other. Alternatively, the electronic device 10, the electronic device 20, and the electronic device 30 may be pre-bound to the same user account (such as a HUAWEI ID), or bound to different user accounts between which a binding relationship has been established (for example, pre-binding a family member's user account, i.e., authorizing one's own device to connect with the family member's devices), so as to ensure secure communication between the devices.
Exemplarily, as shown in FIG. 5, the first microphone and the second microphone of the electronic device 30 both collect sound, and the information of the first audio is recorded. On the one hand, the electronic device 30 can determine the direction angle between the electronic device 30 and the user according to the phase difference between the first phase with which the voice wake-up command reaches the first microphone of the electronic device 30 and the second phase with which it reaches the second microphone. Since the electronic device 30 obtains the information of the second audio from the electronic device 20, the electronic device 30 can likewise determine the direction angle between the electronic device 20 and the user according to the phase difference between the third phase and the fourth phase of the electronic device 20. On the other hand, the electronic device 30 can determine the first relative distance between the electronic device 30 and the user according to the time difference between the first arrival time and the second arrival time of the electronic device 30, and can determine the second relative distance between the electronic device 20 and the user according to the time difference between the third arrival time and the fourth arrival time of the electronic device 20. In this way, the user's position can be determined by combining the first azimuth angle, the second azimuth angle, the first relative distance, and the second relative distance.
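The direction-angle computation from a two-microphone phase difference can be sketched as follows, under the standard far-field assumption that the inter-microphone path difference equals d·sin(θ) = c·Δφ/(2πf). The signal frequency and microphone spacing are assumed inputs for illustration, not values given by the embodiment.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate at room temperature

def azimuth_from_phase(delta_phi, freq_hz, mic_spacing_m):
    """Direction angle (degrees) of the speaker relative to the broadside of
    a two-microphone array, from the phase difference delta_phi (radians)
    of the wake-up command at the two microphones."""
    x = SPEED_OF_SOUND * delta_phi / (2 * math.pi * freq_hz * mic_spacing_m)
    x = max(-1.0, min(1.0, x))  # clamp numerical drift outside [-1, 1]
    return math.degrees(math.asin(x))
```

A zero phase difference corresponds to a source directly broadside to the array, and the maximum usable phase difference is bounded by the microphone spacing (spatial aliasing occurs when d exceeds half a wavelength), which is one reason practical systems combine several microphone pairs.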
Exemplarily, as shown in FIG. 6A, point A indicates the position of the electronic device 20 in the pre-trained device map, and point B indicates the position of the electronic device 30 in the pre-trained device map. θA is the first azimuth angle of the electronic device 20 relative to the user, θB is the second azimuth angle of the electronic device 30 relative to the user, PA is the second relative distance between the electronic device 20 and the user, and PB is the first relative distance between the electronic device 30 and the user. As can be seen from the figure, the intersection point P of the two azimuth rays θA and θB is the position where the user utters the wake-up voice.
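The intersection of the two azimuth rays in FIG. 6A reduces to a small linear system. In the sketch below, the angle convention (measured counter-clockwise from the map's x-axis) is an assumption for illustration; the embodiment does not fix a convention.

```python
import math

def locate_user(pos_a, theta_a_deg, pos_b, theta_b_deg):
    """Intersection point P of the azimuth rays drawn from device A and
    device B toward the user, as in FIG. 6A. Solves A + t*u = B + s*v
    for t via Cramer's rule, where u and v are the ray directions."""
    ax, ay = pos_a
    bx, by = pos_b
    ux, uy = math.cos(math.radians(theta_a_deg)), math.sin(math.radians(theta_a_deg))
    vx, vy = math.cos(math.radians(theta_b_deg)), math.sin(math.radians(theta_b_deg))
    det = ux * (-vy) - uy * (-vx)  # determinant of the 2x2 system [u, -v]
    if abs(det) < 1e-9:
        raise ValueError("rays are parallel; no unique intersection")
    t = ((bx - ax) * (-vy) - (by - ay) * (-vx)) / det
    return (ax + t * ux, ay + t * uy)
```

For example, two devices at (0, 0) and (2, 0) whose azimuth rays point at 45° and 135° respectively place the user at (1, 1); the relative distances PA and PB then follow directly as the Euclidean distances from A and B to P.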
Similarly, according to the above method, the electronic device 30 can also determine the direction angle between the electronic device 10 and the user according to the phase difference with which the voice wake-up command reaches different microphones of the electronic device 10.
It should be noted that, in this embodiment, a TDoA algorithm (a positioning method based on time differences of arrival) or the MUSIC algorithm (a sound source localization method) may specifically be used to calculate the position of the speaking user and the distance to each device, which is not limited in this embodiment of the present application. The user position determined in this embodiment of the present application refers to a relative position; for example, the user to be located is due south of the electronic device 30 at a distance of 1 meter.
Step 405: the electronic device 30 collects a user image and detects the face orientation.
For example, the user faces the smart screen in FIG. 1 and utters the wake-up voice "Xiaoyi Xiaoyi", and the camera on the smart screen takes a photo or video of the user. By analyzing the face image, the smart screen determines that the user's face is facing the smart screen. For another example, the user faces the smart speaker next to the smart screen in FIG. 1 and utters the wake-up voice "Xiaoyi Xiaoyi", and the camera on the smart screen takes a photo or video of the user. By analyzing the face image, the smart screen determines that the user's face is oriented at a first azimuth angle (for example, the first azimuth angle is to the user's front left).
It should be noted that this embodiment is described by taking the smart screen as the master control device (or hub device). Since the smart screen has an image collection function, the user image collected by the smart screen is preferentially used for face orientation detection. If the master control device (or hub device) does not have an image collection function, a user image may also be obtained from another electronic device that has a face collection function, and the analysis is then performed based on the obtained user image. Examples are not listed one by one here.
Step 406: the electronic device 30 determines, according to the face orientation and the location map that includes the user's position, that the target device the user's face is oriented toward is the electronic device 10.
Exemplarily, the location map determined by the electronic device 30 for the multi-device scenario shown in FIG. 3 is shown in FIG. 6B. In this figure, at the user's position, the field of view corresponding to the user's face orientation is the range indicated by the angle θ, and the electronic device 10 lies within this range.
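Whether a device falls within the field-of-view range θ of FIG. 6B can be checked as follows. The total cone angle of 60° is an assumed default for illustration, as the embodiment does not specify a value for θ.

```python
import math

def in_view(user_pos, face_dir_deg, device_pos, fov_deg=60.0):
    """True if the device lies inside the field-of-view cone of total angle
    fov_deg (the theta of FIG. 6B) centred on the user's face direction.
    Angles are degrees counter-clockwise from the map's x-axis."""
    dx = device_pos[0] - user_pos[0]
    dy = device_pos[1] - user_pos[1]
    bearing = math.degrees(math.atan2(dy, dx))
    # signed angular difference, wrapped into (-180, 180]
    diff = (bearing - face_dir_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0
```

The wrap-around step matters: a face direction of 170° and a device bearing of -174° differ by only 16°, not 344°, so the device is correctly reported as in view.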
Step 407a: when the electronic device 30 determines that the target device is the electronic device 10, it sends a wake-up permission instruction to the electronic device 10 to instruct the electronic device 10 to respond to the user's wake-up voice.
In a possible embodiment, the method may further include the following steps. Step 407b: when determining that the target device is the electronic device 10, the electronic device 30 may also indicate that the electronic device 30 itself is not to be woken up. Step 407c: the electronic device 30 sends a wake-up prohibition instruction to the electronic device 20, where the wake-up prohibition instruction is used to instruct the electronic device 20 not to respond to the user's wake-up voice.
Alternatively, in another possible embodiment, when determining that the target device is the electronic device 10, the electronic device 30 may simply send no response instruction to the electronic device 30 itself or to the electronic device 20; in this way, neither the electronic device 30 nor the electronic device 20 responds to the user's wake-up voice.
In summary, a first electronic device (such as the electronic device 30) receives the user's voice wake-up command; the first electronic device obtains a user image locally or from another device and detects the user's face orientation; then, according to the relative positions of the first electronic device and at least one second electronic device (such as the electronic device 10 and the electronic device 20), the user's position (such as point P), and the user's face orientation, it determines, from among the first electronic device and the at least one second electronic device, the target device the user's face is oriented toward (for example, the electronic device 10), and instructs the target device to respond to the voice wake-up command. It can be seen that the above method can effectively mitigate the problem of one call triggering multiple responses, so that when the user issues a wake-up command while facing the electronic device 10, the electronic device 10 makes a voice interaction response while the other devices do not respond.
In this embodiment, the electronic devices in the multi-device scenario are not required to have an image collection function. That is, as long as one device in the multi-device scenario has an image collection function and can collect a face image, the device the user is currently facing can be determined by combining the location map and the face image, and the device being faced need not have an image collection function at all. Therefore, to a certain extent, the method has wider application scenarios.
In addition, in this embodiment, when the user moves to another position in the scene and utters the wake-up voice, the user's current position can still be re-determined according to the method shown in steps 401a to 404 above, so that, in combination with the collected face image, the user's position after moving can still be accurately located. Exemplarily, when the user walks indoors to the vicinity of the smart speaker, the electronic device 30 can, according to the above method, locate that the user's position has moved from position P1 to position P2. It can be seen that this method can locate the current position of the speaking user in real time. Further, by combining the currently collected face image and the user's current position P2, the target device can be determined to be the electronic device 20 shown in FIG. 6C. It can be seen that, according to the above method, the device to be woken up can always be located directionally, regardless of whether the user's position changes.
Furthermore, in a possible embodiment, if combining the user's position and face orientation yields two or more candidate devices that the user's face may be facing, the target device can be further determined from the relative distances between the candidate devices and the user's position: the candidate device with the shortest relative distance is selected as the target device. That is, the electronic device 30 computes the relative distance between each candidate device within the face-orientation range and the user's position, and assigns wake-up priorities by distance: the smaller the relative distance, the higher the wake-up priority; conversely, the larger the relative distance, the lower the priority. The electronic device 30 can then directionally wake up the candidate device with the highest priority. For example, as shown in FIG. 6D, when the user walks toward the smart speaker indoors, the electronic device 30 locates the user's move from P1 to P2 by the above method and, combining the currently captured face image with the current position P2, determines that the candidate devices within the field of view corresponding to the face orientation are the electronic device 20 (the smart speaker in the figure) and the electronic device 40 (the smart alarm clock in the figure). The electronic device 30 determines that the relative distance between the electronic device 20 and P2 is D2 and that the relative distance between the electronic device 40 and P2 is D1. Since D1 is greater than D2, the electronic device 20 has the higher wake-up priority, so the electronic device 30 determines that the electronic device 20 is the target device to be woken up.
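The distance-based priority rule above can be sketched as follows (a minimal illustration; the device names and coordinates are hypothetical, not taken from the embodiment):

```python
import math

def pick_target(user_pos, candidates):
    """Among the candidate devices within the user's facing range, pick the
    one with the shortest relative distance (highest wake-up priority)."""
    def dist(device):
        (x, y) = device["pos"]
        (ux, uy) = user_pos
        return math.hypot(x - ux, y - uy)
    return min(candidates, key=dist)

# User at P2; the smart speaker (device 20) is nearer than the alarm clock (device 40).
p2 = (0.0, 0.0)
candidates = [
    {"name": "device20", "pos": (1.0, 0.5)},   # D2 ≈ 1.12 m
    {"name": "device40", "pos": (3.0, 2.0)},   # D1 ≈ 3.61 m
]
target = pick_target(p2, candidates)
print(target["name"])  # device20
```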
It can be seen that the above method relies on a pre-trained device map: a map of the relative positions of the devices in the multi-device scenario must be constructed first before the user's relative position within it can be determined. To this end, an embodiment of this application provides a method for training the device map, which performs voice analysis on the user's historical wake-up utterances and from them derives, in reverse, the relative positions of the multiple sound pickup devices. The construction of the device map is illustrated below with reference to FIG. 7 to FIG. 9.
For example, FIG. 7 is a schematic diagram of a multi-device scenario provided by an embodiment of this application. Several sound pickup devices are deployed in this scenario. The main device with an image capture module is the smart screen device 71 in the figure, and sound pickup devices 72a to 72f are also deployed in the space where the smart screen device 71 is located. It should be noted that the figure depicts the pickup devices 72a to 72f as speakers; in an actual home scenario, the speakers could equally be smart alarm clocks, smart cameras, smart switches and so on, which are not illustrated one by one here. The figure simply uses speakers to represent the pickup devices around the smart screen device 71.
In the smart home scenario shown in FIG. 7, the user may wake up any pickup device from any position in the scene. For example, the user may sit on the sofa and call "Xiaoyi Xiaoyi" to wake up the smart screen device, or face the pickup device 72a (such as a speaker) and call "Xiaoyi Xiaoyi" to wake it up. By analogy, in the course of use each pickup device records the user's historical wake-up utterances and synchronizes them to the hub device, which may be the smart screen device shown in FIG. 7, or a smart speaker, a router, or the like.
As shown in FIG. 8A, assume that the position from which the user calls "Xiaoyi Xiaoyi" the first time is sounding point P1, and the position of the second call is sounding point P2. Of course, during use the user may also call other pickup devices from other positions; here, two wake-up utterances that wake the pickup device 72a and the pickup device 72b respectively are taken as an example. The pickup devices 72a and 72b convert the collected sound into audio and synchronize it to the hub device, which can perform voiceprint analysis on the audio collected by each of them. Specifically, the pickup devices 72a and 72b can apply a one-dimensional convolution to the data in each sliding window of the audio they collect, extracting voiceprint features of different frequency bands. When an audio segment whose voiceprint features match those of the preset wake-up voice is identified in the audio, that segment contains the wake-up word uttered by the user; otherwise, it does not. As shown in FIG. 8B, this voice detection yields the arrival time t1 at which the wake-up voice uttered at P1 reaches the pickup device 72a, and the arrival time t2 at which it reaches the pickup device 72b. The audio voiceprints within the interval Δt shown in the figure are identical, i.e. they belong to the same user's wake-up utterance, giving the arrival time difference Δτ1 = t2 − t1.
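One common way to obtain an arrival-time difference such as Δτ1 = t2 − t1 between two recordings of the same utterance is to cross-correlate the two signals and take the best-aligning lag (a simplified sketch; the embodiment matches the segments by voiceprint features, and the sampling rate below is an assumption):

```python
def tdoa_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) at which sig_b best aligns with sig_a.
    A positive lag means the sound reached device B later than device A."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += x * sig_b[j]
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag

# A short pulse arriving 3 samples later at device B.
pulse = [0.0, 1.0, 0.5, -0.3, 0.0]
sig_a = pulse + [0.0] * 10
sig_b = [0.0] * 3 + pulse + [0.0] * 7
lag = tdoa_samples(sig_a, sig_b, max_lag=5)
fs = 16000  # sampling rate in Hz (assumed)
delta_tau = lag / fs
print(lag, delta_tau)  # 3 samples -> 187.5 microseconds
```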
In a possible embodiment, in some special cases multiple users may utter the same wake-up voice at the same time — for example, a father and a child both calling "Xiaoyi Xiaoyi" in the living room. The audio segments corresponding to the wake-up voice in that period can still be detected as above, but because multiple voiceprints appear in that period and the audio overlaps, the phase of the wake-up voice cannot be determined accurately. The hub device can therefore further screen the obtained audio, filtering out audio in which such overlap occurs within the same period, so as to improve the accuracy of the device-map calculation.
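The screening step can be sketched as discarding any detected wake-word segment whose time span overlaps a segment attributed to a different speaker (a minimal illustration; the speaker labels here stand in for the voiceprint features of the embodiment):

```python
def filter_overlapping(segments):
    """Keep only wake-word segments that overlap no segment from a
    different speaker; temporally overlapping audio is discarded."""
    kept = []
    for i, s in enumerate(segments):
        clash = any(
            o["speaker"] != s["speaker"]
            and s["start"] < o["end"] and o["start"] < s["end"]
            for j, o in enumerate(segments) if j != i
        )
        if not clash:
            kept.append(s)
    return kept

segments = [
    {"speaker": "dad",   "start": 0.0, "end": 1.2},  # overlaps the child's call
    {"speaker": "child", "start": 0.5, "end": 1.6},  # overlaps the father's call
    {"speaker": "dad",   "start": 5.0, "end": 6.1},  # clean utterance, kept
]
print([s["start"] for s in filter_overlapping(segments)])  # [5.0]
```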
Specifically, a polar coordinate system A is established at the pickup device 72a, a polar coordinate system B is established at the pickup device 72b, and a rectangular coordinate system C is established with the line through the two points A and B as the x-axis. As shown in FIG. 8A, let φA denote the angle between the polar coordinate system A and the rectangular coordinate system C; θA1 the angle between the line from sounding point P1 to point A and the x-axis of C; θB1 the angle between the line from P1 to point B and the x-axis of C; dA1 the distance from P1 to A; and dB1 the distance from P1 to B. Similarly, let φB denote the angle between the polar coordinate system B and the rectangular coordinate system C; θA2 the angle between the line from sounding point P2 to A and the x-axis of C; θB2 the angle between the line from P2 to B and the x-axis of C; dA2 the distance from P2 to A; and dB2 the distance from P2 to B.
In this embodiment, using the time difference Δτ1 between the arrivals of the wake-up signal uttered at P1 at polar coordinate system A and polar coordinate system B (i.e. at the pickup devices 72a and 72b), the distance difference dA1 − dB1 can be computed:

dA1 − dB1 = Δτ1 · v = a1    (Formula 1)

where dA1 is the distance from sounding point P1 to point A; dB1 is the distance from P1 to point B; Δτ1 is the difference between the arrival times of the user's wake-up signal at the two different electronic devices; v is the speed of sound; and a1 is the distance difference dA1 − dB1, i.e. the difference between the distances from the user's sounding position to the two devices.
In addition, using the phase of the wake-up signal uttered at P1 as measured at polar coordinate system A, and likewise the phase measured at polar coordinate system B, the following holds:

180° − θA1 − θB1 = b1    (Formula 2)

where θA1 is the angle between the line from P1 to A and the x-axis of the rectangular coordinate system C; θB1 is the angle between the line from P1 to B and the x-axis of C; and b1 is the angle at P1 between the line from P1 to A and the line from P1 to B.
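Formulas 1 and 2 can be evaluated directly; the numbers below are illustrative, not taken from the embodiment:

```python
V_SOUND = 343.0  # speed of sound in air, m/s (approximate, at about 20 °C)

def distance_difference(delta_tau):
    """Formula 1: a1 = Δτ1 · v."""
    return delta_tau * V_SOUND

def included_angle_deg(theta_a1, theta_b1):
    """Formula 2: b1 = 180° − θA1 − θB1 (the angle at P1 in triangle A-P1-B)."""
    return 180.0 - theta_a1 - theta_b1

a1 = distance_difference(0.002)      # a 2 ms arrival-time difference
b1 = included_angle_deg(50.0, 60.0)  # bearings of P1 as seen from A and B
print(a1, b1)  # 0.686 70.0
```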
Specifically, θA1 and θB1 can be obtained from the time differences and phases with which the wake-up voice reaches the different microphones of a device, as shown in FIG. 6A; the calculation is not repeated here.
In addition, the difference between φA and φB is an unknown constant, which characterizes the angle between the pickup orientation corresponding to the device 72a and that corresponding to the device 72b. That is, φA and φB satisfy:

φA − φB = c    (Formula 3)

where c is the unknown constant.
Furthermore, applying the law of cosines to the triangle formed by A, B and P1 gives:

dAB² = dA1² + dB1² − 2·dA1·dB1·cos(b1)    (Formula 4)

where dAB is the (unknown) distance between points A and B. Substituting dB1 = dA1 − a1 from Formula 1, Formula 4 can be further derived into the following Formula 5:

dAB² = dA1² + (dA1 − a1)² − 2·dA1·(dA1 − a1)·cos(b1)    (Formula 5)
Combining Formula 1, Formula 2, Formula 3 and Formula 5, and using letters to abbreviate the constant terms (a1 and b1 are known from the measurements, while dAB is unknown), further simplification with Formula 2 and Formula 3 yields the following Formula 6:

2·(1 − cos b1)·x1² − 2·a1·(1 − cos b1)·x1 + a1² = dAB²    (Formula 6)

where x1 is an unknown denoting dA1.
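Given a candidate device distance dAB, Formula 6 is a quadratic in x1 = dA1 and can be solved in closed form. The sketch below verifies this against a synthetic geometry (in the embodiment dAB itself is still unknown and is searched for, as described later):

```python
import math

def solve_da1(d_ab, a1, b1_deg):
    """Solve Formula 6, 2(1-cos b1)x^2 - 2*a1*(1-cos b1)x + a1^2 = d_ab^2,
    for x = dA1; returns the positive root(s)."""
    k = 1.0 - math.cos(math.radians(b1_deg))
    A, B, C = 2.0 * k, -2.0 * a1 * k, a1 * a1 - d_ab * d_ab
    disc = B * B - 4.0 * A * C
    if disc < 0:
        return []  # inconsistent measurements
    r = math.sqrt(disc)
    return sorted(x for x in ((-B - r) / (2 * A), (-B + r) / (2 * A)) if x > 0)

# Forward check with a synthetic geometry: A=(0,0), B=(2,0), P1=(0.5,1.0).
dA1 = math.hypot(0.5, 1.0)
dB1 = math.hypot(0.5 - 2.0, 1.0)
a1 = dA1 - dB1
cos_b1 = ((0 - 0.5) * (2 - 0.5) + (0 - 1.0) * (0 - 1.0)) / (dA1 * dB1)
b1 = math.degrees(math.acos(cos_b1))
roots = solve_da1(2.0, a1, b1)
print(round(roots[0], 4), round(dA1, 4))  # both ≈ 1.118
```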
By analogy, since the position from which the user calls "Xiaoyi Xiaoyi" the second time is the sounding point P2 — with θA2 the angle between the line from P2 to A and the x-axis of C, θB2 the angle between the line from P2 to B and the x-axis of C, dA2 the distance from P2 to A, and dB2 the distance from P2 to B — the same calculation yields the corresponding simultaneous equations, and likewise:

2·(1 − cos b2)·x2² − 2·a2·(1 − cos b2)·x2 + a2² = dAB²

where x2 is an unknown denoting dA2, and a2 and b2 are the distance difference and included angle obtained for P2 as in Formulas 1 and 2.
And so on: since the position of the user's n-th call of "Xiaoyi Xiaoyi" is the sounding point Pn — with θAn the angle between the line from Pn to A and the x-axis of C, θBn the angle between the line from Pn to B and the x-axis of C, dAn the distance from Pn to A, and dBn the distance from Pn to B — the same calculation yields:

2·(1 − cos bn)·xn² − 2·an·(1 − cos bn)·xn + an² = dAB²

where xn is an unknown denoting dAn.
That is to say, from the user's historical wake-up utterances, the above method yields, after simplifying the constants with letters, the following system of simultaneous equations:

2·(1 − cos bi)·xi² − 2·ai·(1 − cos bi)·xi + ai² = dAB²,  i = 1, 2, …, n

in the unknowns x1, …, xn and dAB (together with the orientation constant c of Formula 3).
In this way, from the analysis results of a large number of historical wake-up utterances, many such equations can be set up simultaneously and solved algorithmically (for example, numerically) to obtain the optimal inter-device distance and inter-device angle, and thereby compute the relative positions between the devices, exemplarily as shown in FIG. 8C. That is, the side length computed from the distance-attenuation formula is substituted into the equations above, a grid search is used to traverse candidate device distances and inter-device angles, and the least-squares criterion selects the device distance and device angle that minimize the loss function over all equations; this minimizing pair is the precise relative device position obtained by offline learning at night. Through this exhaustive grid search, the optimal relative positions between devices are obtained. As the accumulated historical wake-up utterances grow, the devices keep recording the log data from each time the user calls Xiaoyi, and can use the saved data at night for offline learning and training — the more the system is used, the smarter it gets, and the smaller the error of the exhaustive search becomes.
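The night-time offline training step can be sketched as a grid search over the unknown device distance, scoring each candidate by the least-squares residual of the per-utterance constraints (a heavily simplified, hypothetical sketch: the actual training also searches the inter-device orientation angle and estimates each xi from distance attenuation):

```python
import math

def residual(d_ab, a, b_deg, x):
    """Residual of one per-utterance constraint (Formula 6 form)."""
    k = 1.0 - math.cos(math.radians(b_deg))
    return 2 * k * x * x - 2 * a * k * x + a * a - d_ab * d_ab

def grid_search_distance(observations, lo=0.1, hi=10.0, step=0.001):
    """Pick the device distance d_AB minimizing the least-squares loss
    over all historical wake-up utterances."""
    best_d, best_loss = lo, float("inf")
    d = lo
    while d <= hi:
        loss = sum(residual(d, a, b, x) ** 2 for (a, b, x) in observations)
        if loss < best_loss:
            best_loss, best_d = loss, d
        d += step
    return best_d

# Synthetic observations generated from a true device distance of 2.0 m:
# each tuple is (a_i, b_i in degrees, x_i = dA_i).
obs = []
for px, py in [(0.5, 1.0), (1.5, 0.8), (-0.3, 1.2)]:
    dA = math.hypot(px, py)
    dB = math.hypot(px - 2.0, py)
    cos_b = ((-px) * (2.0 - px) + py * py) / (dA * dB)
    obs.append((dA - dB, math.degrees(math.acos(cos_b)), dA))
d_hat = grid_search_distance(obs)
print(round(d_hat, 2))  # ≈ 2.0
```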
To sum up, as shown in FIG. 9, a device-map training method according to an embodiment of this application is used to locate the relative positions of the devices in a multi-device scenario. As shown in FIG. 9, the method specifically includes the following steps.
Steps 901a to 901c: the electronic device 10, the electronic device 20 and the electronic device 30 each collect ambient sound in real time and convert the collected sound into audio.
Specifically, the multi-microphone array of the electronic device 10 collects ambient sound and converts it into audio, and the electronic devices 20 and 30 do the same with their multi-microphone arrays. If the user utters a wake-up voice at this time — for example, the voice command "Xiaoyi Xiaoyi" — the sound collected by the electronic devices will include the user's wake-up voice.
For example, when the user faces the smart screen shown in FIG. 1 and utters the wake-up voice "Xiaoyi Xiaoyi", the smart screen, the smart speaker and the smart switch all collect the ambient sound in real time and convert it into audio, so the sound collected by each of them will include the user's wake-up voice.
Steps 902a to 902b: the electronic device 10 and the electronic device 20 synchronize the generated audio to the electronic device 30.
That is, the electronic device 10 synchronizes its generated audio to the electronic device 30, and the electronic device 20 does likewise. It should be noted that the electronic devices 10 and 20 may synchronize the generated audio to the electronic device 30 at a fixed interval, or both at a fixed time point (for example, one o'clock in the morning); this application does not limit this.
In addition, this embodiment is described taking as an example a multi-device scenario that includes the electronic devices 10, 20 and 30, with the electronic device 30 as the hub device. Other possible embodiments may include other electronic devices, or another electronic device may serve as the hub; by analogy, such other devices can be understood with reference to the electronic devices 10 to 30 above and are not described one by one here.
Step 903: the electronic device 30 analyzes the audio generated by the electronic devices 10 and 20 and by itself according to the method shown in FIG. 8A above, sets up the simultaneous equations, and finally solves them to obtain the relative position map of the electronic devices.
In this embodiment, the calculation of the relative positions between multiple devices relies on the information in historical wake-up utterances: from the arrival-time differences of the same wake-up voice at multiple devices, and the arrival-time differences of the wake-up voice at different microphones of the same device, the relative positions of the devices are obtained. Because the method keeps using the historical wake-up utterances of the most recent period to locate the devices, the latest relative positions can still be obtained even if a device's position changes during use. For example, if a smart speaker is moved from the living room to the dining room, accumulating the historical wake-up utterances from the period after the move allows the map to be updated to the devices' latest relative positions. In other words, localization by the above method can be updated in real time to the current positions of the devices; to the user, the device simply seems to get smarter the longer it is used.
In addition, this embodiment places no requirement on the operating environment of the pickup devices; even in an environment with noise interference, the relative positions of the multiple pickup devices can still be derived in reverse from multiple recognitions of the user's voice.
To implement the above device-map training process, in the embodiments of this application a multi-layer perception capability stack is constructed in each device of the multi-device scenario, as shown in FIG. 10, specifically including a basic perception capability layer, a second-level perception capability layer and a high-level perception capability layer.
The basic perception capability layer refers to existing device or software functions after a thin wrapping by the intelligent perception framework, such as the face-orientation capability provided by the chip layer, the sound-source localization capability of multi-microphone devices, and the incoming/outgoing call state monitoring capability provided by the Android layer.
The second-level perception capability layer refers to capabilities obtained by computing on the basic perception capabilities and then wrapping the results through the framework. For example, the device-to-device positioning capability and the user-utterance localization capability here are both computation results obtained by processing the sound localization data (a basic perception capability) reported by the different underlying devices.
The high-level perception capability layer refers to complex computation results reported to upper layers, produced through elaborate computation, fusion, models and rules. The directional wake-up capability, for example, is an advanced capability computed by fusing and combining multiple basic and second-level perception capabilities.
In addition, in its software implementation this application also provides a fence mechanism, i.e. a virtual fence enclosing a virtual boundary. When the phone enters or leaves a particular geographic area, performs a particular action, and so on, upper-layer applications on the phone can receive automatic notifications and alerts. The fence idea is the core of intelligent perception: every capability is paired with a corresponding fence that triggers it. Capabilities plus fences form a platform that gives upper layers the ability to perceive user behavior, with upper-layer applications receiving reports whenever the user enters or exits a particular event.
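The fence idea — a capability paired with a trigger boundary that reports enter/exit transitions upward — can be sketched as follows (hypothetical names, not the framework's actual API):

```python
class Fence:
    """A virtual boundary paired with a perception capability: upper-layer
    code subscribes once and is called back on enter/exit transitions."""

    def __init__(self, predicate):
        self.predicate = predicate  # maps a perception sample to inside/outside
        self.inside = False
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def feed(self, sample):
        now_inside = self.predicate(sample)
        if now_inside != self.inside:
            event = "enter" if now_inside else "exit"
            for cb in self.listeners:
                cb(event, sample)
            self.inside = now_inside

events = []
# A fence around a 2 m radius of a device at the origin.
fence = Fence(lambda pos: (pos[0] ** 2 + pos[1] ** 2) ** 0.5 < 2.0)
fence.subscribe(lambda event, pos: events.append(event))
for pos in [(5, 0), (1, 1), (1.5, 0), (4, 3)]:
    fence.feed(pos)
print(events)  # ['enter', 'exit']
```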
FIG. 11 is a schematic diagram of the software modules in a multi-device scenario provided by an embodiment of this application; the cooperation of the modules in slave devices 1 to n and the master device can implement the device-map training method provided by the embodiments of this application. The master device is the hub device described above used to locate device positions, for example the electronic device 30; a slave device is a device described above used to collect the wake-up voice, for example the electronic device 10 or the electronic device 20.
In the scenario shown in FIG. 11, each slave device includes an audio collection module 1101, and the master device includes an audio collection module 1101, an audio processing module 1102 and an audio identification module 1103.
The audio collection module 1101 collects sound from the environment using a multi-microphone array and converts the collected sound into audio.
The audio collection module 1101 of each device can send the audio for each sampling period to the audio processing module 1102 of the master device (for example, the electronic device 30) for processing. The master device may be the smart screen device in FIG. 3, or another device with strong computing power in the smart home scenario, such as a router or a smart speaker.
The audio processing module 1102 of the master device preprocesses the audio of each sampling period — for example, channel conversion, smoothing and noise reduction — to facilitate the subsequent wake-up voice detection by the audio identification module 1103.
The audio identification module 1103 of the master device performs wake-up signal identification on the preprocessed audio of each sampling period, identifying the information of the same wake-up signal across the audio of different sampling periods, such as the arrival time of the audio containing the wake-up signal. The audio identification module 1103 sends the information identifying the same wake-up signal to the device map calculation module 1104.
The device map calculation module 1104 calculates the relative positions between different devices from the arrival times of the same wake-up voice at different devices and the arrival times of the wake-up signal at different microphones of the same device.
In addition, the cooperation of the modules in slave devices 1 to n and the master device can implement the voice wake-up method provided by the embodiments of this application. To implement this method, each slave device in the scenario shown in FIG. 11 may further include a user position localization module 1105, a face orientation recognition module 1106 and a directional wake-up module 1107.
The user position localization module 1105 uses the multi-microphone array in a single device to compute the arrival time, at that multi-microphone device, of the wake-up voice currently uttered by the user, and computes the user's sounding position from the arrival times of the wake-up voice collected by at least two devices.
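For a single two-microphone pair with spacing d, the arrival-time difference gives the source bearing via θ = arccos(v·Δt / d) under a far-field assumption; intersecting the bearings from two devices then yields the user's position (a simplified sketch, not the module's actual implementation):

```python
import math

V_SOUND = 343.0  # speed of sound in air, m/s (approximate)

def bearing_deg(delta_t, mic_spacing):
    """Far-field direction of arrival for a two-microphone pair: the angle
    between the source direction and the microphone axis."""
    c = V_SOUND * delta_t / mic_spacing
    c = max(-1.0, min(1.0, c))  # clamp against measurement noise
    return math.degrees(math.acos(c))

# Mics 10 cm apart: simultaneous arrival means the source is broadside (90°),
# while a delay of d/v means the source lies on the mic axis (0°).
print(round(bearing_deg(0.0, 0.10), 1))             # 90.0
print(round(bearing_deg(0.10 / V_SOUND, 0.10), 1))  # 0.0
```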
The face orientation recognition module 1106 uses the chip-layer face orientation recognition capability, within the range of the device's wide-angle camera, to compute the direction the user's face is pointing in the position map.
The directional wake-up module 1107 obtains the target device to be woken up using the device map computed by the device map calculation module 1104, the user position located by the user position localization module 1105, and the face orientation.
It should be understood that the software structure shown in FIG. 11 is only an example. An electronic device according to the embodiments of this application may have more or fewer modules than shown, two or more modules may be combined, and so on. The modules shown in the figure may be implemented in hardware including one or more signal processing and/or application-specific integrated circuits, in software, or in a combination of hardware and software.
It should be noted that the audio processing module 1102, the audio identification module 1103, the device map calculation module 1104, the user position localization module 1105, the face orientation recognition module 1106 and the directional wake-up module 1107 shown in FIG. 11 may be integrated in one or more processing units of the processor 210 shown in FIG. 2. For example, some or all of these modules may be integrated in one or more processors such as an application processor or a special-purpose processor. The special-purpose processor in the embodiments of this application may be a DSP, an application-specific integrated circuit (ASIC) chip, or the like.
The following embodiments may all be implemented in an electronic device having the above hardware structure and/or software structure.
Based on the same concept, FIG. 12 shows a device 1200 provided by this application. The device 1200 includes at least one processor 1210, a memory 1220, and a transceiver 1230. The processor 1210 is coupled to the memory 1220 and the transceiver 1230. Coupling in the embodiments of this application is an indirect coupling or communication connection between apparatuses, units, or modules, which may be electrical, mechanical, or in another form, and is used for information exchange between the apparatuses, units, or modules. The connection medium among the transceiver 1230, the processor 1210, and the memory 1220 is not limited in the embodiments of this application. For example, in FIG. 12 the memory 1220, the processor 1210, and the transceiver 1230 may be connected by a bus, and the bus may be divided into an address bus, a data bus, a control bus, and so on.
Specifically, the memory 1220 is configured to store program instructions.
The transceiver 1230 is configured to receive or send data.
The processor 1210 is configured to invoke the program instructions stored in the memory 1220, so that the device 1200 performs the steps performed by the electronic device 30, or the steps performed by the electronic device 10 or the electronic device 20 in the foregoing methods.
In the embodiments of this application, the processor 1210 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be directly performed by a hardware processor, or performed by a combination of hardware and software modules in the processor.
In the embodiments of this application, the memory 1220 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory is, without limitation, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory in the embodiments of this application may also be a circuit or any other apparatus capable of implementing a storage function, configured to store program instructions and/or data.
It should be understood that the device 1200 may be used to implement the methods shown in the embodiments of this application; for related features, refer to the description above, and details are not repeated here.
A person skilled in the art can clearly understand that the embodiments of this application may be implemented by hardware, by firmware, or by a combination thereof. When implemented in software, the foregoing functions may be stored on, or transmitted as one or more instructions or code on, a computer-readable medium. Computer-readable media include computer storage media and communication media, where a communication medium includes any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that a computer can access. By way of example and not limitation, a computer-readable medium may include a RAM, a ROM, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection may properly be termed a computer-readable medium. For example, if software is transmitted from a website, a server, or another remote source over a coaxial cable, an optical fiber cable, a twisted pair, a digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, optical fiber cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. As used in the embodiments of this application, disk and disc include a compact disc (CD), a laser disc, an optical disc, a digital versatile disc (DVD), a floppy disk, and a Blu-ray disc, where a disk usually reproduces data magnetically, while a disc reproduces data optically with a laser. Combinations of the above should also be included within the scope of computer-readable media.
In conclusion, the foregoing descriptions are merely embodiments of this application and are not intended to limit the protection scope of this application. Any modification, equivalent replacement, improvement, or the like made according to the disclosure of this application shall fall within the protection scope of this application.

Claims (17)

  1. A voice wake-up method, applied to a first electronic device, wherein the method comprises:
    receiving a voice wake-up instruction of a user;
    obtaining a user image, and detecting a face orientation of the user;
    determining, from among the first electronic device and at least one second electronic device, a target device that the user's face is facing, according to relative positions of the first electronic device and the at least one second electronic device, a user position, and the face orientation of the user;
    instructing the target device to respond to the voice wake-up instruction.
  2. The method according to claim 1, wherein determining the target device that the user's face is facing according to the relative positions of the first electronic device and the at least one second electronic device, the user position, and the face orientation of the user comprises:
    determining, from among the first electronic device and the at least one second electronic device, candidate devices that the user's face is facing, according to the relative positions of the first electronic device and the at least one second electronic device, the user position, and the face orientation of the user;
    when the number of candidate devices is greater than or equal to two, determining relative distances between the user and the at least two candidate devices;
    determining priorities of the candidate devices according to the relative distances, wherein a candidate device at a smaller relative distance has a lower priority;
    determining the candidate device corresponding to the highest priority as the target device.
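Read literally, the ranking in claim 2 gives a lower priority to nearer candidates, so among the candidates the face is toward, the farthest one becomes the target. A minimal sketch under that literal reading; the function and dictionary names are assumptions, not part of the claim.

```python
def pick_by_distance(candidates):
    """candidates: dict mapping device name -> relative distance to the
    user. Per claim 2's wording, a smaller distance means a lower
    priority, so the highest-priority (farthest) candidate is the
    target. Returns None when there are no candidates."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return next(iter(candidates))       # single candidate wins outright
    return max(candidates, key=candidates.get)

print(pick_by_distance({"tv": 2.5, "speaker": 1.2}))  # → tv
```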
  3. The method according to claim 1 or 2, further comprising:
    obtaining information of first audio of the first electronic device, and obtaining information of second audio from the at least one second electronic device;
    determining the user position according to the information of the first audio and the information of the second audio.
  4. The method according to claim 3, wherein the first electronic device comprises a first microphone and a second microphone, and the information of the first audio comprises: a first arrival time at which the voice wake-up instruction reaches the first microphone, a second arrival time at which the voice wake-up instruction reaches the second microphone, a first phase with which the voice wake-up instruction reaches the first microphone, and a second phase with which the voice wake-up instruction reaches the second microphone;
    the at least one second electronic device comprises a third microphone and a fourth microphone, and the information of the second audio comprises: a third arrival time at which the voice wake-up instruction reaches the third microphone, a fourth arrival time at which the voice wake-up instruction reaches the fourth microphone, a third phase with which the voice wake-up instruction reaches the third microphone, and a fourth phase with which the voice wake-up instruction reaches the fourth microphone.
  5. The method according to claim 4, wherein determining the user position according to the information of the first audio and the information of the second audio comprises:
    determining a relative distance between the user and the first electronic device according to a time difference between the first arrival time and the second arrival time, and determining an azimuth of the user relative to the first electronic device according to a phase difference between the first phase and the second phase;
    determining a relative distance between the user and the at least one second electronic device according to a time difference between the second arrival time and the third arrival time, and determining an azimuth of the user relative to the at least one second electronic device according to a phase difference between the third phase and the fourth phase;
    determining the user position according to the relative distance between the user and the first electronic device, the azimuth of the user relative to the first electronic device, the relative distance between the user and the at least one second electronic device, and the azimuth of the user relative to the at least one second electronic device.
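The final step of claim 5 combines, for each device, one (distance, azimuth) fix on the user. One simple way to fuse two such polar fixes is to convert each to Cartesian coordinates in a shared frame and average them. This is an illustrative sketch only: the shared frame, the radian convention, and the averaging step are assumptions, not something the claim prescribes.

```python
import math

def user_position(dev_a, r_a, theta_a, dev_b, r_b, theta_b):
    """Each device contributes a polar fix (distance r, azimuth theta,
    in radians from the x-axis of a shared frame) on the user; convert
    each fix to Cartesian and average the two estimates."""
    ax, ay = dev_a
    bx, by = dev_b
    est_a = (ax + r_a * math.cos(theta_a), ay + r_a * math.sin(theta_a))
    est_b = (bx + r_b * math.cos(theta_b), by + r_b * math.sin(theta_b))
    return ((est_a[0] + est_b[0]) / 2, (est_a[1] + est_b[1]) / 2)

# User actually at (1, 1); devices at the origin and at (2, 0).
pos = user_position((0, 0), math.sqrt(2), math.pi / 4,
                    (2, 0), math.sqrt(2), 3 * math.pi / 4)
print(pos)  # approximately (1.0, 1.0)
```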
  6. The method according to any one of claims 1 to 5, wherein the method further comprises:
    obtaining information of historical audio from the first electronic device and the at least one second electronic device;
    obtaining, from the information of the historical audio, arrival times and phases at which voice wake-up instructions issued by the user N times respectively reached different electronic devices, wherein N is a positive integer;
    determining, according to the arrival times and phases at which the voice wake-up instructions issued by the user N times reached the different electronic devices, relative azimuths and distance differences corresponding to the voice wake-up instructions issued by the user N times;
    establishing an objective function with the relative azimuths and distance differences corresponding to the voice wake-up instructions issued by the user N times as observed values;
    solving the objective function by exhaustive search to obtain the relative positions of the first electronic device and the at least one second electronic device.
  7. The method according to claim 6, wherein determining, according to the arrival times and phases at which the voice wake-up instructions issued by the user N times reached the different electronic devices, the relative azimuths and distance differences corresponding to the voice wake-up instructions issued by the user N times comprises:
    for the K-th voice wake-up instruction among the voice wake-up instructions issued by the user N times, K being any one of the N times, the distance difference corresponding to the K-th voice wake-up instruction issued by the user satisfies the following Formula 1:
    Δτ1 · v = a1    (Formula 1)
    wherein Δτ1 is the difference between the times at which the wake-up signal issued by the user reaches two different electronic devices, v is the speed of sound, and a1 is the difference between the distances from the user's sounding position to the two different electronic devices;
    the relative azimuth corresponding to the K-th voice wake-up instruction issued by the user satisfies the following Formula 2:
    180° − θA1 − θB1 = b1    (Formula 2)
    wherein θA1 is the azimuth of the user's sounding position relative to the second electronic device, θB1 is the azimuth of the user's sounding position relative to the first electronic device, that is, the angle between the line from the sounding point P1 to point B and the x-axis of the Cartesian coordinate system C, and b1 is the relative azimuth of the user's sounding position with respect to the two different electronic devices.
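Formula 1 and the exhaustive-search step of claim 6 can be exercised on a toy layout: generate the distance differences a1 that a few sounding points would produce, then grid-search one device's position until the predicted differences match the observed ones. This sketch simplifies the claim for illustration: device B is fixed at the origin of the relative frame, the sounding points are taken as known (the claim does not require this), and the speed-of-sound value and grid parameters are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C (assumed value)

def distance_difference(p, dev_a, dev_b):
    """Formula 1: the arrival-time difference Δτ1 times the speed of
    sound v equals the difference a1 between the distances from the
    sounding point p to the two devices."""
    dtau = (math.dist(p, dev_a) - math.dist(p, dev_b)) / SPEED_OF_SOUND
    return dtau * SPEED_OF_SOUND

def search_device_a(points, observed_a, dev_b=(0.0, 0.0), span=5.0, step=0.5):
    """Exhaustively search a coarse grid for device A's position that
    best reproduces the observed distance differences, taking the sum
    of squared residuals as the objective function."""
    best, best_err = None, float("inf")
    n = int(2 * span / step) + 1
    for i in range(n):
        for j in range(n):
            cand = (-span + i * step, -span + j * step)
            err = sum((math.dist(p, cand) - math.dist(p, dev_b) - a) ** 2
                      for p, a in zip(points, observed_a))
            if err < best_err:
                best, best_err = cand, err
    return best

# Toy layout: device A actually at (2.0, 1.0), device B at the origin.
true_a = (2.0, 1.0)
pts = [(0.5, 3.0), (3.0, 0.5), (-1.0, 1.5), (1.0, -2.0)]
obs = [distance_difference(p, true_a, (0.0, 0.0)) for p in pts]
print(search_device_a(pts, obs))  # → (2.0, 1.0)
```

With four sounding points in general position, the residual is zero only at the true location, so the coarse grid recovers it exactly; a real deployment would refine the grid or use more observations to tolerate noise.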
  8. The method according to any one of claims 1 to 7, wherein the first electronic device and the at least one second electronic device are connected to the same local area network, or the first electronic device and the at least one second electronic device are pre-bound to the same user account, or the first electronic device and the at least one second electronic device are bound to different user accounts between which a binding relationship has been established.
  9. The method according to any one of claims 1 to 7, wherein the first electronic device has an image capture function, or the at least one second electronic device has an image capture function.
  10. A voice wake-up system, wherein the system comprises a first electronic device and at least one second electronic device;
    the second electronic device is configured to collect sound of the surrounding environment through a multi-microphone array and convert it into second audio, the sound of the surrounding environment comprising a voice wake-up instruction issued by a user;
    the second electronic device is further configured to send information of the second audio to the first electronic device;
    the first electronic device is configured to: obtain information of first audio of the first electronic device, and receive the information of the second audio from the at least one second electronic device; determine a user position according to the information of the first audio and the information of the second audio; and determine, from among the first electronic device and the at least one second electronic device, a target device that the user's face is facing, according to relative positions of the first electronic device and the at least one second electronic device, the user position, and a face orientation of the user;
    the first electronic device is further configured to instruct the target device to respond to the voice wake-up instruction.
  11. The system according to claim 10, wherein, in determining the target device that the user's face is facing according to the relative positions of the first electronic device and the at least one second electronic device, the user position, and the face orientation of the user, the first electronic device is specifically configured to:
    determine, from among the first electronic device and the at least one second electronic device, candidate devices that the user's face is facing, according to the relative positions of the first electronic device and the at least one second electronic device, the user position, and the face orientation of the user;
    when the number of candidate devices is greater than or equal to two, determine relative distances between the user and the at least two candidate devices;
    determine priorities of the candidate devices according to the relative distances, wherein a candidate device at a smaller relative distance has a lower priority;
    determine the candidate device corresponding to the highest priority as the target device.
  12. The system according to claim 10 or 11, wherein
    the first electronic device comprises a first microphone and a second microphone, and the information of the first audio comprises: a first arrival time at which the voice wake-up instruction reaches the first microphone, a second arrival time at which the voice wake-up instruction reaches the second microphone, a first phase with which the voice wake-up instruction reaches the first microphone, and a second phase with which the voice wake-up instruction reaches the second microphone;
    the at least one second electronic device comprises a third microphone and a fourth microphone, and the information of the second audio comprises: a third arrival time at which the voice wake-up instruction reaches the third microphone, a fourth arrival time at which the voice wake-up instruction reaches the fourth microphone, a third phase with which the voice wake-up instruction reaches the third microphone, and a fourth phase with which the voice wake-up instruction reaches the fourth microphone.
  13. The system according to claim 12, wherein, in determining the user position according to the information of the first audio and the information of the second audio, the first electronic device is specifically configured to:
    determine a relative distance between the user and the first electronic device according to a time difference between the first arrival time and the second arrival time, and determine an azimuth of the user relative to the first electronic device according to a phase difference between the first phase and the second phase;
    determine a relative distance between the user and the at least one second electronic device according to a time difference between the second arrival time and the third arrival time, and determine an azimuth of the user relative to the at least one second electronic device according to a phase difference between the third phase and the fourth phase;
    determine the user position according to the relative distance between the user and the first electronic device, the azimuth of the user relative to the first electronic device, the relative distance between the user and the at least one second electronic device, and the azimuth of the user relative to the at least one second electronic device.
  14. The system according to any one of claims 10 to 13, wherein the first electronic device is further configured to:
    obtain information of historical audio from the first electronic device and the at least one second electronic device;
    obtain, from the information of the historical audio, arrival times and phases at which voice wake-up instructions issued by the user N times reached different electronic devices, wherein N is a positive integer;
    determine, according to the arrival times and phases at which the voice wake-up instructions issued by the user N times reached the different electronic devices, relative azimuths and distance differences corresponding to the voice wake-up instructions issued by the user N times;
    establish an objective function with the relative azimuths and distance differences corresponding to the voice wake-up instructions issued by the user N times as observed values;
    solve the objective function by exhaustive search to obtain the relative positions of the first electronic device and the at least one second electronic device.
  15. An electronic device, wherein the electronic device comprises a processor and a memory;
    the memory stores program instructions;
    when the program instructions are executed, the electronic device is caused to perform the method according to any one of claims 1 to 9.
  16. A chip system, wherein the chip system is coupled to a memory in an electronic device, so that when running, the chip invokes program instructions stored in the memory to implement the method according to any one of claims 1 to 9.
  17. A computer-readable storage medium, wherein the computer-readable storage medium comprises program instructions which, when run on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 9.
PCT/CN2021/133119 2020-11-27 2021-11-25 Voice wakeup method and electronic device WO2022111579A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011362525.1 2020-11-27
CN202011362525.1A CN114566171A (en) 2020-11-27 2020-11-27 Voice awakening method and electronic equipment

Publications (1)

Publication Number Publication Date
WO2022111579A1 true WO2022111579A1 (en) 2022-06-02

Family

ID=81711663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/133119 WO2022111579A1 (en) 2020-11-27 2021-11-25 Voice wakeup method and electronic device

Country Status (2)

Country Link
CN (1) CN114566171A (en)
WO (1) WO2022111579A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024000836A1 (en) * 2022-06-29 2024-01-04 歌尔股份有限公司 Voice control method and apparatus for home device, wearable device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273850A (en) * 2022-09-28 2022-11-01 科大讯飞股份有限公司 Autonomous mobile equipment voice control method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106772247A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 A kind of terminal and sound localization method
CN107465986A (en) * 2016-06-03 2017-12-12 法拉第未来公司 The method and apparatus of audio for being detected and being isolated in vehicle using multiple microphones
CN110415695A (en) * 2019-07-25 2019-11-05 华为技术有限公司 A kind of voice awakening method and electronic equipment
US20200061822A1 (en) * 2017-04-21 2020-02-27 Cloundminds (Shenzhen) Robotics Systems Co., Ltd. Method for controlling robot and robot device
US20200072937A1 (en) * 2018-02-12 2020-03-05 Luxrobo Co., Ltd. Location-based voice recognition system with voice command
CN111176744A (en) * 2020-01-02 2020-05-19 北京字节跳动网络技术有限公司 Electronic equipment control method, device, terminal and storage medium
CN111312295A (en) * 2018-12-12 2020-06-19 深圳市冠旭电子股份有限公司 Holographic sound recording method and device and recording equipment
CN111369988A (en) * 2018-12-26 2020-07-03 华为终端有限公司 Voice awakening method and electronic equipment



Also Published As

Publication number Publication date
CN114566171A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
WO2022111579A1 (en) Voice wakeup method and electronic device
RU2663937C2 (en) Method and device for flight management, as well as the electronic device
CN106782540B (en) Voice equipment and voice interaction system comprising same
WO2021027267A1 (en) Speech interaction method and apparatus, terminal and storage medium
CN102467574B (en) Mobile terminal and metadata setting method thereof
US11343613B2 (en) Prioritizing delivery of location-based personal audio
US20130300546A1 (en) Remote control method and apparatus for terminals
JP2020520206A (en) Wearable multimedia device and cloud computing platform with application ecosystem
CN108668077A (en) Camera control method, device, mobile terminal and computer-readable medium
US20180103197A1 (en) Automatic Generation of Video Using Location-Based Metadata Generated from Wireless Beacons
US9288594B1 (en) Auditory environment recognition
CN110691300B (en) Audio playing device and method for providing information
WO2017063283A1 (en) System and method for triggering smart vehicle-mounted terminal
CN103858497A (en) Method and apparatus for providing information based on a location
KR20180081922A (en) Method for response to input voice of electronic device and electronic device thereof
CN111477225A (en) Voice control method and device, electronic equipment and storage medium
CN105959587A (en) Shutter speed acquisition method and device
CN112735403B (en) Intelligent home control system based on intelligent sound equipment
CN107864430A (en) A kind of sound wave direction propagation control system and its control method
CN107707816A (en) A kind of image pickup method, device, terminal and storage medium
CN114299951A (en) Control method and device
CN110719545B (en) Audio playing device and method for playing audio
US9733714B2 (en) Computing system with command-sense mechanism and method of operation thereof
US20180143867A1 (en) Mobile Application for Capturing Events With Method and Apparatus to Archive and Recover
US10726270B2 (en) Selecting media from mass social monitoring devices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21897079

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21897079

Country of ref document: EP

Kind code of ref document: A1