WO2021013137A1 - A voice wake-up method and electronic device - Google Patents

A voice wake-up method and electronic device

Info

Publication number
WO2021013137A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
user
wake
voice
parameter
Prior art date
Application number
PCT/CN2020/103130
Other languages
English (en)
French (fr)
Inventor
Chen Xiang (陈祥)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021013137A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; localisation; normalisation
    • G06V40/166: Detection; localisation; normalisation using acquisition arrangements
    • G06V40/172: Classification, e.g. identification
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command

Definitions

  • This application relates to the field of terminal technology, and in particular to a voice wake-up method and electronic equipment.
  • voice assistants such as Siri, Xiao Ai, Xiao E, etc.
  • one or more wake-up words are generally preset in the electronic device.
  • the electronic device can be triggered to start the voice assistant to conduct voice communication with the user.
  • the electronic device can use preset wake-up parameters to detect whether the user inputs a wake-up word. Taking the sound intensity threshold as an example of a wake-up parameter, the electronic device may set the sound intensity threshold of the wake-up word to 60 dB. That is, when the sound intensity of the user's voice exceeds 60 dB while the wake-up word is being input, the electronic device can confirm that the user has input the wake-up word and wake up the voice assistant. However, when the user is far from the electronic device, the sound intensity detected by the electronic device also decreases, so the user may be unable to wake up the voice assistant when far from the electronic device.
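As a rough illustration of the threshold comparison described above (not part of the claimed method), the check can be sketched in Python. The helper names are invented, and the level is computed as RMS relative to digital full scale with an assumed -20 dBFS threshold, since the 60 dB figure in the embodiment refers to calibrated acoustic intensity:

```python
import math

def rms_intensity_db(samples):
    """Approximate level of a sample buffer: RMS in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

def passes_wake_threshold(samples, threshold_db=-20.0):
    """Keep the buffer as a candidate wake-word input only if it is loud enough."""
    return rms_intensity_db(samples) >= threshold_db
```

A buffer quieter than the threshold would be discarded before any wake-word recognition runs, which is why a distant user may fail to wake the device.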
  • the present application provides a voice wake-up method and electronic device, which can improve the probability that the electronic device is successfully awakened from a wide range of locations, increasing the wake-up rate of the voice assistant in various locations and improving the user experience.
  • the present application provides a voice wake-up method, including: an electronic device acquires an image collected by a camera and determines whether the image includes a user; if the image includes a user, the electronic device determines the first target location of the first user in the image; subsequently, after the user inputs a first voice, the electronic device processes the first voice according to the first target location. For example, if the first target location belongs to a preset first area, the electronic device uses a first parameter to process the first voice; if the first target location belongs to a preset second area, the electronic device uses a second parameter to process the first voice. Then, if the processed first voice includes a preset wake-up word, the voice interaction function of the electronic device is awakened; at this time, the voice interaction function of the electronic device switches from a first state (for example, a standby state) to a second state (for example, a working state).
  • In other words, the electronic device can dynamically set different parameters for detecting wake-up words according to the user's location. When the user inputs a wake-up word from different locations, the electronic device uses the corresponding parameters to detect it, so the electronic device can maintain a high wake-up rate in different location scenarios and improve the user experience in voice interaction scenarios.
  • It should be noted that the process of the electronic device processing the first voice may include collecting the first voice input by the user, and may also include processing such as analog-to-digital conversion, noise reduction, or signal amplification performed on the first voice after collection; this application does not impose any restrictions on this.
  • the foregoing image may include multiple users.
  • In a possible implementation, before the electronic device determines the first target location of the first user in the image, the method further includes: determining the first user among the multiple users.
  • the foregoing first user may be the user with the highest priority among the foregoing multiple users.
  • Alternatively, the above-mentioned first user may be one or more users in a first area that contains the largest number of users among the preset N areas, where N is an integer greater than 1. That is, when the number of users in a certain area is the largest, that area has the highest priority, and the one or more users in it are the first users.
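The area-priority rule above can be sketched as follows; the data shapes (a `user_positions` mapping from user ids to detected image positions, and an `area_of` function mapping a position to one of the N preset areas) are illustrative assumptions, not structures from the application:

```python
from collections import Counter

def pick_first_users(user_positions, area_of):
    """Return the user(s) in whichever preset area contains the most users.

    user_positions: dict mapping user id -> detected position in the image.
    area_of: function mapping a position to an area index 1..N.
    """
    areas = {uid: area_of(pos) for uid, pos in user_positions.items()}
    if not areas:
        return []  # no user detected in the image
    busiest_area, _ = Counter(areas.values()).most_common(1)[0]
    return sorted(uid for uid, a in areas.items() if a == busiest_area)
```

All users in the busiest area are returned, matching the description that one or more users in the highest-priority area are the first users.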
  • the foregoing first parameter may include one or more of: a first wake-up threshold, a first sound pickup direction, a first noise suppression parameter, and a first amplification gain; similarly, the foregoing second parameter may include one or more of: a second wake-up threshold, a second sound pickup direction, a second noise suppression parameter, and a second amplification gain. These parameters refer to one or more parameters that can affect the wake-up rate of an electronic device.
  • the electronic device using the first parameter to process the first voice includes one or more of: using the first wake-up threshold to determine whether the first voice needs to be processed; using the first sound pickup direction to collect the first voice; using the first noise suppression parameter to suppress noise in the first voice; or enhancing the loudness of the first voice according to the first amplification gain.
  • Correspondingly, the electronic device using the second parameter to process the first voice includes one or more of: using the second wake-up threshold to determine whether the first voice needs to be processed; using the second sound pickup direction to collect the first voice; using the second noise suppression parameter to suppress noise in the first voice; or enhancing the loudness of the first voice according to the second amplification gain.
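The parameter groups above can be modeled as a simple record, with processing that applies the amplification gain and then the wake-up threshold; pickup-direction steering and noise suppression are omitted for brevity, and all names and values are illustrative sketches, not the patented implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class WakeParams:
    wake_threshold_db: float     # minimum level for a valid wake-word candidate
    pickup_direction_deg: float  # beam direction (unused in this sketch)
    noise_suppression: float     # suppression factor (unused in this sketch)
    amplification_gain: float    # linear gain applied before detection

def process_voice(samples, params):
    """Amplify the buffer, then keep it only if it clears the wake threshold."""
    amplified = [s * params.amplification_gain for s in samples]
    rms = math.sqrt(sum(s * s for s in amplified) / len(amplified))
    level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return amplified if level_db >= params.wake_threshold_db else None
```

A parameter group for a distant area would typically pair a lower threshold with a higher gain than a near-area group.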
  • In a possible implementation, the electronic device acquiring the image collected by the camera includes: when detecting that the electronic device is powered on, enters a standby state, is turned on, or starts playing, the electronic device can start to acquire each frame of image collected by the camera.
  • the present application provides an electronic device, including: one or more cameras; one or more microphones; one or more processors; a memory; and one or more computer programs; wherein the processor is coupled to the camera, the microphone, and the memory, and the one or more computer programs are stored in the memory. When the electronic device runs, the processor executes the one or more computer programs stored in the memory, so that the electronic device executes any of the above voice wake-up methods.
  • the present application provides a computer storage medium, including computer instructions which, when run on an electronic device, cause the electronic device to execute the voice wake-up method described in any one of the first aspect.
  • this application provides a computer program product which, when run on an electronic device, causes the electronic device to execute the voice wake-up method described in any one of the first aspect.
  • the electronic device described in the second aspect, the computer storage medium described in the third aspect, and the computer program product described in the fourth aspect provided above are all used to execute the corresponding methods provided above; for the beneficial effects that can be achieved, refer to the beneficial effects of the corresponding methods, which will not be repeated here.
  • FIG. 1 is a schematic diagram of a voice interaction process;
  • FIG. 2 is a first schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 3 is a first schematic structural diagram of an electronic device provided by an embodiment of this application;
  • FIG. 4 is a schematic flowchart of a voice wake-up method provided by an embodiment of this application;
  • FIG. 5 is a second schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 6 is a third schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 7 is a fourth schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 8 is a fifth schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 9 is a sixth schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 10 is a seventh schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 11 is an eighth schematic diagram of an application scenario of a voice wake-up method provided by an embodiment of this application;
  • FIG. 12 is a second schematic structural diagram of an electronic device provided by an embodiment of this application.
  • the voice wake-up method provided by the embodiments of this application can be applied to speakers, smart home devices (such as smart TVs, smart air conditioners, smart refrigerators, etc.), mobile phones, tablet computers, laptops, netbooks, personal digital assistants (PDAs), wearable electronic devices, in-vehicle devices, virtual reality devices, and other electronic devices with voice interaction functions, which is not limited in the embodiments of the present application.
  • the voice interaction process can be divided into five links, namely, wake-up, response, input, understanding and feedback.
  • the electronic device needs to be awakened before the user can conduct voice interaction with it.
  • the user can activate the voice interaction function of the electronic device by inputting the correct wake-up word, so that the voice interaction function of the electronic device is switched from the standby state (ie, the first state) to the working state (ie, the second state).
  • When the voice interaction function of the electronic device is in the standby state, after the electronic device receives a voice signal input by the user, it needs to recognize the wake-up word in the voice signal. If the preset wake-up word is recognized, the electronic device can turn on the voice interaction function and enter the working state.
  • When the voice interaction function of the electronic device is in the working state, after receiving a voice signal input by the user, the electronic device can recognize the semantic content in the voice signal through a voice recognition algorithm, so as to respond to the voice signal and realize the corresponding function.
  • the speaker can set the microphone to be always on, and further, the speaker can detect the voice signal input by the user in real time through the microphone.
  • If the preset wake-up word is detected, the speaker can wake up the voice assistant APP in the speaker to switch the speaker from the standby state to the working state.
  • After the voice assistant APP is awakened, it can respond to the wake-up word "Xiaoyi Xiaoyi" input by the user and start to receive voice commands input by the user. Furthermore, the speaker can understand a voice command input by the user by interacting with a server and provide feedback on the command, realizing a complete voice interaction process.
  • the ability to successfully wake up the electronic device is the basis for realizing the voice interaction between the user and the electronic device.
  • One of the important factors for successfully waking up an electronic device is one or more parameters (also referred to as wake-up parameters) used by the electronic device when detecting a wake-up word input by the user.
  • the wake-up parameters may include one or more parameters such as wake-up threshold, sound pickup direction, noise suppression parameters, and amplification gain. The values of these parameters determine the wake-up rate of the electronic device when detecting wake-up words.
  • the wake-up threshold refers to the sound intensity threshold of a wake-up word that can successfully wake up an electronic device.
  • When the electronic device collects the wake-up word input by the user, if it detects that the voice signal input by the user is stronger than the wake-up threshold, the electronic device can treat the voice signal as a valid voice signal and continue to detect whether it contains the wake-up word. Conversely, the electronic device may discard the voice signal as an invalid voice signal.
  • If the value of the wake-up threshold is too high, the user can only successfully wake up the electronic device in an area close to the electronic device; but if the value of the wake-up threshold is too low, the user can wake up the electronic device even when far away from it, which increases the chance of the electronic device being awakened by mistake.
  • the sound pickup direction refers to the direction in which the electronic device receives the voice signal when the user inputs the wake-up word. If the user inputs a wake-up word in the sound pickup direction set by the electronic device, the electronic device is more likely to be successfully awakened. Conversely, if the user inputs a wake-up word in an area outside the pickup direction, the probability of the electronic device being successfully awakened decreases.
  • Noise suppression parameters are used to suppress noise outside the pickup direction. If the user inputs a wake-up word in an area outside the pickup direction, the wake-up word may be discarded as noise when the electronic device applies these noise suppression parameters for detection, so the user fails to wake up the electronic device by inputting the wake-up word.
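The combined effect of the pickup direction and noise suppression described above can be illustrated with a simple angular gate: signals arriving inside the beam are kept at full weight, while signals outside it are scaled down and may therefore fall below the wake-up threshold. The beam width and suppression factor below are invented values for illustration only:

```python
def directional_weight(source_angle_deg, pickup_direction_deg,
                       beam_width_deg=60.0, suppression=0.1):
    """Weight applied to a signal by its direction of arrival.

    Inside the pickup beam the signal passes at full weight; outside it the
    signal is attenuated by `suppression`, which is why a wake-word spoken
    off-beam can be discarded as noise.
    """
    # smallest angular difference, handling wrap-around at 360 degrees
    offset = abs((source_angle_deg - pickup_direction_deg + 180) % 360 - 180)
    return 1.0 if offset <= beam_width_deg / 2 else suppression
```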
  • the amplification gain refers to the amplification factor the electronic device applies to the received voice signal when detecting the wake-up word, and is used to enhance the loudness of the voice signal. Similar to the wake-up threshold, if the value of the amplification gain is too low, the user can only successfully wake up the electronic device in an area close to it; if the value is too high, the user can wake up the electronic device even from far away, which increases the chance of the electronic device being awakened by mistake.
  • the technician needs to test each parameter in the wake-up parameters, and finally select a set of wake-up parameters with a wake-up rate that meets the requirements and save it in the electronic device. Subsequently, the electronic device can use the saved set of wake-up parameters to detect in real time whether the voice signal input by the user contains the correct wake-up word, so as to wake up the electronic device to start the voice interaction function and enter the working state.
  • the specific location where the user inputs the wake-up word into the electronic device changes randomly.
  • the user can input a wake-up word to the electronic device anywhere in the home.
  • when the user inputs the wake-up word from different locations, the probability that the electronic device receives and correctly detects the wake-up word, and is thus successfully awakened, still differs, resulting in a lower wake-up rate when the user tries to wake up the electronic device from certain locations.
  • In the embodiments of this application, the electronic device can dynamically set the current wake-up parameters according to the user's location, so that when the user inputs a wake-up word from different positions, the electronic device can use the corresponding wake-up parameters to detect it, and thus maintain a high wake-up rate in different location scenarios.
  • the smart TV can collect user's image information through a camera, and then determine the user's current location information based on the collected image.
  • smart TVs can preset corresponding wake-up parameters for different location areas. For example, as shown in Figure 2, location area 1, within 1 meter of the smart TV, corresponds to wake-up parameter 1; location area 2, more than 1 meter but no more than 2 meters from the smart TV, corresponds to wake-up parameter 2; and location area 3, more than 2 meters but no more than 3 meters away, corresponds to wake-up parameter 3.
  • the wake-up rate of using wake-up parameter 1 to wake up the smart TV in location area 1 is greater than 90%, the wake-up rate of using wake-up parameter 2 in location area 2 is greater than 90%, and the wake-up rate of using wake-up parameter 3 in location area 3 is also greater than 90%.
  • If the user's current location information belongs to location area 1, the smart TV can set the wake-up parameter to wake-up parameter 1; if it belongs to location area 2, to wake-up parameter 2; and if it belongs to location area 3, to wake-up parameter 3. In this way, no matter where the user inputs the wake-up word, the smart TV can use the corresponding wake-up parameters to detect the wake-up word more accurately, successfully wake up to interact with the user, and improve the user experience in voice interaction scenarios.
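The three distance bands in the smart-TV example map naturally to a lookup from estimated distance to a parameter group. The band edges follow the Figure 2 description, while the function name and fallback behavior are assumptions:

```python
def wake_params_for_distance(distance_m):
    """Select the preset wake-up parameter group for the user's distance."""
    if distance_m <= 1.0:
        return "wake-up parameter 1"   # location area 1: within 1 m
    if distance_m <= 2.0:
        return "wake-up parameter 2"   # location area 2: 1-2 m
    if distance_m <= 3.0:
        return "wake-up parameter 3"   # location area 3: 2-3 m
    return None  # outside the calibrated areas; a default group could be used
```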
  • Certainly, the wake-up parameters configured by the electronic device may also include other parameters that can affect the wake-up rate of the electronic device; the embodiments of this application do not impose any restriction on this.
  • FIG. 3 shows a schematic structural diagram of the foregoing electronic device.
  • the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a camera 140, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a microphone 170B, a sensor module 180, and so on.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and improves system efficiency.
  • the processor 110 may include one or more interfaces.
  • the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G and the like applied to electronic devices.
  • the mobile communication module 150 may include one or more filters, switches, power amplifiers, low noise amplifiers (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves by the antenna 1, and perform processing such as filtering, amplifying and transmitting the received electromagnetic waves to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic waves for radiation via the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110.
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • the wireless communication module 160 can provide wireless communication solutions applied to electronic devices, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and other solutions.
  • the wireless communication module 160 may be one or more devices integrating one or more communication processing modules.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation, amplify it, and convert it into electromagnetic wave radiation via the antenna 2.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store one or more computer programs, and the one or more computer programs include instructions.
  • the processor 110 can run the above-mentioned instructions stored in the internal memory 121 to enable the electronic device to execute the voice wake-up method provided in some embodiments of the present application, as well as various functional applications and data processing.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store the operating system; the storage program area can also store one or more application programs (such as a gallery, contacts, etc.) and so on.
  • the data storage area can store data (such as photos, contacts, etc.) created during the use of the electronic device.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, universal flash storage (UFS), etc.
  • In the embodiments of the present application, the processor 110 executes the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor to cause the electronic device to perform the voice wake-up method provided in the embodiments of the present application, as well as various functional applications and data processing.
  • the electronic device can implement audio functions through the audio module 170, the speaker 170A, the microphone 170B, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
  • the electronic device can play music or the audio of a hands-free call through the speaker 170A.
  • the microphone 170B, also called a "mic", is used to convert sound signals into electrical signals.
  • When making a sound, the user can speak close to the microphone 170B to input the sound signal into it.
  • the electronic device may be provided with one or more microphones 170B. In other embodiments, the electronic device may be provided with two microphones 170B, which can realize noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device may also be provided with three, four or more microphones 170B to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the sensor module 180 may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc., in the embodiments of the present application There are no restrictions on this.
  • the electronic device may further include one or more cameras 140.
  • the camera 140 may be used to capture images, and the images may include still pictures or videos.
  • the object generates an optical image through the lens and projects it to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the camera 140 may send the collected image to the processor 110.
  • the processor 110 can recognize whether the image collected by the camera 140 contains user information through a face recognition algorithm. When the image contains user information, it means that the user has entered the collection area of the camera 140. Furthermore, the processor 110 may determine the location information of the user according to the user image collected by the camera 140 in real time.
  • For example, the processor 110 may extract the user's portrait from the image, and further calculate the distance between the user and the electronic device according to the proportion of the portrait in the entire image.
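One possible realization of the proportional-size estimate above is a one-point pinhole calibration: the detected face height in the image scales roughly inversely with distance. The reference values below are illustrative assumptions, not calibration data from the application:

```python
def estimate_distance_m(face_height_px,
                        ref_face_height_px=400.0, ref_distance_m=1.0):
    """Estimate user distance from the detected face height in pixels.

    Calibrated once: a face `ref_face_height_px` tall is assumed to stand at
    `ref_distance_m` from the camera; size then scales inversely with distance.
    """
    if face_height_px <= 0:
        raise ValueError("no face detected")
    return ref_distance_m * ref_face_height_px / face_height_px
```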
  • the camera 140 may be a 3D depth camera. The image collected by the 3D depth camera contains the depth information of the scene. Then, after the processor 110 recognizes the user image in the image, it can acquire the depth information of the user, so as to determine the distance between the user and the electronic device.
  • the electronic device can also determine the location information of the user in combination with technologies such as sound source localization, which is not limited in the embodiment of the present application.
  • the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device.
  • the electronic device may include more or fewer components than those shown in the figure, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • For example, when the above electronic device is a speaker, the electronic device may also include one or more devices such as a GPU, a display screen, and keys; the embodiment of the present application does not impose any limitation on this.
  • For another example, when the above electronic device is a smart TV, the electronic device may also be equipped with one or more devices such as a remote control and an infrared sensor; the embodiment of the present application does not impose any limitation on this.
  • For another example, when the above electronic device is a mobile phone, the electronic device may also include one or more devices such as a GPU, a display screen, an earphone jack, buttons, a battery, a motor, an indicator, and a SIM card interface; the embodiment of the present application does not impose any limitation on this.
  • a voice wake-up method provided by an embodiment of the present application will be specifically introduced with reference to the accompanying drawings.
  • a smart TV is used as an example of an electronic device for voice interaction.
  • FIG. 4 is a schematic flowchart of a voice wake-up method provided by an embodiment of this application. As shown in Figure 4, the voice wake-up method may include:
  • the smart TV uses the camera to start collecting images.
  • a camera may be installed in an electronic device with a voice interaction function (such as a smart TV).
  • the camera can be used to capture images within a certain shooting range around the smart TV.
  • the smart TV can locate the user based on the user information in the collected image.
  • the smart TV may automatically turn on the camera to start collecting images after power-on.
  • the smart TV can automatically turn on the camera to start capturing images after entering the standby state.
  • the smart TV can automatically turn on the camera to start collecting images after it is turned on or started to play, and the embodiment of the present application does not impose any limitation on this.
  • a camera has a certain field of view (FOV).
  • the FOV of the camera 501 refers to the angle α formed, with the lens of the camera 501 as the vertex, by the two edges of the maximum range through which the measured object can pass through the lens.
  • the size of the FOV determines the field of view of the camera 501.
  • the larger the FOV of the camera 501, the larger the field of view that the camera 501 can capture.
  • correspondingly, if the target object is outside the FOV of the camera 501, it will not appear in the captured frame.
  • each frame captured by the camera 501 can be collected in real time. Since the FOV of the camera 501 is 90°, each frame collected by the smart TV can monitor the picture content within the 90° FOV (hereinafter referred to as the shooting range).
  • after the smart TV obtains each frame of image collected by the camera 501, it can use a preset face recognition algorithm to identify whether the collected image contains a user image. For example, the smart TV can identify whether key facial features such as eyes, mouth, and nose exist in the image; if these facial features exist, the smart TV can determine that the collected image contains a portrait (that is, a user image). Alternatively, the smart TV can also identify whether the collected image contains the user image of a specific user. For example, the user Alice can input her face image into the smart TV in advance, and after the smart TV collects each frame, it can identify whether the image contains the face image of the user Alice. If the face image of the user Alice is included, the smart TV can determine that the captured image contains a user image.
  • the images collected by the smart TV may include one or more user images, or the images collected by the smart TV may not include user images, and the embodiment of the present application does not impose any limitation on this.
  • the smart TV determines the first target location where the user is located.
  • the smart TV may determine the first target location where the user is at this time according to the collected user image.
  • the smart TV may preset a plane coordinate system for the shooting range of its camera 501.
  • the plane coordinate system may use the position of the camera 501 as the origin O, the direction perpendicular to the smart TV screen as the y axis, and the direction parallel to the smart TV screen as the x axis.
  • the first target position where the user is located can be represented by a coordinate in the above-mentioned plane coordinate system.
  • the smart TV can determine the coordinates A (X1, Y1) of the first target position where the user is located in the above-mentioned plane coordinate system according to the position of the user image 601 in the collected image 602.
  • the user image 601 may include a face image of the user.
  • the user image 601 may also include an image of the user's body.
  • the smart TV can also use other positioning methods to determine the first target location where the user is.
  • the smart TV can also collect the user’s voice signal through multiple microphones.
  • the smart TV can calculate the user's first target location through sound source localization technology, based on the direction and intensity of the collected sound signal. The embodiment of the present application does not impose any restriction on this.
  • the smart TV acquires a first wake-up parameter corresponding to the first target location.
  • the shooting range of the camera 501 in the smart TV may be divided into multiple areas (hereinafter referred to as wake-up areas) in advance.
  • for each wake-up area, a set of corresponding parameters (i.e., wake-up parameters) may be preset.
  • the wake-up parameter refers to one or more parameters that can affect the wake-up rate of the smart TV.
  • the wake-up parameter may include one or more of the wake-up threshold, the pickup direction, the noise suppression parameter, and the amplification gain, which is not limited in the embodiment of the present application.
  • the shooting range of the camera 501 includes 4 wake-up areas (i.e., wake-up area 1 to wake-up area 4), whose distances to the smart TV increase sequentially. For each wake-up area, the developer can determine, by testing, the wake-up parameters that ensure a higher wake-up rate of the smart TV in that wake-up area.
  • in step S403, after the smart TV obtains the coordinates A (X1, Y1) of the first target location where the user is located through step S402, the smart TV can further determine the wake-up area to which the coordinates A (X1, Y1) belong.
  • the smart TV can set the current wake-up parameter to the wake-up parameter 3 corresponding to the wake-up area 3.
  • a variable H may be set in the smart TV, and the value of the variable H is used to identify the wake-up parameter being used.
  • the smart TV can set a set of default wake-up parameters.
  • the smart TV can set the variable H to the above-mentioned default wake-up parameter.
  • the smart TV can set the variable H to a wake-up parameter corresponding to the wake-up area (for example, the above-mentioned wake-up parameter 3).
  • the smart TV can dynamically adjust the wake-up parameters being used according to the user's position.
  • the shooting area may be divided into multiple wake-up areas according to the direction of the microphone.
  • a corresponding wake-up parameter can be set for each wake-up area to ensure that the electronic device has a higher wake-up rate in the wake-up area.
  • the shooting area can be divided into multiple wake-up areas according to the distance from the smart TV and how far off-center the position is.
  • wake-up area 1 is an area close to the smart TV and near its center; wake-up area 2 is farther from the smart TV and more off-center than wake-up area 1; and wake-up area 3 is farther from the smart TV and more off-center than wake-up area 2.
  • a corresponding wake-up parameter can be set for each wake-up area to ensure that the electronic device has a higher wake-up rate in the wake-up area.
  • the smart TV may also determine the first target position and set the first wake-up parameter corresponding to the first target position.
  • the shooting range of the camera 501 includes user A, user B, and user C.
  • the images collected by the camera 501 include user images of user A, user B, and user C at the same time.
  • after the smart TV acquires the images including user A, user B, and user C, it can determine the user with the highest priority among them. Furthermore, the smart TV may determine the location of the user with the highest priority as the first target location, thereby determining the first wake-up parameter corresponding to the first target location.
  • the smart TV can save the specific user who initiated the playback task last time. For example, if the user who recently performed the power-on task is user A, the smart TV may determine user A as the user with the highest priority among user A, user B, and user C.
  • the smart TV can save the users who wake up the smart TV every time in a recent period of time, so as to count the users who wake up the smart TV the most times. If user B is the user who wakes up the smart TV the most times, the smart TV may determine user B as the user with the highest priority among user A, user B, and user C.
  • a smart TV can save facial information or voiceprint information of one or more users. Then, if the smart TV detects that the voiceprint information (or face information) of a user among user A, user B, and user C matches the pre-stored voiceprint information (or face information), the smart TV can determine that user as the user with the highest priority.
  • a smart TV can also use face recognition technology to recognize the age of a user based on the collected portraits of users A, B, and C. Furthermore, the smart TV can determine the oldest user as the user with the highest priority, thereby reducing the chance that a young user wakes up the smart TV at will.
  • the smart TV can also determine the user closest to the smart TV, or farthest from the smart TV, or the user with the highest sound signal strength, or the user with the lowest sound signal strength as the user with the highest priority. The embodiment of the present application does not impose any restrictions on this.
  • the smart TV can determine that user C is currently located in the wake-up area 3 according to the location of user C (ie, the first target location). In order to ensure that the user C can successfully wake up the smart TV in the wake-up area 3, the smart TV may set the current wake-up parameter to the wake-up parameter 3 corresponding to the wake-up area 3.
  • the smart TV may set the wake-up area with the largest number of users as the first target location, and set the current wake-up parameter as the wake-up parameter corresponding to the first target location.
  • the smart TV may determine the wake-up area 3 as the first target area where the user is at this time. In order to ensure that the user can successfully wake up the smart TV in the wake-up area 3, the smart TV may set the current wake-up parameter to the wake-up parameter 3 corresponding to the wake-up area 3.
  • each of the above-mentioned wake-up area 1 to wake-up area 3 includes 1 user.
  • the smart TV can determine the user with the highest priority among the three users according to the method in the foregoing embodiment. Furthermore, the smart TV can determine the location of the user with the highest priority as the first target location, thereby determining the first wake-up parameter corresponding to the first target location.
  • the smart TV can recognize the first target location of the user through the collected user images, and set the current wake-up parameter as the wake-up parameter corresponding to the first target location, so as to improve the success rate of waking up the smart TV at the user's location.
  • the smart TV uses the first wake-up parameter to detect whether the voice signal input by the user contains a wake-up word.
  • after the smart TV sets the current wake-up parameter as the first wake-up parameter corresponding to the location of the user, it can collect the voice signal input by the user in real time according to the first wake-up parameter, and determine, based on the first wake-up parameter, whether the voice signal contains the preset wake-up word.
  • the smart TV may collect the voice signal input by the user this time according to the first sound pickup direction and the first noise suppression parameter in the first wake-up parameter. Furthermore, the smart TV can enhance the loudness of the voice signal according to the first amplification gain in the first wake-up parameter. Furthermore, the smart TV can determine whether the voice signal is a valid voice input according to the first wake-up threshold in the first wake-up parameter. If the loudness of the aforementioned voice signal is greater than the first wake-up threshold, indicating that the voice signal is a valid voice input, the smart TV can continue to recognize whether the voice signal contains the preset wake-up word.
  • the smart TV can set the current wake-up parameter as wake-up parameter 3 according to the location information of the user, and then collect the current voice signal in real time according to wake-up parameter 3. As shown in Figure 10, if the user inputs the voice signal "Xiaoyi Xiaoyi" to the smart TV, the wake-up threshold, pickup direction, noise suppression parameters, and amplification gain set in wake-up parameter 3 are all parameters corresponding to wake-up zone 3.
  • using wake-up parameter 3, the smart TV can more quickly and accurately detect that the voice signal input by the user in wake-up zone 3 contains the wake-up word "Xiaoyi Xiaoyi". After the smart TV detects that the user has entered the correct wake-up word, it can start the voice assistant APP to interact with the user by voice.
  • the smart TV can be adjusted to a higher wake-up rate before the user has spoken.
  • the smart TV can use the corresponding wake-up parameter to quickly and accurately recognize the wake-up word input by the user, thereby waking up the smart TV.
  • the smart TV can use the corresponding wake-up parameters to more accurately detect the wake-up word input by the user, so as to successfully wake up the smart TV for voice interaction with the user and improve the voice User experience in interactive scenarios.
  • the smart TV may continue to use the first wake-up parameter set in step S403 to detect the voice signal input by the user. For example, after the user wakes up the smart TV in the wake-up zone 3, the user can continue to input the voice signal of "playing news". In this way, after the smart TV is successfully awakened, the smart TV can still quickly and accurately recognize the voice signal input by the user, thereby improving the efficiency and accuracy of the voice interaction between the user and the smart TV.
  • users may change their position when interacting with the smart TV by voice. Since the camera 501 of the smart TV can capture images within the shooting range in real time, a change in the position of the user in the image means that the user has moved. The smart TV can then continue to perform the following steps S405-S407.
  • S405. The smart TV detects that the user has switched from the first target location to the second target location.
  • the camera 501 can obtain shooting frames containing the user's image in real time. As shown in Figure 11, if it is detected that the user image 601 in the N newly acquired adjacent frames has moved from point P to point Q on the shooting frame, and the distance between point P and point Q is greater than a threshold, the smart TV can determine that the user has moved from the first target location to the second target location.
  • P point can be understood as the position where the user is located in the shooting frame.
  • P point (or Q point) is any point in the user image 601.
  • the center point of the user's face in the shooting screen may be used as the aforementioned P point (or Q point).
  • the center point of the user's body in the shooting screen may be used as the aforementioned P point (or Q point).
  • the position where the user stands (or sits) in the shooting picture can be used as the above-mentioned point P (or point Q), which is not limited in the embodiment of the present application.
  • the smart TV can determine the coordinates B (X2, Y2) of the second target position where the user is currently located in the preset plane coordinate system, according to the position of the user's image in the latest collected frame.
  • the smart TV can also locate the user according to each frame containing the user image collected by the camera 501, or the smart TV can periodically obtain frames containing the user's image and locate the user accordingly. If the coordinates of the user position determined this time differ from the coordinates determined last time, it means that the user has switched from the first target position to the second target position.
  • S406. The smart TV acquires a second wake-up parameter corresponding to the second target location.
  • the smart TV can determine the wake-up area to which the second target location belongs according to the coordinates B of the second target location where the user is currently located. If this wake-up area is the same as the wake-up area to which the first target location belongs in step S403 (for example, both are wake-up area 3 shown in FIG. 9), the smart TV does not need to modify the current wake-up parameter, which remains wake-up parameter 3 corresponding to wake-up area 3.
  • the smart TV can modify the current wake-up parameter to wake-up parameter 2 (i.e., the second wake-up parameter) corresponding to wake-up area 2, so as to ensure the probability that the user successfully wakes up the smart TV at the second target location.
  • S407. The smart TV uses the second wake-up parameter to detect whether the voice signal input by the user contains a wake-up word.
  • similar to step S404, after the smart TV sets the current wake-up parameter to the second wake-up parameter corresponding to the location of the user, it can collect the voice signal input by the user in real time according to the second wake-up parameter, and determine, based on the second wake-up parameter, whether the voice signal contains the preset wake-up word.
  • the smart TV can continue to use the second wake-up parameter to detect the input voice signal. In this way, after the smart TV is successfully awakened, the smart TV can still quickly and accurately recognize the voice signal input by the user, thereby improving the efficiency and accuracy of the voice interaction between the user and the smart TV.
  • when the smart TV detects that there is no user image in the N consecutive frames most recently captured by the camera 501, it indicates that the user has left the shooting range of the camera 501.
  • the smart TV can set the current wake-up parameter as the preset default wake-up parameter.
  • the smart TV can also continue to set the current wake-up parameter to the last dynamically determined wake-up parameter, and the embodiment of the present application does not impose any limitation on this.
  • the embodiment of the present application discloses an electronic device including a processor, and a memory, an input device, and an output device connected to the processor.
  • the input device may be a microphone, a touch sensor, a camera, etc.
  • the output device may be a display screen, a speaker, etc.
  • the input device and the output device may be integrated into one device.
  • a touch sensor may be used as an input device
  • a display screen may be used as an output device
  • the touch sensor and the display screen may be integrated into a touch screen.
  • the above electronic device may include: one or more processors 1202; one or more cameras 1205; one or more microphones 1206; a memory 1203; a display screen 1207; one or more application programs (not shown); and one or more computer programs 1204. The above devices can be connected through one or more communication buses 1201.
  • the one or more computer programs 1204 are stored in the aforementioned memory 1203 and configured to be executed by the one or more processors 1202. The one or more computer programs 1204 include instructions, which can be used to execute the steps in the foregoing embodiments. All relevant content of the steps involved in the above method embodiments can be cited in the functional description of the corresponding physical device, and will not be repeated here.
  • the processor 1202 may be the processor 110 shown in FIG. 3, the memory 1203 may be the internal memory 121 shown in FIG. 3, and the camera 1205 may be the camera 140 shown in FIG. 3.
  • the microphone 1206 may specifically be the microphone 170B shown in FIG. 3, which is not limited in the embodiment of the present application.
  • the functional units in the various embodiments of the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • a computer readable storage medium includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: flash memory, removable hard disk, read-only memory, random access memory, magnetic disk, optical disk, and other media that can store program code.
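The position-switch detection of steps S405-S406 above can be sketched in Python as follows. The frame count and distance threshold are illustrative assumptions, not values from the patent:

```python
import math

def position_changed(track, min_dist=0.5, n_frames=5):
    """Decide whether the user has moved from a first to a second target position.

    `track` is a list of (x, y) coordinates of the user image across successive
    frames. Point P is the position N frames ago, point Q is the newest position;
    a switch is reported only if the P-Q distance exceeds a threshold (S405).
    """
    if len(track) < n_frames:
        return False  # not enough adjacent frames yet
    p, q = track[-n_frames], track[-1]
    return math.dist(p, q) > min_dist
```

When `position_changed` returns True, the device would re-determine the wake-up area of the new coordinates and, only if the area differs, switch to that area's wake-up parameter (S406).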


Abstract

A voice wake-up method and an electronic device, relating to the field of terminal technologies. The method includes: the electronic device obtains an image collected by a camera (S401); if the collected image contains a user image, the first target position where the user is located is determined (S402); the electronic device processes a first voice input by the user, where, if the first target position belongs to a preset first area, the electronic device processes the first voice using a first parameter, and if the first target position belongs to a preset second area, the electronic device processes the first voice using a second parameter, the second area being different from the first area and the second parameter being different from the first parameter; if the processed first voice contains a preset wake-up word, the electronic device switches from a first state to a second state. This method can ensure the probability that the electronic device is successfully woken up over a wide range of positions, improving the wake-up rate of the voice assistant at various positions and the user experience.

Description

A voice wake-up method and electronic device
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on July 25, 2019, with application number 201910677390.9 and invention title "A voice wake-up method and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of terminal technologies, and in particular, to a voice wake-up method and an electronic device.
Background
With the development of speech recognition technology, voice assistants (such as Siri, Xiao AI, Xiao E, etc.) have been added to many electronic devices to help users complete human-computer interaction with the electronic device.
To enable the voice assistant to detect and respond to the user's voice commands in time, one or more wake-up words (for example, "Hello, Xiao E", "Hi Siri", etc.) are generally preset in the electronic device. When it is detected that the user has input a preset wake-up word, it indicates that the user intends to interact by voice; therefore, the electronic device can be triggered to start the voice assistant to communicate with the user by voice.
Generally, the electronic device can use preset wake-up parameters to detect whether the user has input the wake-up word. Taking the sound intensity threshold as an example of a wake-up parameter, the electronic device may set the sound intensity threshold of the wake-up word to 60 dB. That is, when the sound intensity of the wake-up word input by the user is greater than 60 dB, the electronic device can confirm that the user has input the wake-up word and thus wake up the voice assistant. However, when the user is far from the electronic device, the sound intensity that the electronic device detects from the user's input also decreases, so the voice assistant cannot be woken up when the user is far from the electronic device.
Summary
This application provides a voice wake-up method and an electronic device, which can ensure the probability that the electronic device is successfully woken up over a wide range of positions, improving the wake-up rate of the voice assistant at various positions and the user experience.
To achieve the above objective, this application adopts the following technical solutions:
In a first aspect, this application provides a voice wake-up method, including: an electronic device obtains an image collected by a camera; the electronic device can then determine whether the collected image includes a user; if the image includes a user, the electronic device can determine the first target position of a first user in the image; subsequently, after the user inputs a first voice, the electronic device can process the first voice according to the first target position; for example, if the first target position belongs to a preset first area, the electronic device can process the first voice using a first parameter; if the first target position belongs to a preset second area, the electronic device can process the first voice using a second parameter; then, if the processed first voice contains a preset wake-up word, the voice interaction function of the electronic device is woken up, and the voice interaction function of the electronic device switches from a first state (for example, a standby state) to a second state (for example, a working state).
In other words, the electronic device can dynamically set different parameters for detecting the wake-up word in the voice according to the position of the user, so that when the user inputs the wake-up word at different positions, the electronic device can detect it using the corresponding parameters. The electronic device can thus maintain a high wake-up rate in scenarios at different positions, improving the user experience in voice interaction scenarios.
It should be noted that the process in which the electronic device processes the first voice may include the process of collecting the first voice input by the user, and may also include voice processing such as analog-to-digital conversion, noise reduction, or signal amplification performed on the first voice after it is collected, which is not limited in this application.
In a possible implementation, the above image may include multiple users. In this case, before the electronic device determines the first target position of the first user in the image, the method further includes: the electronic device determines the first user from among the multiple users.
Exemplarily, the first user may be the user with the highest priority among the multiple users.
Alternatively, the first user may be one or more users in the first area, where the number of users in the first area is the largest among preset N areas, and N is an integer greater than 1. That is, when a certain area contains the largest number of users, that area has the highest priority, and the one or more users in that area are the first user.
In a possible implementation, the first parameter may include one or more of: a first wake-up threshold, a first sound pickup direction, a first noise suppression parameter, and a first amplification gain; similarly, the second parameter may include one or more of: a second wake-up threshold, a second sound pickup direction, a second noise suppression parameter, and a second amplification gain. These parameters refer to one or more parameters that can affect the wake-up rate of the electronic device.
In a possible implementation, the electronic device processing the first voice using the first parameter includes: the electronic device may use the first wake-up threshold to determine whether the first voice needs to be processed; or the electronic device may collect the first voice using the first sound pickup direction; or the electronic device may suppress noise in the first voice using the first noise suppression parameter; or the electronic device may enhance the loudness of the first voice according to the first amplification gain.
In a possible implementation, the electronic device processing the first voice using the second parameter includes: the electronic device may use the second wake-up threshold to determine whether the first voice needs to be processed; or the electronic device may collect the first voice using the second sound pickup direction; or the electronic device may suppress noise in the first voice using the second noise suppression parameter; or the electronic device may enhance the loudness of the first voice according to the second amplification gain.
In a possible implementation, the electronic device obtaining the image collected by the camera includes: after detecting that the electronic device is powered on, on standby, booted, or has started playing, the electronic device can start to obtain each frame of image collected by the camera.
In a second aspect, this application provides an electronic device, including: one or more cameras; one or more microphones; one or more processors; a memory; and one or more computer programs, where the processor is coupled to the camera, the microphone, and the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so that the electronic device performs the voice wake-up method according to any one of the above.
In a third aspect, this application provides a computer storage medium including computer instructions which, when run on an electronic device, cause the electronic device to perform the voice wake-up method according to any one of the first aspect.
In a fourth aspect, this application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the voice wake-up method according to any one of the first aspect.
It can be understood that the electronic device of the second aspect, the computer storage medium of the third aspect, and the computer program product of the fourth aspect provided above are all used to perform the corresponding methods provided above. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, which will not be repeated here.
Brief Description of the Drawings
FIG. 1 is an interaction schematic diagram of a voice interaction process;
FIG. 2 is a first schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 3 is a first schematic structural diagram of an electronic device according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a voice wake-up method according to an embodiment of this application;
FIG. 5 is a second schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 6 is a third schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 7 is a fourth schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 8 is a fifth schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 9 is a sixth schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 10 is a seventh schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 11 is an eighth schematic diagram of an application scenario of a voice wake-up method according to an embodiment of this application;
FIG. 12 is a second schematic structural diagram of an electronic device according to an embodiment of this application.
Detailed Description
The implementation of this embodiment will be described in detail below with reference to the accompanying drawings.
The voice wake-up method provided by the embodiments of this application can be applied to electronic devices with a voice interaction function such as speakers, smart home devices (for example, smart TVs, smart air conditioners, smart refrigerators, etc.), mobile phones, tablet computers, notebook computers, netbooks, personal digital assistants (personal digital assistant, PDA), wearable electronic devices, vehicle-mounted devices, or virtual reality devices, which is not limited in the embodiments of this application.
Generally, a voice interaction process can be divided into five stages: wake-up, response, input, understanding, and feedback. Before voice interaction with the electronic device, the electronic device needs to be woken up first. For example, the user can activate the voice interaction function of the electronic device by inputting the correct wake-up word, switching the voice interaction function of the electronic device from the standby state (i.e., the first state) to the working state (i.e., the second state).
When the voice interaction function of the electronic device is in the standby state, after receiving a voice signal input by the user, the electronic device needs to recognize the wake-up word in the voice signal. If the preset wake-up word is recognized, the electronic device can enable the voice interaction function and enter the working state. When the voice interaction function of the electronic device is in the working state, after receiving a voice signal input by the user, the electronic device can recognize the semantic content of the voice signal through a speech recognition algorithm, thereby responding to the voice signal to implement the corresponding function.
Taking a speaker as an example of the above electronic device, if the wake-up word used to wake up the speaker is "Xiaoyi Xiaoyi", the user needs to input the wake-up word "Xiaoyi Xiaoyi" to wake up the speaker before voice interaction with it. Exemplarily, the speaker can set its microphone to always on, so that the speaker can detect the voice signal input by the user in real time through the microphone. As shown in FIG. 1, when detecting the voice signal of the user inputting the wake-up word "Xiaoyi Xiaoyi", the speaker can wake up the voice assistant APP in the speaker and switch the speaker from the standby state to the working state. After the voice assistant APP is woken up, it can respond to the wake-up word "Xiaoyi Xiaoyi" input by the user and start receiving the voice commands input by the user. Furthermore, the speaker can understand the voice commands input by the user through interaction with the server and give feedback on them, thus completing a full voice interaction flow.
It can be seen that being able to successfully wake up the electronic device is the basis of voice interaction between the user and the electronic device. One of the important factors in successfully waking up the electronic device is the one or more parameters (also called wake-up parameters) used by the electronic device when detecting the wake-up word input by the user. For example, the wake-up parameters may include one or more parameters such as a wake-up threshold, a sound pickup direction, noise suppression parameters, and an amplification gain. The values of these parameters determine the wake-up rate of the electronic device when detecting the wake-up word.
Taking the wake-up threshold as an example, the wake-up threshold refers to the sound intensity threshold of a wake-up word that can successfully wake up the electronic device. Generally, when the electronic device collects the wake-up word input by the user, if it detects that the sound intensity of the voice signal input by the user is greater than the wake-up threshold, the electronic device can treat the voice signal as a valid voice signal and continue to detect whether the voice signal contains the wake-up word. Otherwise, the electronic device can discard the voice signal as invalid. If the wake-up threshold is set too high, the user can only successfully wake up the electronic device in an area close to it; but if the wake-up threshold is set too low, the user can also wake up the electronic device from far away, which increases the chance of the electronic device being woken up by mistake.
Taking the sound pickup direction as an example, the sound pickup direction refers to the direction in which the electronic device receives the voice signal when collecting the wake-up word input by the user. If the user inputs the wake-up word in the sound pickup direction set by the electronic device, the probability of successfully waking up the electronic device is higher. Conversely, if the user inputs the wake-up word in an area outside the sound pickup direction, the probability of successfully waking up the electronic device decreases.
Taking the noise suppression parameter as an example, the noise suppression parameter is used to suppress noise outside the sound pickup direction. Then, if the user inputs the wake-up word in an area outside the sound pickup direction, the electronic device, when detecting with the above noise suppression parameter, may discard the wake-up word input by the user as noise, so that the user cannot successfully wake up the electronic device after inputting the wake-up word.
Taking the amplification gain as an example, the amplification gain refers to the amplification factor applied by the electronic device to the received voice signal when detecting the wake-up word, and is used to enhance the loudness of the voice signal. Similar to the wake-up threshold, if the amplification gain is set too low, the user can only successfully wake up the electronic device in an area close to it; if the amplification gain is set too high, the user can also wake up the electronic device from far away, which increases the chance of the electronic device being woken up by mistake.
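The interplay of the four example parameters above (wake-up threshold, sound pickup direction, noise suppression, amplification gain) can be sketched as a simple front-end validity check. All field names and numeric values below are illustrative assumptions, not values from the patent:

```python
from dataclasses import dataclass

@dataclass
class WakeParams:
    """One set of wake-up parameters; the four fields mirror the examples above."""
    wake_threshold_db: float      # minimum loudness of a valid voice input
    pickup_direction_deg: float   # preferred sound pickup direction
    pickup_width_deg: float       # half-width outside which suppression applies
    noise_suppression_db: float   # attenuation applied to off-axis sound
    amplification_gain_db: float  # gain applied before the threshold check

def is_valid_input(loudness_db: float, direction_deg: float, p: WakeParams) -> bool:
    """Apply the gain, suppress off-axis sound as likely noise, then compare the
    resulting level with the wake-up threshold."""
    level = loudness_db + p.amplification_gain_db
    if abs(direction_deg - p.pickup_direction_deg) > p.pickup_width_deg:
        level -= p.noise_suppression_db  # outside the pickup direction
    return level > p.wake_threshold_db
```

Only signals passing this check would proceed to wake-word recognition; the sketch makes visible why off-axis or distant speech fails under one fixed parameter set.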
Generally, before the electronic device leaves the factory, technicians need to test each of the wake-up parameters and finally select a set of wake-up parameters whose wake-up rate meets the requirements to be stored in the electronic device. Subsequently, the electronic device can use this stored set of wake-up parameters to detect in real time whether the voice signal input by the user contains the correct wake-up word, so as to wake up the electronic device to start the voice interaction function and enter the working state.
In actual use, the specific position of the user when inputting the wake-up word to the electronic device varies randomly. Taking a smart TV as an example of the electronic device, the user can input the wake-up word to the electronic device from any position at home. However, since the wake-up parameters stored in the electronic device are fixed, when the user inputs the wake-up word to the electronic device from different positions, the probability that the electronic device receives and correctly detects the wake-up word and is thus successfully woken up still differs, resulting in a low wake-up rate at certain positions.
In the embodiments of this application, the electronic device can dynamically set the current wake-up parameters according to the position of the user, so that when the user inputs the wake-up word at different positions, the electronic device can detect it using the corresponding wake-up parameters, and the electronic device can maintain a high wake-up rate in scenarios at different positions.
Taking a smart TV as an example of the electronic device, the smart TV can collect the user's image information through a camera, and then determine the user's current position information according to the collected image. For different position areas, the smart TV can preset corresponding wake-up parameters. For example, as shown in FIG. 2, position area 1, within 1 meter of the smart TV, corresponds to wake-up parameter 1; position area 2, more than 1 meter and at most 2 meters from the smart TV, corresponds to wake-up parameter 2; and position area 3, more than 2 meters and at most 3 meters from the smart TV, corresponds to wake-up parameter 3. The wake-up rate of waking up the smart TV using wake-up parameter 1 in position area 1 is greater than 90%, the wake-up rate using wake-up parameter 2 in position area 2 is greater than 90%, and the wake-up rate using wake-up parameter 3 in position area 3 is also greater than 90%.
Then, if the user's current position information belongs to position area 1, the smart TV can set the wake-up parameter to wake-up parameter 1. If the user's current position information belongs to position area 2, the smart TV can set the wake-up parameter to wake-up parameter 2. If the user's current position information belongs to position area 3, the smart TV can set the wake-up parameter to wake-up parameter 3. In this way, no matter which position area the user enters to input the wake-up word to the smart TV, the smart TV can use the corresponding wake-up parameters to detect the wake-up word input by the user relatively accurately, thereby successfully waking up the smart TV for voice interaction with the user and improving the user experience in voice interaction scenarios.
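The area-to-parameter mapping in the FIG. 2 example above can be sketched as a distance lookup. The returned labels are placeholders for the actual parameter sets, and the fallback for positions beyond 3 meters is an assumption:

```python
def select_wake_params(distance_m: float) -> str:
    """Map the user's distance from the smart TV to the preset wake-up parameter
    set, following the example areas of FIG. 2 (1 m / 2 m / 3 m boundaries)."""
    if distance_m <= 1.0:
        return "wake-up parameter 1"
    if distance_m <= 2.0:
        return "wake-up parameter 2"
    if distance_m <= 3.0:
        return "wake-up parameter 3"
    return "default wake-up parameters"  # outside the tested areas (an assumption)
```

In a real device the returned values would be full parameter records (threshold, pickup direction, noise suppression, gain) tuned per area during factory testing.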
The specific manner in which the electronic device detects the user's position information and dynamically configures the wake-up parameters will be elaborated in subsequent embodiments, and will not be repeated here. It can be understood that, in addition to one or more of the above wake-up threshold, sound pickup direction, noise suppression parameters, and amplification gain, the wake-up parameters configured by the electronic device may also include one or more other parameters that can affect the wake-up rate of the electronic device, which is not limited in the embodiments of this application.
Exemplarily, FIG. 3 shows a schematic structural diagram of the above electronic device.
The electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a camera 140, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a microphone 170B, a sensor module 180, and the like.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processing unit (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural-network processing unit (neural-network processing unit, NPU), etc. The different processing units may be independent devices or may be integrated in one or more processors.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory can store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from the memory. This avoids repeated access and reduces the waiting time of the processor 110, thus improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (inter-integrated circuit, I2C) interface, an inter-integrated circuit sound (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver/transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (general-purpose input/output, GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, etc.
The mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied to the electronic device. The mobile communication module 150 may include one or more filters, switches, power amplifiers, low noise amplifiers (low noise amplifier, LNA), etc. The mobile communication module 150 can receive electromagnetic waves through antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 and at least some of the modules of the processor 110 may be provided in the same device.
The wireless communication module 160 can provide solutions for wireless communication applied to the electronic device, including wireless local area networks (wireless local area networks, WLAN) (such as wireless fidelity (wireless fidelity, Wi-Fi) networks), Bluetooth (Bluetooth, BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared (infrared, IR), and the like. The wireless communication module 160 may be one or more devices integrating one or more communication processing modules. The wireless communication module 160 receives electromagnetic waves via antenna 2, performs frequency modulation and filtering on the electromagnetic wave signal, and sends the processed signal to the processor 110. The wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation and amplification on it, and convert it into electromagnetic waves for radiation through antenna 2.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving files such as music and videos in the external memory card.
The internal memory 121 can be used to store one or more computer programs, which include instructions. By running the above instructions stored in the internal memory 121, the processor 110 can cause the electronic device to perform the intelligent contact recommendation method provided in some embodiments of this application, as well as various functional applications and data processing. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store an operating system, and can also store one or more application programs (such as Gallery, Contacts, etc.). The data storage area can store data (such as photos, contacts, etc.) created during the use of the electronic device. In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, universal flash storage (universal flash storage, UFS), etc. In some other embodiments, the processor 110 causes the electronic device to perform the voice interaction method provided in the embodiments of this application, as well as various functional applications and data processing, by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
The electronic device can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the microphone 170B, the application processor, and the like.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The speaker 170A, also called the "horn", is used to convert an audio electrical signal into a sound signal. The electronic device can listen to music or a hands-free call through the speaker 170A.
The microphone 170B, also called the "mic" or "mike", is used to convert a sound signal into an electrical signal. When making a call or sending a voice message, the user can speak close to the microphone 170B to input the sound signal into the microphone 170B. The electronic device may be provided with one or more microphones 170B. In some other embodiments, the electronic device may be provided with two microphones 170B, which can implement a noise reduction function in addition to collecting sound signals. In some other embodiments, the electronic device may also be provided with three, four, or more microphones 170B to collect sound signals, reduce noise, identify the sound source, implement directional recording, and the like.
The sensor module 180 may include a pressure sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc., which is not limited in the embodiments of this application.
In the embodiments of this application, the electronic device may also include one or more cameras 140.
The camera 140 can be used to capture images, which may include still pictures or videos. An object generates an optical image through the lens and projects it onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a complementary metal-oxide-semiconductor (complementary metal-oxide-semiconductor, CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then converted into a digital image signal.
Exemplarily, the camera 140 can send the collected images to the processor 110. The processor 110 can recognize, through a face recognition algorithm, whether the image collected by the camera 140 contains user information. When the image contains user information, it means that the user has entered the collection area of the camera 140. The processor 110 can then determine the user's position information according to the user image collected by the camera 140 in real time.
For example, the processor 110 can extract the user's avatar from the image, and then calculate the distance between the user and the electronic device according to the proportion of the user's avatar in the entire image. For another example, the camera 140 may be a 3D depth camera, whose collected images contain the depth information of the scene. Then, after recognizing the user image in the image, the processor 110 can obtain the user's depth information and thereby determine the distance between the user and the electronic device.
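The "proportion of the avatar in the image" distance estimate described above can be sketched with a pinhole-camera model. The focal length and real face height below are assumed calibration constants, not values from the patent:

```python
def estimate_distance_m(face_height_px: float,
                        focal_length_px: float = 1000.0,
                        real_face_height_m: float = 0.25) -> float:
    """Estimate the user's distance from the size a face occupies in the frame,
    using the pinhole model: distance = f * H_real / h_pixels.

    focal_length_px and real_face_height_m are assumed calibration values; a
    3D depth camera, as also mentioned above, would read depth directly instead.
    """
    if face_height_px <= 0:
        raise ValueError("face not detected")
    return focal_length_px * real_face_height_m / face_height_px
```

A smaller face image yields a larger estimated distance, matching the intuition that the avatar's proportion in the frame shrinks as the user moves away.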
Of course, the electronic device may also determine the user's location information in combination with technologies such as sound source localization, which is not limited in the embodiments of this application.
It can be understood that the structure illustrated in the embodiments of the present invention does not constitute a specific limitation on the electronic device. In other embodiments of this application, the electronic device may include more or fewer components than shown, combine some components, split some components, or have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
For example, when the electronic device is a speaker, it may further include one or more components such as a GPU, a display screen, and buttons, which is not limited in the embodiments of this application.
As another example, when the electronic device is a smart TV, it may further be equipped with one or more components such as a remote control and an infrared sensor, which is not limited in the embodiments of this application.
As another example, when the electronic device is a mobile phone, it may further include one or more components such as a GPU, a display screen, a headset jack, buttons, a battery, a motor, an indicator, and a SIM card interface, which is not limited in the embodiments of this application.
A voice wakeup method provided by the embodiments of this application will now be described in detail with reference to the accompanying drawings. In the following embodiments, a smart TV is used as an example of the electronic device in voice interaction.
FIG. 4 is a schematic flowchart of a voice wakeup method provided by an embodiment of this application. As shown in FIG. 4, the voice wakeup method may include:
S401: The smart TV starts capturing images with its camera.
In the embodiments of this application, a camera may be installed in an electronic device with a voice interaction function (for example, a smart TV). The camera can capture images within a certain shooting range around the smart TV, and the smart TV can locate the user based on the user information in the captured images.
For example, the smart TV may automatically turn on the camera and start capturing images after being powered on. Alternatively, the smart TV may automatically turn on the camera and start capturing images after entering the standby state, or after being switched on or starting playback, which is not limited in the embodiments of this application.
Generally, a camera has a certain field of view (FOV). As shown in FIG. 5, the FOV of the camera 501 refers to the angle α whose vertex is the lens of the camera 501 and whose two edges bound the maximum range of the measured object that can pass through the lens. The FOV determines the camera 501's field of vision: the larger the FOV, the larger the field of vision the camera 501 can capture. Correspondingly, a target object beyond the FOV of the camera 501 will not appear in the captured picture.
Take a smart TV whose camera 501 has a 90° FOV as an example. After turning on the camera 501, the smart TV can capture every frame from the camera 501 in real time. Since the FOV of the camera 501 is 90°, each captured frame covers the picture content within the 90° FOV (hereinafter referred to as the shooting range).
After obtaining each frame captured by the camera 501, the smart TV can use a preset face recognition algorithm to identify whether the captured image contains a user image. For example, the smart TV can identify whether key facial features such as eyes, mouth, and nose are present in the image; if these facial features are present, the smart TV can determine that the captured image contains a human figure (i.e., contains a user image). Alternatively, the smart TV can identify whether the captured image contains the user image of a specific user. For example, user Alice can input her face image into the smart TV in advance; after capturing each frame, the smart TV can identify whether the image contains Alice's face image. If so, the smart TV can determine that the captured image contains a user image.
Of course, the image captured by the smart TV may contain one or more user images, or may contain no user image at all, which is not limited in the embodiments of this application.
S402: If the captured image contains a user image, the smart TV determines the first target location of the user.
If the image captured by the smart TV contains a user image, it indicates that the user has entered the shooting range of the camera 501, and the user may need to control the smart TV through voice interaction. To enable the user to successfully wake up the smart TV later, the smart TV can determine the user's current first target location from the captured user image.
For example, as shown in (a) of FIG. 6, the smart TV may preset a plane coordinate system for the shooting range of its camera 501. For example, the plane coordinate system may take the position of the camera 501 as the origin O, the direction perpendicular to the smart TV screen as the y-axis, and the direction parallel to the smart TV screen as the x-axis. When the user appears at any point within the shooting range, the user's first target location can be represented by a coordinate in this plane coordinate system.
Further, (b) of FIG. 6 shows a frame of image 602 captured by the camera 501. When the user appears at different positions within the shooting range, the position of the user image 601 within the image 602 changes accordingly. That is, the position of the user image 601 in the image 602 corresponds one-to-one to the user's coordinate in the above plane coordinate system. Based on this correspondence, the smart TV can determine the coordinate A(X1, Y1) of the user's first target location in the plane coordinate system from the position of the user image 601 in the captured image 602. The user image 601 may include the user's face image, or may include an image of the user's body.
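The correspondence between the position of the user image in the frame and a coordinate in the plane coordinate system can be sketched as follows, assuming (hypothetically) that the horizontal pixel offset maps linearly to a viewing angle within the camera's FOV and that a distance estimate is already available:

```python
import math

def pixel_to_plane(cx_px, frame_w_px, distance_m, fov_deg=90.0):
    """Map the user's horizontal pixel position plus an estimated
    distance to a coordinate (x, y) in the TV's plane coordinate
    system (origin at the camera, y perpendicular to the screen).

    Hypothetical sketch: the offset from the image center is assumed
    proportional to the viewing angle within the camera's FOV.
    """
    half_fov = math.radians(fov_deg) / 2
    # Normalized offset from the image center: -1.0 (left) .. 1.0 (right).
    offset = (cx_px - frame_w_px / 2) / (frame_w_px / 2)
    angle = offset * half_fov
    # Project the distance onto the x (parallel) and y (perpendicular) axes.
    return distance_m * math.sin(angle), distance_m * math.cos(angle)
```

A user centered in a 1000-pixel-wide frame at 2 m maps to (0, 2); a user at the right-hand edge of a 90° FOV sits at 45°, so x and y are equal.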
Of course, the smart TV may also use other positioning methods to determine the user's first target location. For example, the smart TV may collect the user's sound signal through multiple microphones and then, based on sound source localization technology, calculate the user's first target location from parameters such as the direction and strength of the collected sound signal, which is not limited in the embodiments of this application.
S403: The smart TV obtains the first wakeup parameter corresponding to the first target location.
In the embodiments of this application, the shooting range of the camera 501 in the smart TV may be divided in advance into multiple areas (hereinafter referred to as wakeup areas), and a corresponding set of parameters (i.e., wakeup parameters) may be preset for each wakeup area. The wakeup parameters are one or more parameters that can affect the wakeup rate of the smart TV. For example, the wakeup parameters may include one or more of a wakeup threshold, a sound pickup direction, a noise suppression parameter, and an amplification gain, which is not limited in the embodiments of this application.
For example, as shown in FIG. 7, the shooting range of the camera 501 includes four wakeup areas (wakeup area 1 to wakeup area 4), whose distances to the smart TV increase in sequence. For each wakeup area, developers can determine through testing the wakeup parameters that ensure a high wakeup rate of the smart TV in that area. For example, wakeup parameter 1 can be set for wakeup area 1, wakeup parameter 2 for wakeup area 2, wakeup parameter 3 for wakeup area 3, and wakeup parameter 4 for wakeup area 4, thereby keeping the smart TV's wakeup rate at a high level in every location.
Then, in step S403, after obtaining the coordinate A(X1, Y1) of the user's first target location through step S402, the smart TV can further determine the wakeup area to which coordinate A(X1, Y1) belongs. Taking the case where coordinate A(X1, Y1) lies in wakeup area 3 as an example, to enable the user at coordinate A to successfully wake up the smart TV, the smart TV can set the current wakeup parameter to wakeup parameter 3 corresponding to wakeup area 3.
For example, a variable H may be set in the smart TV, whose value identifies the wakeup parameter in use. The smart TV may set a group of default wakeup parameters; when the camera 501 detects no user image or detects that the user has left the shooting range, the smart TV may set variable H to the default wakeup parameter. When it detects that the user's location lies in a preset wakeup area, the smart TV may set variable H to the wakeup parameter corresponding to that wakeup area (for example, wakeup parameter 3 above). That is, by changing the value of variable H, the smart TV can dynamically adjust the wakeup parameter in use according to the user's location.
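The role of variable H can be sketched as a simple lookup from the detected wakeup area to a parameter set. The threshold and gain values below are placeholders, not the tested values the text refers to:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WakeParams:
    threshold: float   # wakeup threshold (loudness)
    gain: float        # amplification gain

# Hypothetical per-area parameters; real values would come from testing.
DEFAULT = WakeParams(threshold=0.5, gain=1.0)
AREA_PARAMS = {
    1: WakeParams(threshold=0.5, gain=1.0),
    2: WakeParams(threshold=0.4, gain=1.5),
    3: WakeParams(threshold=0.3, gain=2.0),
    4: WakeParams(threshold=0.2, gain=3.0),
}

def select_params(area_id):
    """Return the wakeup parameters for the user's area (the role of
    variable H); fall back to the defaults when no user is detected."""
    if area_id is None:
        return DEFAULT
    return AREA_PARAMS.get(area_id, DEFAULT)
```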
It should be noted that those skilled in the art can divide the wakeup areas within the above shooting range according to practical experience or actual application scenarios; the size and shape of each wakeup area may be the same or different. The embodiments of this application do not limit this.
For example, as shown in (a) of FIG. 8, the shooting range may be divided into multiple wakeup areas according to the sound pickup directions of the microphones. Likewise, a corresponding wakeup parameter may be set for each wakeup area to ensure a high wakeup rate of the electronic device in that area. As another example, the shooting range may be divided into multiple wakeup areas according to the distance from and the offset relative to the smart TV. As shown in (b) of FIG. 8, wakeup area 1 is an area close to the smart TV and located at the center; wakeup area 2 is farther away and more off-center than wakeup area 1; and wakeup area 3 is farther away and more off-center still than wakeup area 2. Likewise, a corresponding wakeup parameter may be set for each wakeup area to ensure a high wakeup rate of the electronic device in that area.
In addition, the above embodiments use a scenario with only one user in the shooting range of the camera 501 as an example. When multiple users appear in the shooting range of the camera 501 at the same time, the smart TV can likewise determine a first target location and set the first wakeup parameter corresponding to the first target location.
For example, as shown in FIG. 9, the shooting range of the camera 501 includes user A, user B, and user C, so the image captured by the camera 501 contains the user images of user A, user B, and user C at the same time.
In some embodiments, after obtaining the image containing users A, B, and C, the smart TV can determine the user with the highest priority among users A, B, and C. The smart TV can then determine the location of the highest-priority user as the first target location, and thereby determine the first wakeup parameter corresponding to the first target location.
For example, the smart TV may record the specific user who most recently initiated a playback task. If the user who most recently performed the power-on task is user A, the smart TV may determine user A as the highest-priority user among users A, B, and C.
As another example, the smart TV may record the user who triggered each wakeup of the smart TV over a recent period, so as to identify the user who has woken up the smart TV most often. If user B is the user who has woken up the smart TV most often, the smart TV may determine user B as the highest-priority user among users A, B, and C.
As another example, the smart TV may store the face information or voiceprint information of one or more users. Then, if the smart TV detects that the voiceprint information (or face information) of one of users A, B, and C matches the pre-stored voiceprint information (or face information), it may determine that user as the highest-priority user.
As another example, the smart TV may use face recognition technology to estimate the users' ages from the captured head portraits of users A, B, and C. The smart TV may then determine the oldest user as the highest-priority user, thereby reducing the chance of young children waking up the smart TV at will.
Of course, the smart TV may also determine as the highest-priority user the user closest to or farthest from the smart TV, or the user whose sound signal is strongest or weakest, which is not limited in the embodiments of this application.
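A minimal sketch of one possible priority policy combining several of the heuristics above (most recent operator, then wakeup count, then proximity). The ordering of the heuristics is an assumption, since the text lists them only as alternatives:

```python
def pick_highest_priority(users, wake_counts=None, last_operator=None):
    """Pick the highest-priority user among the detected users.

    Hypothetical policy, applied in order:
    1. the user who most recently operated the TV, if present;
    2. otherwise the user with the most recorded wakeups;
    3. otherwise the user closest to the TV.
    `users` maps a user name to a distance in meters; all names are
    illustrative.
    """
    wake_counts = wake_counts or {}
    if last_operator in users:
        return last_operator
    counted = [u for u in users if wake_counts.get(u, 0) > 0]
    if counted:
        return max(counted, key=lambda u: wake_counts[u])
    return min(users, key=users.get)
```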
Taking user C as the highest-priority user, still as shown in FIG. 9, the smart TV can determine from user C's location (i.e., the first target location) that user C is currently in wakeup area 3. To ensure that user C can successfully wake up the smart TV in wakeup area 3, the smart TV can set the current wakeup parameter to wakeup parameter 3 corresponding to wakeup area 3.
In other embodiments, after obtaining the image containing users A, B, and C, the smart TV can count the number of users in each wakeup area. The smart TV can then treat the wakeup area with the most users as the first target location, and set the current wakeup parameter to the wakeup parameter corresponding to the first target location.
Still as shown in FIG. 9, after obtaining the image containing users A, B, and C, the smart TV can count the number of users in wakeup areas 1 to 3: wakeup area 1 contains one user, wakeup area 2 contains no user, and wakeup area 3 contains two users. The smart TV can therefore determine wakeup area 3 as the first target area where the users are currently located. To ensure that the users can successfully wake up the smart TV in wakeup area 3, the smart TV can set the current wakeup parameter to wakeup parameter 3 corresponding to wakeup area 3.
In addition, if the smart TV finds that each wakeup area contains the same number of users, for example, wakeup areas 1 to 3 each contain one user, the smart TV can determine the highest-priority user among the three users according to the method in the above embodiments. The smart TV can then determine the location of the highest-priority user as the first target location, and thereby determine the first wakeup parameter corresponding to the first target location.
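The area-selection logic of these embodiments, counting users per wakeup area and falling back to the highest-priority user's area on a tie, might be sketched as:

```python
from collections import Counter

def pick_target_area(user_areas, priority_order):
    """Choose the wakeup area to optimize for when several users are
    visible: the area with the most users, falling back to the
    highest-priority user's area on a tie.

    `user_areas` maps user -> area id; `priority_order` lists users
    from highest to lowest priority. Both inputs are illustrative.
    """
    counts = Counter(user_areas.values())
    best = counts.most_common()
    # Unique maximum: pick the most populated area.
    if len(best) == 1 or best[0][1] > best[1][1]:
        return best[0][0]
    # Tie: use the highest-priority user's area instead.
    for user in priority_order:
        if user in user_areas:
            return user_areas[user]
    raise ValueError("no users detected")
```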
Thus, through steps S402-S403, the smart TV can identify the user's first target location from the captured user image, and set the current wakeup parameter to the wakeup parameter corresponding to the first target location, so as to increase the user's chance of successfully waking up the smart TV at that location.
S404: Using the first wakeup parameter, the smart TV detects whether the voice signal input by the user contains the wakeup word.
After setting the current wakeup parameter to the first wakeup parameter corresponding to the user's location, the smart TV can collect the voice signal input by the user in real time according to the first wakeup parameter, and determine according to the first wakeup parameter whether the voice signal contains the preset wakeup word.
For example, the smart TV can collect the voice signal input by the user according to the first sound pickup direction and the first noise suppression parameter in the first wakeup parameter. The smart TV can then amplify the loudness of the voice signal according to the first amplification gain in the first wakeup parameter, and judge according to the first wakeup threshold in the first wakeup parameter whether the voice signal is valid voice input. If the loudness of the voice signal is greater than the first wakeup threshold, the voice signal is valid voice input, and the smart TV can go on to identify whether the voice signal contains the preset wakeup word.
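A simplified sketch of the threshold check described above. Real systems would measure loudness over windows of audio samples; the peak-amplitude measure used here is only an assumption:

```python
def is_valid_input(samples, gain, threshold):
    """Apply the amplification gain to a (noise-suppressed) voice
    signal and check its loudness against the wakeup threshold.

    Loudness is approximated as the peak absolute amplitude of the
    amplified samples; the real pipeline and its units are not
    specified in the text.
    """
    amplified = [s * gain for s in samples]
    loudness = max(abs(s) for s in amplified)
    return loudness > threshold
```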
Taking wakeup parameter 3 corresponding to wakeup area 3 in FIG. 9 as the first wakeup parameter: after the user walks into wakeup area 3, the smart TV can set the current wakeup parameter to wakeup parameter 3 according to the user's location information, and then collect the current voice signal in real time according to wakeup parameter 3. As shown in FIG. 10, if the user inputs the voice signal "小艺小艺" to the smart TV, then, since the wakeup threshold, sound pickup direction, noise suppression parameter, amplification gain, and other parameters set in wakeup parameter 3 are all matched to wakeup area 3, the smart TV can use wakeup parameter 3 to detect more quickly and accurately that the voice signal input by the user in wakeup area 3 contains the wakeup word "小艺小艺". After detecting that the user has input the correct wakeup word, the smart TV can launch the voice assistant APP to interact with the user by voice.
It can be seen that, by dynamically setting the corresponding wakeup parameter for the smart TV according to the user's location, the smart TV can be tuned to a high wakeup rate before the user even speaks. After the user speaks the wakeup word, the smart TV can use the corresponding wakeup parameter to identify the input wakeup word quickly and accurately, thereby waking up the smart TV. In this way, no matter which wakeup area the user walks into to input the wakeup word, the smart TV can detect the input wakeup word relatively accurately using the corresponding wakeup parameter, successfully wake up, and interact with the user by voice, improving the user experience in voice interaction scenarios.
In some embodiments, after the smart TV is successfully woken up, if it detects that the user has not left the current wakeup area, the smart TV can continue to use the first wakeup parameter set in step S403 to detect the voice signals input by the user. For example, after waking up the smart TV in wakeup area 3, the user can continue to input the voice signal "play the news". In this way, even after being successfully woken up, the smart TV can still identify the user's voice input quickly and accurately, improving the efficiency and accuracy of the voice interaction between the user and the smart TV.
In some scenarios, the user may change position while interacting with the smart TV by voice. Since the camera 501 of the smart TV can capture images within the shooting range in real time, a change in the user's position in the image indicates that the user has moved. The smart TV can then continue with steps S405-S407 below.
S405: The smart TV detects that the user has moved from the first target location to a second target location.
As the user moves within the shooting range of the camera 501 of the smart TV, the camera 501 can obtain captured pictures containing the user image in real time. As shown in FIG. 11, if the smart TV detects that, across the most recent N adjacent frames, the user image 601 has moved from point P to point Q in the captured picture, and the distance between P and Q is greater than a threshold, the smart TV can determine that the user has moved from the first target location to the second target location.
It should be noted that point P (or point Q) can be understood as the user's position in the captured picture, and may be any point within the user image 601. For example, the center of the user's face in the captured picture may be taken as point P (or point Q); or the center of the user's body; or the position where the user stands (or sits), which is not limited in the embodiments of this application.
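The frame-to-frame movement check of step S405 can be sketched as follows, comparing the user's anchor point across the most recent frames. Treating the anchor as a pixel coordinate and using Euclidean distance are assumptions:

```python
import math

def has_moved(positions, threshold_px):
    """Decide whether the user has moved between target locations.

    `positions` holds the user's anchor point (e.g. the face center,
    in pixels) over the most recent N frames; movement is declared
    when the first-to-last displacement exceeds the threshold. This
    is a sketch of the frame-comparison rule, not the patent's exact
    criterion.
    """
    if len(positions) < 2:
        return False
    (x0, y0), (x1, y1) = positions[0], positions[-1]
    return math.hypot(x1 - x0, y1 - y0) > threshold_px
```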
Similar to step S402, after determining that the user's position has changed, the smart TV can determine the coordinate B(X2, Y2) of the user's current second target location in the preset plane coordinate system from the position of the user image in the latest captured frame.
As another example, the smart TV may locate the user based on every captured frame containing the user image obtained by the camera 501, or may periodically obtain captured pictures containing the user image and locate the user. If the coordinate of the user's position determined this time differs from the coordinate determined last time, the user has moved from the first target location to the second target location.
S406: The smart TV obtains the second wakeup parameter corresponding to the second target location.
Similar to step S403, the smart TV can determine, from the coordinate B of the user's current second target location, the wakeup area to which the second target location belongs. If the wakeup area of the second target location is the same as the wakeup area of the first target location in step S403, for example both are wakeup area 3 shown in FIG. 9, the smart TV does not need to modify the current wakeup parameter, which remains wakeup parameter 3 corresponding to wakeup area 3.
Correspondingly, if the wakeup area of the second target location differs from the wakeup area of the first target location in step S403, for example, the first target location belongs to wakeup area 3 shown in FIG. 9 while the second target location belongs to wakeup area 2 shown in FIG. 9, the smart TV can change the current wakeup parameter to wakeup parameter 2 corresponding to wakeup area 2 (i.e., the second wakeup parameter), thereby ensuring the user's chance of successfully waking up the smart TV at the second target location.
S407: Using the second wakeup parameter, the smart TV detects whether the voice signal input by the user contains the wakeup word.
Similar to step S404, after setting the current wakeup parameter to the second wakeup parameter corresponding to the user's location, the smart TV can collect the voice signal input by the user in real time according to the second wakeup parameter, and determine according to the second wakeup parameter whether the voice signal contains the preset wakeup word.
Of course, if the voice assistant APP in the smart TV is already in the woken-up state when the user moves from the first target location to the second target location, the smart TV can continue to use the second wakeup parameter to detect the input voice signals. In this way, even after being successfully woken up, the smart TV can still identify the user's voice input quickly and accurately, improving the efficiency and accuracy of the voice interaction between the user and the smart TV.
In addition, if the smart TV detects that the user has left the shooting range of the camera 501, for example, when no user image exists in the most recent N consecutive frames captured by the camera 501, the smart TV can set the current wakeup parameter to the preset default wakeup parameter described above. Of course, the smart TV may also keep the current wakeup parameter set to the most recently and dynamically determined wakeup parameter, which is not limited in the embodiments of this application.
An embodiment of this application discloses an electronic device, including a processor, and a memory, an input device, and an output device connected to the processor. The input device may be a microphone, a touch sensor, a camera, etc.; the output device may be a display screen, a speaker, etc. For example, the input device and the output device may be integrated into one device; for instance, the touch sensor may serve as the input device, the display screen as the output device, and the touch sensor and display screen may be integrated into a touchscreen.
In this case, as shown in FIG. 12, the electronic device may include: one or more processors 1202; one or more cameras 1205; one or more microphones 1206; a memory 1203; a display screen 1207; one or more applications (not shown); and one or more computer programs 1204. These components may be connected through one or more communication buses 1201. The one or more computer programs 1204 are stored in the memory 1203 and configured to be executed by the one or more processors 1202; the one or more computer programs 1204 include instructions that may be used to perform the steps in the above embodiments. All related content of the steps in the above method embodiments can be cited in the functional description of the corresponding physical components, and will not be repeated here.
For example, the processor 1202 may specifically be the processor 110 shown in FIG. 3, the memory 1203 may specifically be the internal memory 121 shown in FIG. 3, the camera 1205 may specifically be the camera 140 shown in FIG. 3, and the microphone 1206 may specifically be the microphone 170B shown in FIG. 3, which is not limited in the embodiments of this application.
From the description of the above implementations, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional modules is used as an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.
The functional units in the embodiments of this application may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as flash memory, a removable hard disk, read-only memory, random access memory, a magnetic disk, or an optical disc.
The above are only specific implementations of the embodiments of this application, but the protection scope of the embodiments of this application is not limited thereto. Any variation or replacement within the technical scope disclosed in the embodiments of this application shall be covered by the protection scope of the embodiments of this application. Therefore, the protection scope of the embodiments of this application shall be subject to the protection scope of the claims.

Claims (18)

  1. A voice wakeup method, comprising:
    an electronic device obtaining an image captured by a camera;
    the electronic device determining that the image includes a user;
    in response to the electronic device determining that the image includes the user, the electronic device determining a first target location of a first user in the image;
    the electronic device processing a first voice input by the user; wherein, if the first target location belongs to a preset first area, the electronic device processes the first voice using a first parameter; if the first target location belongs to a preset second area, the electronic device processes the first voice using a second parameter, the second area being different from the first area and the second parameter being different from the first parameter; and
    if the processed first voice includes a preset wakeup word, the electronic device switching from a first state to a second state.
  2. The method according to claim 1, wherein the image includes multiple users, and before the electronic device determines the first target location of the first user in the image, the method further comprises:
    the electronic device determining the first user from the multiple users.
  3. The method according to claim 2, wherein the first user is the user with the highest priority among the multiple users.
  4. The method according to claim 2, wherein, if the number of users in the first area is the largest among N preset areas, the first user is one or more users in the first area, N being an integer greater than 1.
  5. The method according to any one of claims 1-4, wherein:
    the first parameter includes one or more of: a first wakeup threshold, a first sound pickup direction, a first noise suppression parameter, and a first amplification gain; and
    the second parameter includes one or more of: a second wakeup threshold, a second sound pickup direction, a second noise suppression parameter, and a second amplification gain.
  6. The method according to claim 5, wherein the electronic device processing the first voice using the first parameter comprises:
    the electronic device using the first wakeup threshold to judge whether to process the first voice; or
    the electronic device using the first sound pickup direction to collect the first voice; or
    the electronic device using the first noise suppression parameter to suppress noise in the first voice; or
    the electronic device amplifying the loudness of the first voice according to the first amplification gain.
  7. The method according to claim 5, wherein the electronic device processing the first voice using the second parameter comprises:
    the electronic device using the second wakeup threshold to judge whether to process the first voice; or
    the electronic device using the second sound pickup direction to collect the first voice; or
    the electronic device using the second noise suppression parameter to suppress noise in the first voice; or
    the electronic device amplifying the loudness of the first voice according to the second amplification gain.
  8. The method according to any one of claims 1-7, wherein the electronic device obtaining the image captured by the camera comprises:
    upon detecting that the electronic device is powered on, enters standby, is switched on, or starts playback, the electronic device starting to obtain each frame of image captured by the camera.
  9. An electronic device, comprising:
    one or more cameras;
    one or more microphones;
    one or more processors; and
    a memory;
    wherein the memory stores one or more computer programs, the one or more computer programs comprising instructions which, when executed by the electronic device, cause the electronic device to perform the following steps:
    obtaining an image captured by the camera;
    determining that the image includes a user;
    in response to the electronic device determining that the image includes the user, determining a first target location of a first user in the image;
    processing a first voice input by the user; wherein, if the first target location belongs to a preset first area, the first voice is processed using a first parameter; if the first target location belongs to a preset second area, the first voice is processed using a second parameter, the second area being different from the first area and the second parameter being different from the first parameter; and
    if the processed first voice includes a preset wakeup word, switching from a first state to a second state.
  10. The electronic device according to claim 9, wherein the image includes multiple users, and before the electronic device determines the first target location of the first user in the image, the electronic device is further configured to:
    determine the first user from the multiple users.
  11. The electronic device according to claim 10, wherein the first user is the user with the highest priority among the multiple users.
  12. The electronic device according to claim 10, wherein, if the number of users in the first area is the largest among N preset areas, the first user is one or more users in the first area, N being an integer greater than 1.
  13. The electronic device according to any one of claims 9-12, wherein the first parameter includes one or more of: a first wakeup threshold, a first sound pickup direction, a first noise suppression parameter, and a first amplification gain; and the second parameter includes one or more of: a second wakeup threshold, a second sound pickup direction, a second noise suppression parameter, and a second amplification gain.
  14. The electronic device according to claim 13, wherein the electronic device processing the first voice using the first parameter specifically comprises:
    using the first wakeup threshold to judge whether to process the first voice; or
    using the first sound pickup direction to collect the first voice; or
    using the first noise suppression parameter to suppress noise in the first voice; or
    amplifying the loudness of the first voice according to the first amplification gain.
  15. The electronic device according to claim 13, wherein the electronic device processing the first voice using the second parameter specifically comprises:
    using the second wakeup threshold to judge whether to process the first voice; or
    using the second sound pickup direction to collect the first voice; or
    using the second noise suppression parameter to suppress noise in the first voice; or
    amplifying the loudness of the first voice according to the second amplification gain.
  16. The electronic device according to any one of claims 9-15, wherein the electronic device obtaining the image captured by the camera specifically comprises:
    upon detecting that the electronic device is powered on, enters standby, is switched on, or starts playback, starting to obtain each frame of image captured by the camera.
  17. A computer-readable storage medium storing instructions which, when run on an electronic device, cause the electronic device to perform the voice wakeup method according to any one of claims 1-8.
  18. A computer program product containing instructions which, when run on an electronic device, causes the electronic device to perform the voice wakeup method according to any one of claims 1-8.
PCT/CN2020/103130 2019-07-25 2020-07-20 A voice wakeup method and electronic device WO2021013137A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910677390.9A CN110415695A (zh) 2019-07-25 2019-07-25 A voice wakeup method and electronic device
CN201910677390.9 2019-07-25

Publications (1)

Publication Number Publication Date
WO2021013137A1 true WO2021013137A1 (zh) 2021-01-28

Family

ID=68363209

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/103130 WO2021013137A1 (zh) 2019-07-25 2020-07-20 一种语音唤醒方法及电子设备

Country Status (2)

Country Link
CN (1) CN110415695A (zh)
WO (1) WO2021013137A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724704A (zh) * 2021-08-30 2021-11-30 Shenzhen Skyworth-RGB Electronics Co., Ltd. Voice acquisition method, apparatus, terminal, and storage medium
CN114333017A (zh) * 2021-12-29 2022-04-12 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Dynamic sound pickup method, apparatus, electronic device, and storage medium
CN115171699A (zh) * 2022-05-31 2022-10-11 Qingdao Haier Technology Co., Ltd. Method and apparatus for adjusting wakeup parameters, storage medium, and electronic apparatus
CN117711395A (zh) * 2023-06-30 2024-03-15 Honor Device Co., Ltd. Voice interaction method and electronic device

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110415695A (zh) * 2019-07-25 2019-11-05 华为技术有限公司 一种语音唤醒方法及电子设备
CN110706707B (zh) * 2019-11-13 2020-09-18 百度在线网络技术(北京)有限公司 用于语音交互的方法、装置、设备和计算机可读存储介质
CN110933345B (zh) * 2019-11-26 2021-11-02 深圳创维-Rgb电子有限公司 一种降低电视待机功耗的方法、电视机及存储介质
CN113141285B (zh) * 2020-01-19 2022-04-29 海信集团有限公司 一种沉浸式语音交互方法及系统
CN113452583B (zh) * 2020-03-25 2023-06-02 阿里巴巴集团控股有限公司 账户切换方法和系统、存储介质及处理设备
CN111862947A (zh) * 2020-06-30 2020-10-30 百度在线网络技术(北京)有限公司 用于控制智能设备的方法、装置、电子设备和计算机存储介质
CN111951787A (zh) * 2020-07-31 2020-11-17 北京小米松果电子有限公司 语音输出方法、装置、存储介质和电子设备
CN112650086A (zh) * 2020-08-27 2021-04-13 合肥恒烁半导体有限公司 一种mcu芯片唤醒电路
CN112201257A (zh) * 2020-09-29 2021-01-08 北京百度网讯科技有限公司 基于声纹识别的信息推荐方法、装置、电子设备及存储介质
CN112164405B (zh) * 2020-11-05 2024-04-23 佛山市顺德区美的电子科技有限公司 语音设备及其唤醒方法、装置以及存储介质
CN114566171A (zh) * 2020-11-27 2022-05-31 华为技术有限公司 一种语音唤醒方法及电子设备
CN116360583A (zh) * 2021-12-28 2023-06-30 华为技术有限公司 设备控制方法及相关装置
CN114245267B (zh) * 2022-02-27 2022-07-08 北京荣耀终端有限公司 多设备协同工作的方法、系统及电子设备

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160027576A (ko) * 2014-09-01 2016-03-10 유형근 Face-recognition interactive digital signage apparatus
CN106096373A (zh) * 2016-06-27 2016-11-09 Qihan Technology Co., Ltd. Method and apparatus for interaction between a robot and a user
CN106127156A (zh) * 2016-06-27 2016-11-16 Shanghai Yuanqu Information Technology Co., Ltd. Robot interaction method based on voiceprint and face recognition
US20190057716A1 (en) * 2011-07-18 2019-02-21 Nuance Communications, Inc. System and Method for Enhancing Speech Activity Detection Using Facial Feature Detection
CN109587552A (zh) * 2018-11-26 2019-04-05 Guangdong OPPO Mobile Telecommunications Co., Ltd. Sound-effect processing method and apparatus for video characters, mobile terminal, and storage medium
CN109710080A (zh) * 2019-01-25 2019-05-03 Huawei Technologies Co., Ltd. Screen control and voice control method and electronic device
US20190198007A1 (en) * 2017-12-26 2019-06-27 International Business Machines Corporation Initiating synthesized speech outpout from a voice-controlled device
CN109949810A (zh) * 2019-03-28 2019-06-28 Huawei Technologies Co., Ltd. Voice wakeup method, apparatus, device, and medium
CN109976506A (zh) * 2017-12-28 2019-07-05 Shenzhen UBTECH Technology Co., Ltd. Wakeup method for an electronic device, storage medium, and robot
CN110415695A (zh) * 2019-07-25 2019-11-05 Huawei Technologies Co., Ltd. A voice wakeup method and electronic device



Also Published As

Publication number Publication date
CN110415695A (zh) 2019-11-05

Similar Documents

Publication Publication Date Title
WO2021013137A1 (zh) A voice wakeup method and electronic device
US20220223150A1 (en) Voice wakeup method and device
CN110364151B (zh) 一种语音唤醒的方法和电子设备
CN109213732B (zh) 一种改善相册分类的方法、移动终端及计算机可读存储介质
CN112289313A (zh) 一种语音控制方法、电子设备及系统
US20210319782A1 (en) Speech recognition method, wearable device, and electronic device
CN111369988A (zh) 一种语音唤醒方法及电子设备
US20220394454A1 (en) Data Processing Method, BLUETOOTH Module, Electronic Device, and Readable Storage Medium
CN111696562B (zh) 语音唤醒方法、设备及存储介质
US11620995B2 (en) Voice interaction processing method and apparatus
WO2020019355A1 (zh) 一种可穿戴设备的触控方法、可穿戴设备及系统
WO2021180085A1 (zh) 拾音方法、装置和电子设备
CN110096865B (zh) 下发验证方式的方法、装置、设备及存储介质
CN112634895A (zh) 语音交互免唤醒方法和装置
WO2022161077A1 (zh) 语音控制方法和电子设备
WO2020077508A1 (zh) 一种对内部存储器动态调频的方法及电子设备
CN111625175B (zh) 触控事件处理方法、触控事件处理装置、介质与电子设备
CN114520002A (zh) 一种处理语音的方法及电子设备
WO2023197709A1 (zh) 器件识别方法和相关装置
CN109285563B (zh) 在线翻译过程中的语音数据处理方法及装置
EP4258259A1 (en) Wakeup method and electronic device
CN114116610A (zh) 获取存储信息的方法、装置、电子设备和介质
CN111681654A (zh) 语音控制方法、装置、电子设备及存储介质
WO2023005844A1 (zh) 设备唤醒方法、相关装置及通信系统
US20240111478A1 (en) Video Recording Method and Electronic Device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20844084

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20844084

Country of ref document: EP

Kind code of ref document: A1