WO2022068694A1 - Electronic device and wake-up method thereof

Info

Publication number: WO2022068694A1
Application number: PCT/CN2021/120305
Authority: WO
Inventors: 孙渊; 屈伸; 许天亮
Original assignee: 华为技术有限公司
Priority date: 2020-09-30
Filing date: 2021-09-24
Publication date: 2022-04-07
Also published as: CN114360546A

Abstract

Provided is a wake-up method, comprising: receiving a sound, and calculating the confidence of a wake-up word for the sound (801); if the wake-up word confidence is greater than or equal to a first threshold, then calculating a sound source orientation of the sound (802), and determining whether the sound source orientation is in a first orientation set or a second orientation set (803); if the sound source orientation is in the first orientation set, then according to the confidence of a first orientation corresponding to a first orientation in the first orientation set, determining whether to wake up the electronic device (804); if the sound source orientation is in the second orientation set, then according to the confidence of a second orientation corresponding to a second orientation in the second orientation set, determining whether to wake up the electronic device (805). The method reduces the probability of erroneous wake-up of the electronic device and improves user experience. Also provided are an electronic device (100) and a computer storage medium.

Description

Electronic device and wake-up method thereof

Technical field

The present application relates to the field of terminal technologies, and in particular, to an electronic device and a wake-up method thereof.

Background

Electronic devices can perform functions through voice interaction with the user. Such electronic devices include pickups (e.g., microphone arrays) and speakers (e.g., loudspeakers), and therefore have sound pickup and playback functions. Examples include smart speakers, smartphones, and smart TVs. Before the user can interact with the electronic device by voice, the electronic device needs to be woken up; waking up moves the electronic device from the standby state to the working state. Generally, the electronic device determines whether to wake up by recognizing whether the received sound contains a preset wake-up word.

Take a smart speaker whose wake-up word is "Xiaoyi Xiaoyi" as an example. If the user utters a sound containing "Xiaoyi Xiaoyi", the smart speaker detects "Xiaoyi Xiaoyi" in the received sound and wakes up. The smart speaker may also play a wake-up response, such as "Xiaoyi is here, what can I do for you", to interact with the user by voice. However, in some scenarios, the user or another device emits a sound that does not contain "Xiaoyi Xiaoyi", yet the smart speaker is woken up by mistake. For example, while the user is watching TV, neither the user nor the TV produces "Xiaoyi Xiaoyi", but the smart speaker is still woken up by mistake. This disturbs the user's normal life and requires the user to take the extra step of turning off the smart speaker, which results in a poor user experience.

Summary of the invention

In order to solve the above technical problems existing in the prior art, the present application provides an electronic device and a wake-up method thereof, which can reduce the false wake-up rate of the electronic device and improve user experience.

In a first aspect, a wake-up method is provided. The method is applied to an electronic device comprising a pickup and a speaker, the pickup including a plurality of microphones. The method includes: receiving a sound; calculating a wake-up word confidence of the sound, where the wake-up word confidence indicates the probability that the sound includes a wake-up word; after the wake-up word confidence is greater than or equal to a first threshold, calculating the sound source orientation of the sound; after the sound source orientation matches one of the first orientations in a first orientation set, and after the first orientation confidence corresponding to the matched first orientation is greater than or equal to a third threshold, waking up the electronic device; or, after the first orientation confidence corresponding to the matched first orientation is less than the third threshold, not waking up the electronic device. The wake-up word is used to wake up the electronic device. The sound source orientation is the direction and position of the sound source relative to the electronic device. The first orientation set includes M first orientation elements, and each first orientation element includes a first orientation and a first orientation confidence. The first orientation is the direction and position, relative to the electronic device, of a sound source that woke up the electronic device, and indicates that the electronic device has been woken up from the first orientation. The first orientation confidence indicates the probability that the electronic device is woken up from the first orientation. M is a positive integer greater than or equal to 1. In this way, when the wake-up word confidence of the sound is greater than or equal to the first threshold, possible false wake-ups are further screened out according to the first orientation set, thereby reducing the false wake-up probability of the electronic device and improving the user experience.
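As an illustration only, the decision described in the first aspect can be sketched in Python. The tuple encoding of an orientation, the names `FirstOrientationElement` and `first_aspect_decision`, and the caller-supplied `matches` predicate are assumptions made for this sketch and do not come from the application. Cases that the passage above does not address return `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Assumed encoding: (direction in degrees, distance in metres) relative to the device.
Orientation = Tuple[float, float]


@dataclass
class FirstOrientationElement:
    orientation: Orientation  # an orientation from which the device has been woken up
    confidence: float         # probability that a sound from this orientation wakes the device


def first_aspect_decision(
    wake_word_confidence: float,
    source_orientation: Orientation,
    first_set: Sequence[FirstOrientationElement],
    first_threshold: float,
    third_threshold: float,
    matches: Callable[[Orientation, Orientation], bool],
) -> Optional[bool]:
    """Return True (wake), False (do not wake), or None (not covered by the passage)."""
    if wake_word_confidence < first_threshold:
        return None  # the orientation is not calculated in this case
    for element in first_set:
        if matches(source_orientation, element.orientation):
            # Matched: the stored orientation confidence decides the outcome.
            return element.confidence >= third_threshold
    return None  # no first orientation matched
```

With a set holding one high-confidence and one low-confidence orientation, a sound matching the former yields `True` and a sound matching the latter yields `False`; the threshold values themselves are left to the caller because the summary does not give them.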

According to the first aspect, calculating the sound source orientation of the sound after the wake-up word confidence is greater than or equal to the first threshold includes: calculating the sound source orientation of the sound after the wake-up word confidence is greater than or equal to the first threshold and less than a second threshold. In this way, by setting the second threshold, only the cases in which the wake-up word confidence lies between the first threshold and the second threshold are singled out for further screening, which improves the processing efficiency of the electronic device.
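The two-threshold gating can be sketched as follows. This is a minimal illustration, not the application's implementation: the `Gate` enum and the function name are hypothetical, and the direct wake-up branch reflects the implementation stated later in this summary, in which a confidence at or above the second threshold wakes the device without the orientation check.

```python
from enum import Enum


class Gate(Enum):
    BELOW_FIRST = "below the first threshold"
    CHECK_ORIENTATION = "calculate the sound source orientation"
    WAKE_DIRECTLY = "wake up without the orientation check"


def gate_by_confidence(confidence: float, first_threshold: float, second_threshold: float) -> Gate:
    """Classify a wake-up word confidence against the two thresholds."""
    if first_threshold > second_threshold:
        raise ValueError("the first threshold must not exceed the second threshold")
    if confidence >= second_threshold:
        return Gate.WAKE_DIRECTLY
    if confidence >= first_threshold:
        # Only this band pays the cost of sound source localization.
        return Gate.CHECK_ORIENTATION
    return Gate.BELOW_FIRST
```

Only the middle band triggers the orientation calculation, which is where the stated efficiency gain would come from.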

According to the first aspect, or any implementation of the first aspect above, the sound source orientation matching one of the first orientations in the first orientation set includes: the angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of that first orientation relative to the electronic device is within a preset fourth threshold; and the position deviation between the position of the sound source orientation relative to the electronic device and the position of that first orientation relative to the electronic device is within a preset fifth threshold.
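A possible reading of this matching rule is sketched below. The application does not say how direction and position are encoded, so the sketch assumes a direction in degrees and a position as (x, y) coordinates in metres; the wrap-around handling at 360 degrees, the inclusive comparison, and the function names are likewise assumptions.

```python
import math
from typing import Tuple

# Assumed encoding: (direction in degrees, (x, y) position in metres), both relative to the device.
Orientation = Tuple[float, Tuple[float, float]]


def angular_deviation(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def orientation_matches(
    source: Orientation,
    candidate: Orientation,
    fourth_threshold_deg: float,
    fifth_threshold_m: float,
) -> bool:
    """Both the angular deviation and the position deviation must be within their thresholds."""
    (source_dir, source_pos), (cand_dir, cand_pos) = source, candidate
    angle_ok = angular_deviation(source_dir, cand_dir) <= fourth_threshold_deg
    position_ok = math.dist(source_pos, cand_pos) <= fifth_threshold_m
    return angle_ok and position_ok
```

Requiring both conditions means that two sources in the same direction but at clearly different distances are treated as different orientations.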

According to the first aspect, or any implementation of the first aspect above, after the sound source orientation does not match any first orientation in the first orientation set, a voiceprint is extracted from the sound; after the voiceprint matches a first voiceprint in a first voiceprint set, and after the first voiceprint confidence corresponding to that first voiceprint is greater than or equal to a preset sixth threshold, the electronic device is woken up; or, after the first voiceprint confidence corresponding to that first voiceprint is less than the preset sixth threshold, the electronic device is not woken up. The first voiceprint set includes L voiceprint elements, and each voiceprint element includes a first voiceprint and a first voiceprint confidence. The first voiceprint represents a voiceprint that woke up the electronic device, and the first voiceprint confidence represents the probability that the first voiceprint wakes up the electronic device; L is a positive integer greater than or equal to 1. In this way, if a possible false wake-up cannot be screened out according to the first orientation set, it is further screened through the first voiceprint set, thereby reducing the false wake-up probability of the electronic device and improving the user experience.
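The voiceprint fallback can be sketched as follows. The application does not describe how voiceprints are represented or compared, so this sketch assumes fixed-length embedding vectors compared by cosine similarity; the `match_similarity` cut-off, the class name, and the function names are hypothetical. A voiceprint that matches nothing returns `None`, because the passage above does not state the outcome for that case.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class VoiceprintElement:
    embedding: Sequence[float]  # assumed representation of a stored first voiceprint
    confidence: float           # probability that this voiceprint wakes the device


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def voiceprint_fallback(
    voiceprint: Sequence[float],
    first_voiceprint_set: Sequence[VoiceprintElement],
    sixth_threshold: float,
    match_similarity: float = 0.8,  # hypothetical matching criterion
) -> Optional[bool]:
    """Return True (wake), False (do not wake), or None (no stored voiceprint matched)."""
    for element in first_voiceprint_set:
        if cosine_similarity(voiceprint, element.embedding) >= match_similarity:
            return element.confidence >= sixth_threshold
    return None
```

This function would only be reached after the orientation check has failed to find a matching first orientation.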

According to the first aspect, or any implementation manner of the above first aspect, after waking up the electronic device, the method further includes: updating the first orientation set and the first voiceprint set.

According to the first aspect, or any implementation manner of the above first aspect, after the wake-up word confidence is greater than or equal to the second threshold, the electronic device is woken up, and the first orientation set and the first voiceprint set are updated.

According to the first aspect, or any implementation of the first aspect above, waking up the electronic device and updating the first orientation set and the first voiceprint set after the wake-up word confidence is greater than or equal to the second threshold includes: after the wake-up word confidence is greater than or equal to the second threshold, waking up the electronic device and creating a first orientation set and a first voiceprint set; the orientation that woke up the electronic device is included in the first orientation set, and the voiceprint that woke up the electronic device is included in the first voiceprint set; an initial orientation confidence is assigned to each orientation included in the first orientation set, and an initial voiceprint confidence is assigned to each voiceprint included in the first voiceprint set.
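The creation of the two sets on a high-confidence wake-up can be sketched as follows. The class and function names are hypothetical, and the initial confidence value of 0.5 is an arbitrary placeholder: the summary states that initial confidences are assigned but does not give their values or the rule by which later updates change them.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Element:
    value: Any         # an orientation or a voiceprint
    confidence: float  # orientation confidence or voiceprint confidence


@dataclass
class WakeHistory:
    first_orientation_set: List[Element] = field(default_factory=list)
    first_voiceprint_set: List[Element] = field(default_factory=list)


def record_direct_wake(
    history: WakeHistory,
    orientation: Any,
    voiceprint: Any,
    initial_orientation_confidence: float = 0.5,  # placeholder value
    initial_voiceprint_confidence: float = 0.5,   # placeholder value
) -> WakeHistory:
    """Store the orientation and voiceprint of a wake-up whose confidence reached the second threshold."""
    history.first_orientation_set.append(Element(orientation, initial_orientation_confidence))
    history.first_voiceprint_set.append(Element(voiceprint, initial_voiceprint_confidence))
    return history
```

Starting from an empty `WakeHistory()` covers the "creating" case in the text; calling the function again on an existing history simply adds further elements.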

In a second aspect, a wake-up method is provided. The method is applied to an electronic device comprising a pickup and a speaker, the pickup including a plurality of microphones. The method includes: receiving a sound; calculating a wake-up word confidence of the sound, where the wake-up word confidence indicates the probability that the sound includes a wake-up word; after the wake-up word confidence is greater than or equal to a first threshold, calculating the sound source orientation of the sound; after the sound source orientation matches one of the second orientations in a second orientation set, and after the second orientation confidence corresponding to the matched second orientation is greater than or equal to a seventh threshold, waking up the electronic device; or, after the second orientation confidence corresponding to the matched second orientation is less than the seventh threshold, not waking up the electronic device. The wake-up word is used to wake up the electronic device. The sound source orientation is the direction and position of the sound source relative to the electronic device. The second orientation set includes N second orientation elements, and each second orientation element includes a second orientation and a second orientation confidence. The second orientation is the direction and position, relative to the electronic device, of a sound source that did not wake up the electronic device, and indicates that the electronic device was not woken up from the second orientation. The second orientation confidence indicates the probability that the electronic device is not woken up from the second orientation. N is a positive integer greater than or equal to 1. In this way, when the wake-up word confidence of the sound is greater than or equal to the first threshold, possible false wake-ups are further screened out according to the second orientation set, thereby reducing the false wake-up probability of the electronic device and improving the user experience.

According to the second aspect, the sound source orientation matching one of the second orientations in the second orientation set includes: the angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of that second orientation relative to the electronic device is within a preset eighth threshold; and the position deviation between the position of the sound source orientation relative to the electronic device and the position of that second orientation relative to the electronic device is within a preset ninth threshold.

According to the second aspect, or any implementation of the second aspect above, the method further includes: after the sound source orientation does not match any second orientation in the second orientation set, extracting a voiceprint from the sound; after the voiceprint does not match any first voiceprint in the first voiceprint set, updating the second orientation set. The first voiceprint set includes L voiceprint elements, and each voiceprint element includes a first voiceprint and a first voiceprint confidence; the first voiceprint represents a voiceprint that woke up the electronic device, and the first voiceprint confidence represents the probability that the first voiceprint wakes up the electronic device; L is a positive integer greater than or equal to 1. In this way, if a possible false wake-up cannot be screened out according to the second orientation set, it is further screened through the first voiceprint set, thereby reducing the false wake-up probability of the electronic device and improving the user experience.

According to the second aspect, or any implementation of the second aspect above, the method further includes: after the sound source orientation does not match any second orientation in the second orientation set, extracting a voiceprint from the sound; after the voiceprint matches one of the first voiceprints in the first voiceprint set, and after the first voiceprint confidence corresponding to that first voiceprint is greater than or equal to a preset tenth threshold, waking up the electronic device; or, after the first voiceprint confidence corresponding to that first voiceprint is less than the preset tenth threshold, not waking up the electronic device and updating the second orientation set. The first voiceprint set includes L voiceprint elements, and each voiceprint element includes a first voiceprint and a first voiceprint confidence; the first voiceprint represents a voiceprint that woke up the electronic device, and the first voiceprint confidence represents the probability that the first voiceprint wakes up the electronic device; L is a positive integer greater than or equal to 1.

According to the second aspect, or any implementation of the second aspect above, after waking up the electronic device, the method further includes: updating the first voiceprint set; after not waking up the electronic device, the method further includes: updating the second orientation set.

According to the second aspect, or any implementation manner of the above second aspect, after the wake-up word confidence is greater than or equal to the second threshold, the electronic device is woken up, and the first voiceprint set is updated. In this way, by setting the second threshold, the situation where the confidence level of the wake-up word is between the first threshold and the second threshold is screened out, so that the processing efficiency of the electronic device can be improved.

In a third aspect, a wake-up method is provided. The method is applied to an electronic device comprising a pickup and a speaker, the pickup including a plurality of microphones. The method includes: receiving a sound; calculating a wake-up word confidence of the sound, where the wake-up word confidence indicates the probability that the sound includes a wake-up word; after the wake-up word confidence is greater than or equal to a first threshold, calculating the sound source orientation of the sound; after the sound source orientation matches a second orientation in a second orientation set, and after the sound source orientation does not match any first orientation in a first orientation set, and after the second orientation confidence corresponding to the matched second orientation is greater than or equal to an eleventh threshold, waking up the electronic device; or, after the second orientation confidence corresponding to the matched second orientation is less than the eleventh threshold, not waking up the electronic device. The wake-up word is used to wake up the electronic device. The sound source orientation is the direction and position of the sound source relative to the electronic device. The first orientation set includes M first orientation elements, and each first orientation element includes a first orientation and a first orientation confidence; the first orientation is the direction and position, relative to the electronic device, of a sound source that woke up the electronic device, and indicates that the electronic device has been woken up from the first orientation; the first orientation confidence indicates the probability that the electronic device is woken up from the first orientation. The second orientation set includes N second orientation elements, and each second orientation element includes a second orientation and a second orientation confidence; the second orientation is the direction and position, relative to the electronic device, of a sound source that did not wake up the electronic device, and indicates that the electronic device was not woken up from the second orientation; the second orientation confidence indicates the probability that the electronic device is not woken up from the second orientation. M and N are both positive integers greater than or equal to 1. In this way, when the wake-up word confidence of the sound is greater than or equal to the first threshold, possible false wake-ups are further screened out according to the first orientation set and the second orientation set, thereby reducing the false wake-up probability of the electronic device and improving the user experience.

According to the third aspect, the sound source orientation matching a second orientation in the second orientation set includes: the angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of that second orientation relative to the electronic device is within a preset twelfth threshold; and the position deviation between the position of the sound source orientation relative to the electronic device and the position of that second orientation relative to the electronic device is within a preset thirteenth threshold. The sound source orientation not matching any first orientation in the first orientation set includes: the angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of any first orientation in the first orientation set relative to the electronic device is not within a preset fourteenth threshold; and the position deviation between the position of the sound source orientation relative to the electronic device and the position of any first orientation in the first orientation set relative to the electronic device is not within a preset fifteenth threshold.

According to the third aspect, or any implementation of the third aspect above, the method further includes: after the sound source orientation matches a first orientation in the first orientation set, and after the sound source orientation does not match any second orientation in the second orientation set, and after the first orientation confidence corresponding to the matched first orientation is greater than or equal to a sixteenth threshold, waking up the electronic device; or, after the first orientation confidence corresponding to the matched first orientation is less than the sixteenth threshold, not waking up the electronic device.

According to the third aspect, or any implementation of the third aspect above, the sound source orientation matching one of the first orientations in the first orientation set includes: the angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of that first orientation relative to the electronic device is within the preset fourteenth threshold; and the position deviation between the position of the sound source orientation relative to the electronic device and the position of that first orientation relative to the electronic device is within the preset fifteenth threshold. The sound source orientation not matching any second orientation in the second orientation set includes: the angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of any second orientation in the second orientation set relative to the electronic device is not within the preset twelfth threshold; and the position deviation between the position of the sound source orientation relative to the electronic device and the position of any second orientation in the second orientation set relative to the electronic device is not within the preset thirteenth threshold.

According to the third aspect, or any implementation of the third aspect above, the method further includes: after the sound source orientation does not match any second orientation in the second orientation set, and after the sound source orientation does not match any first orientation in the first orientation set, extracting a voiceprint from the sound; after the voiceprint matches a first voiceprint in the first voiceprint set, and after the first voiceprint confidence corresponding to that first voiceprint is greater than or equal to a preset sixteenth threshold, waking up the electronic device and updating the first orientation set and the first voiceprint set; or, after the first voiceprint confidence corresponding to that first voiceprint is less than the preset sixteenth threshold, not waking up the electronic device and updating the second orientation set. The first voiceprint set includes L voiceprint elements, and each voiceprint element includes a first voiceprint and a first voiceprint confidence; the first voiceprint represents a voiceprint that woke up the electronic device, and the first voiceprint confidence represents the probability that the first voiceprint wakes up the electronic device; L is a positive integer greater than or equal to 1. In this way, if a possible false wake-up cannot be screened out according to the first orientation set and the second orientation set, it is further screened through the first voiceprint set, thereby reducing the false wake-up probability of the electronic device and improving the user experience.

According to the third aspect, or any implementation manner of the above third aspect, the method further includes: after the voiceprint does not match any one of the first voiceprints in the first voiceprint set, updating the second orientation set.

According to the third aspect, or any implementation manner of the above third aspect, the method further includes: after waking up the electronic device, updating the first set of orientations; after not waking up the electronic device, updating the second set of orientations.

In a fourth aspect, an electronic device is provided. The electronic device includes a pickup and a speaker, and the pickup includes a plurality of microphones. The electronic device further includes: a processor; a memory; and a computer program, wherein the computer program is stored in the memory, and when the computer program is executed by the processor, the electronic device performs the method described in the first aspect and any implementation of the first aspect, the second aspect and any implementation of the second aspect, and the third aspect and any implementation of the third aspect.

For the technical effects corresponding to the fourth aspect and any implementation of the fourth aspect, refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, the second aspect and any implementation of the second aspect, and the third aspect and any implementation of the third aspect; details are not repeated here.

In a fifth aspect, a computer-readable storage medium is provided. The computer-readable storage medium includes a computer program that, when run on an electronic device, causes the electronic device to perform the method described in the first aspect and any implementation of the first aspect, the second aspect and any implementation of the second aspect, and the third aspect and any implementation of the third aspect, wherein the electronic device includes a pickup and a speaker, and the pickup includes a plurality of microphones.

For the technical effects corresponding to the fifth aspect and any implementation of the fifth aspect, refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, the second aspect and any implementation of the second aspect, and the third aspect and any implementation of the third aspect; details are not repeated here.

In a sixth aspect, a computer program product is provided. When the computer program product runs on a computer, it causes the computer to perform the method described in the first aspect and any implementation of the first aspect, the second aspect and any implementation of the second aspect, and the third aspect and any implementation of the third aspect.

For the technical effects corresponding to the sixth aspect and any implementation of the sixth aspect, refer to the technical effects corresponding to the first aspect and any implementation of the first aspect, the second aspect and any implementation of the second aspect, and the third aspect and any implementation of the third aspect; details are not repeated here.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.

FIG. 1 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a software structure of an electronic device provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a scenario of a wake-up method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a graphical user interface set by a user in a wake-up method provided by an embodiment of the present application;

FIG. 5 is a flowchart of an embodiment of a wake-up method provided by an embodiment of the present application;

FIG. 6 is a flowchart of another embodiment of the wake-up method provided by the embodiment of the present application;

FIG. 7 is a flowchart of another embodiment of the wake-up method provided by the embodiment of the present application;

FIG. 8 is a flowchart of another embodiment of the wake-up method provided by the embodiment of the present application;

FIG. 9 is a flowchart of another embodiment of the wake-up method provided by the embodiment of the present application;

FIG. 10 is a flowchart of another embodiment of the wake-up method provided by the embodiment of the present application;

FIG. 11 is a schematic structural composition diagram of an electronic device provided by an embodiment of the present application.

Detailed description

The terms used in the following embodiments are for the purpose of describing particular embodiments only, and are not intended to limit the present application. As used in the specification of this application and the appended claims, the singular expressions "a", "an", "said", "the above", "the", and "this" are intended to also include expressions such as "one or more", unless the context clearly dictates otherwise. It should also be understood that, in the following embodiments of the present application, "at least one" and "one or more" refer to one, two, or more than two. The term "and/or" describes an association relationship between related objects and indicates that three kinds of relationships are possible; for example, A and/or B can indicate: A alone, both A and B, or B alone, where A and B can each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects.

References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in other embodiments", and the like appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise. The terms "including", "comprising", "having", and their variants mean "including but not limited to" unless specifically emphasized otherwise. The term "connected" includes both direct and indirect connections unless otherwise specified.

The terms used in the embodiments of the present application are only used to explain specific embodiments of the present application, and are not intended to limit the present application.

In one example, the probability of false wake-up of the electronic device is reduced by optimizing the wake-up word model preset in the electronic device. The main function of the wake-up word model is to detect the wake-up word in the sound picked up by the electronic device and to obtain the probability that the sound contains the wake-up word. The wake-up word model is a trained machine learning model. For example, a model for detecting wake-up words can be established in advance, and the wake-up word model can be obtained by training that model with samples. The pre-established model may be a neural network model, a Gaussian mixture model, a hidden Markov model, or the like. The samples may be sounds containing wake-up words, phoneme sequences of such sounds, audio features of such sounds, or the like. Sounds containing wake-up words can be recorded by different people in different scenarios, so that the trained wake-up word model can detect wake-up words in sounds in various scenarios. However, the sounds recorded in different scenarios include not only wake-up words but may also include noise (such as non-wake-up words). If such sounds are used as samples to train the wake-up word model, the model will be polluted, so that it may recognize sounds containing non-wake-up words as sounds containing wake-up words, resulting in false wake-up. Taking a smart speaker with the wake-up word "Xiaoyi Xiaoyi" as an example, after a wake-up word model trained by the above method is set in the smart speaker, the model may detect sounds picked up by the smart speaker whose pronunciation is similar to, or even completely different from, "Xiaoyi Xiaoyi" as sounds containing the wake-up word, thus causing the smart speaker to be woken up by mistake.

In order to minimize the pollution of the wake-up word model caused by unclean samples, and the resulting false wake-up of electronic devices, the wake-up word model needs to be continuously optimized. Specifically, the wake-up word model is iteratively optimized through data annotation. However, data annotation requires manual labeling of the sounds used as samples, which consumes considerable human resources, and the optimized wake-up word model still has a certain probability of false wake-up. To this end, embodiments of the present application provide an electronic device and a wake-up method, which can reduce the false wake-up probability of the electronic device and improve user experience.

The electronic device provided by the embodiment of the present application is an electronic device with a function of picking up sound and a function of broadcasting voice. For example: smart speakers, smart phones, tablet computers, personal computers (PCs), wearable devices (such as smart glasses, smart watches, and smart bracelets), smart home appliances such as smart TVs and smart screens, intelligent connected vehicles (ICVs), smart/intelligent cars, in-vehicle equipment, and the like.

Exemplarily, FIG. 1 shows a schematic structural diagram of an electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.

It can be understood that the structures illustrated in the embodiments of the present invention do not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, or combine some components, or separate some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. Exemplarily, the electronic device 100 may be a smart speaker. The smart speaker may include: a processor 110, an internal memory 121, a speaker 170A, and a microphone 170C.

The processor 110 may include one or more processing units. For example, the processor 110 may include some or all of an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). Different processing units may be independent devices, or may be integrated in one or more processors.

In some embodiments, the processor 110 may include one or more interfaces. The interface may include an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, and the like.

The I2S interface can be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled with the audio module 170 through an I2S bus to implement communication between the processor 110 and the audio module 170 . In some embodiments, the audio module 170 can transmit sound to the wireless communication module 160 through the I2S interface, so as to realize the function of answering calls through a Bluetooth headset.

The PCM interface can also be used for audio communications, sampling, quantizing and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 can also transmit sound to the wireless communication module 160 through the PCM interface, so as to realize the function of answering calls through a Bluetooth headset. Both the I2S interface and the PCM interface can be used for audio communication.

It can be understood that the interface connection relationship between the modules illustrated in the embodiment of the present invention is only a schematic illustration, and does not constitute a structural limitation of the electronic device 100 . In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners in the foregoing embodiments, or a combination of multiple interface connection manners.

The electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, files such as music and videos are saved in the external memory card.

Internal memory 121 may be used to store computer executable program code, which includes instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like. The storage data area may store data (such as audio data, phone book, etc.) created during the use of the electronic device 100 and the like. In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like. The processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.

The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone jack 170D, the application processor, and the like.

Audio module 170 is used to convert digital audio information to analog sound output, and also to convert analog audio input to digital sound. Audio module 170 may also be used to encode and decode sound. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .

The speaker 170A, also referred to as a "loudspeaker", is used to convert audio electrical signals into sound signals. The user can listen to music or to a hands-free call through the speaker 170A of the electronic device 100.

The receiver 170B, also referred to as an "earpiece", is used to convert audio electrical signals into sound signals. When the electronic device 100 answers a call or plays a voice message, the voice can be heard by placing the receiver 170B close to the human ear.

The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions.

The earphone jack 170D is used to connect wired earphones. The earphone jack 170D can be the USB interface 130, a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.

Motor 191 can generate vibrating cues. The motor 191 can be used for vibrating alerts for incoming calls, and can also be used for touch vibration feedback. For example, touch operations acting on different applications (such as taking pictures, playing audio, etc.) can correspond to different vibration feedback effects. The motor 191 can also correspond to different vibration feedback effects for touch operations on different areas of the display screen 194 . Different application scenarios (for example: time reminder, receiving information, alarm clock, games, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect can also support customization.

The indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.

The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100.

FIG. 2 is a block diagram of a software structure of an electronic device 100 according to an embodiment of the present invention.

The layered architecture divides the software into several layers, and each layer has a clear role and division of labor. Layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.

The application layer can include a series of application packages.

As shown in Figure 2, the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and so on.

The application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer. The application framework layer includes some predefined functions.

As shown in Figure 2, the application framework layer may include window managers, content providers, view systems, telephony managers, resource managers, notification managers, and the like.

A window manager is used to manage window programs. The window manager can get the size of the display screen, determine whether there is a status bar, lock the screen, take screenshots, etc.

Content providers are used to store and retrieve data and make these data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone book, etc.

The view system includes visual controls, such as controls for displaying text, controls for displaying pictures, and so on. View systems can be used to build applications. A display interface can consist of one or more views. For example, the display interface including the short message notification icon may include a view for displaying text and a view for displaying pictures.

The phone manager is used to provide the communication function of the electronic device 100 . For example, the management of call status (including connecting, hanging up, etc.).

The resource manager provides various resources for the application, such as localization strings, icons, pictures, layout files, video files and so on.

The notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages, and can disappear automatically after a brief pause without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc. The notification manager can also display notifications in the status bar at the top of the system in the form of graphs or scroll bar text, such as notifications of applications running in the background, and notifications on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt sound is issued, the electronic device vibrates, and the indicator light flashes.

Android Runtime includes core libraries and a virtual machine. Android runtime is responsible for scheduling and management of the Android system.

The core library consists of two parts: one is the functions that the Java language needs to call, and the other is the core library of Android.

The application layer and the application framework layer run in virtual machines. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.

A system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), 3D graphics processing library (eg: OpenGL ES), 2D graphics engine (eg: SGL), etc.

The Surface Manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.

The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files. The media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.

The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.

2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software. The kernel layer contains at least display drivers, camera drivers, audio drivers, and sensor drivers.

For ease of understanding, the following embodiments of the present application will take the electronic device having the structure shown in FIG. 1 and FIG. 2 as an example, and combine the drawings and application scenarios to specifically describe the methods provided by the embodiments of the present application. It should be noted that although FIG. 2 is used as an example for the software structure of the electronic device, the software structure shown in FIG. 2 is only a schematic example, and the software structures of other operating systems are also applicable to the wake-up method provided in this embodiment of the present application.

For convenience of description, the wake-up method provided by the embodiment of the present application is illustrated by taking as an example the case in which the electronic device is a smart speaker located in a home environment. FIG. 3 is a schematic diagram of a scenario of a wake-up method provided by an embodiment of the present application. As shown in FIG. 3, in addition to the smart speaker, the home environment is also equipped with other devices with a function of broadcasting sound, such as a TV and a traditional speaker, and furniture such as a sofa and a dining table. Users can move around the sofa, dining table, etc., and wake up the smart speaker by uttering a voice containing a wake-up word. Smart speakers can also be placed in other scenarios, such as shopping malls and office environments. By executing the wake-up method of the embodiment of the present application, the false wake-up probability of the smart speaker can also be reduced in those scenarios. Hereinafter, the specific implementation of the wake-up method provided by the embodiments of the present application will be described.

When the smart speaker has not been woken up and is in the standby state, it picks up sound in the environment. The picked-up sound includes the voice of the target speaker, that is, the user, and also includes noise signals in the environment. For this reason, noise reduction processing is generally performed on the received sound to obtain a clean sound, which is used as the sound for triggering the execution of the wake-up method in the embodiment of the present application.

In the wake-up method of the embodiment of the present application, at least one of a wake-up orientation set, a false wake-up orientation set and a voiceprint set is set in the smart speaker. The false wake-up orientation set includes false wake-up orientations and the confidence levels of the false wake-up orientations. Hereinafter, an element in the false wake-up orientation set is represented by (false wake-up orientation, confidence). The false wake-up orientation is used to record the sound source orientation of a sound that did not wake up the smart speaker. The confidence level of the false wake-up orientation is used to describe the probability that a sound that wakes up the smart speaker is emitted from the false wake-up orientation. The magnitude of the confidence value indicates the probability: for example, the larger the value, the higher the probability, and the smaller the value, the lower the probability. The orientation in the embodiments of the present application refers to the direction and position relative to the smart speaker; for example, the sound source orientation refers to the direction and position of the sound source relative to the smart speaker.

The wake-up orientation set includes wake-up orientations and the confidence levels of the wake-up orientations. Hereinafter, an element in the wake-up orientation set is represented by (wake-up orientation, confidence). The wake-up orientation is used to record the sound source orientation of a sound that woke up the smart speaker. The confidence of the wake-up orientation is used to describe the probability that a sound that wakes up the smart speaker is emitted from the wake-up orientation.

The voiceprint set includes wake-up voiceprints and the confidence levels of the wake-up voiceprints. The confidence of a wake-up voiceprint can be represented by the number of hits of the wake-up voiceprint. Hereinafter, an element in the voiceprint set is represented by (wake-up voiceprint, confidence). The wake-up voiceprint is used to record the user voiceprint of a sound that woke up the smart speaker. The confidence of the wake-up voiceprint is used to record the probability that a sound with the wake-up voiceprint wakes up the smart speaker. The number of hits is used to record the number of times that sounds with the wake-up voiceprint have woken up the smart speaker. The user voiceprint and the wake-up voiceprint can be represented by the parameter values of voiceprint feature parameters, which may include, but are not limited to, intensity, wavelength, frequency, rhythm, and the like. Different voiceprints differ in the parameter value of at least one voiceprint feature parameter.
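As an illustrative, non-limiting sketch, the three sets described above could be held in memory as follows. The class names, field names, units (metres, degrees) and sample values are assumptions made only for this example and are not specified by the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Orientation:
    """Sound source orientation relative to the smart speaker."""
    distance_m: float  # distance from the coordinate origin (unit assumed)
    angle_deg: float   # angle from the positive x-axis (unit assumed)


@dataclass
class OrientationEntry:
    """One (orientation, confidence) element of an orientation set."""
    orientation: Orientation
    confidence: float


@dataclass
class VoiceprintEntry:
    """One (wake-up voiceprint, confidence) element of the voiceprint set."""
    features: Dict[str, float]  # voiceprint feature parameters (names illustrative)
    hits: int = 0               # confidence represented as a number of hits


@dataclass
class WakeupSets:
    """Container for the three sets; each may initially be empty."""
    wakeup_orientations: List[OrientationEntry] = field(default_factory=list)
    false_wakeup_orientations: List[OrientationEntry] = field(default_factory=list)
    voiceprints: List[VoiceprintEntry] = field(default_factory=list)


# Hypothetical contents: one wake-up orientation, one false wake-up
# orientation, and one wake-up voiceprint.
sets = WakeupSets()
sets.wakeup_orientations.append(OrientationEntry(Orientation(2.0, 30.0), 0.8))
sets.false_wakeup_orientations.append(OrientationEntry(Orientation(3.5, 120.0), 0.1))
sets.voiceprints.append(VoiceprintEntry({"intensity": 62.0, "frequency": 180.0}))
```

Keeping the confidence next to each orientation or voiceprint mirrors the (element, confidence) notation used in the description.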

The possible representation methods and calculation methods of the sound source orientation are described as follows. Optionally, a coordinate system of the smart speaker can be established. For example, the origin of the coordinate system may be the physical center point of the smart speaker, and the positive direction of the x-axis may be the direction pointing horizontally to the front of the smart speaker. This method for establishing the coordinate system is only an example, and is not intended to limit the method for establishing the coordinate system of the smart speaker. The sound source orientation can be represented by a distance and an angle in the above-mentioned coordinate system. Specifically, the distance records the distance between the sound source of the sound and the origin of the coordinate system of the smart speaker. The angle records the angle between the ray pointing from the origin of the coordinate system to the sound source and the positive direction of the x-axis. Optionally, a height parameter may be further added to the sound source orientation. The height records the vertical distance between the sound source and the origin of the coordinate system. Information such as the distance and angle of the sound source orientation can be calculated by the smart speaker based on a sound source localization method. The sound source localization method can calculate the relative position between the sound source and the smart speaker, such as the distance and angle, based on a microphone array composed of at least two microphones set in the smart speaker. Specifically, sound source localization methods may include, but are not limited to: steerable beamforming technology based on maximum output power, high-resolution spectral estimation technology, and sound source localization technology based on time-delay estimation (TDE). Taking the TDE-based algorithm as an example, its core lies in the accurate estimation of the propagation delay. The propagation delay is generally obtained by performing cross-correlation processing on the sounds picked up by the microphone array of the smart speaker. After that, the distance between the smart speaker and the sound source can be calculated by simple delay-and-sum, by geometric calculation, or by a steered response power search that directly uses the cross-correlation results. The specific algorithms are not described one by one in the embodiments of the present application.
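The core TDE step, estimating the propagation delay by cross-correlation, can be sketched as follows. This is a minimal illustration rather than the localization algorithm of the embodiments: it assumes only two microphones and a far-field source, uses a brute-force cross-correlation, and yields only an arrival angle measured from the broadside of the microphone pair (estimating distance would require more microphones or a near-field model). The function names, the 0.1 m spacing, the 16 kHz sample rate and the 343 m/s speed of sound are all assumptions.

```python
import math


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a.

    Brute-force cross-correlation: the lag with the largest correlation wins.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(sig_a)):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def angle_from_delay(lag_samples, sample_rate_hz, mic_spacing_m,
                     speed_of_sound=343.0):
    """Far-field arrival angle (degrees from broadside) for a two-mic pair."""
    tau = lag_samples / sample_rate_hz
    # Clamp so that rounding error cannot push asin outside its domain.
    ratio = max(-1.0, min(1.0, speed_of_sound * tau / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# Synthetic example: microphone B hears the same pulse 3 samples after A.
pulse = [0.0, 1.0, 0.5, -0.3, 0.0]
mic_a = [0.0] * 10 + pulse + [0.0] * 15
mic_b = [0.0] * 13 + pulse + [0.0] * 12
lag = estimate_delay(mic_a, mic_b, max_lag=8)
angle = angle_from_delay(lag, sample_rate_hz=16000, mic_spacing_m=0.1)
```

In practice the cross-correlation is usually computed in the frequency domain for efficiency; the nested loop is used here only to keep the sketch dependency-free.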

The initial setting of the sets: in the initial case, for example, when the smart speaker has not been used since leaving the factory or has been restored to factory settings, at least one of the false wake-up orientation set, the wake-up orientation set, and the voiceprint set preset in the smart speaker can be empty. In the process of using the smart speaker, the user can set at least one of the false wake-up orientation set, the wake-up orientation set and the voiceprint set based on the environment in which the smart speaker is located, or can leave them unset. If the user does not make settings, user operations can be reduced and user experience can be improved.

The method of setting the above sets is described by way of example. Since the false wake-up orientation records the sound source orientation of a sound that does not wake up the smart speaker, it generally corresponds to the orientation, relative to the smart speaker, of other devices in the environment that can emit sound. Based on this, the false wake-up orientation can be set based on the orientation of such devices relative to the smart speaker, and the user or the smart speaker can set an initial confidence level for the false wake-up orientation. Taking the home environment shown in FIG. 3 as an example, when the smart speaker has a display screen, a setting interface for the false wake-up orientation can be provided to the user on the display screen of the smart speaker. For example, as shown in FIG. 4, the user can set one false wake-up orientation and its confidence based on the relative position between the TV and the smart speaker, set another false wake-up orientation and its confidence based on the relative position between the traditional speaker and the smart speaker, and then click the "OK" control. Correspondingly, the smart speaker detects the user's operation on the "OK" control in the setting interface, obtains information such as the false wake-up orientations and confidence levels from the setting interface, and saves it in the false wake-up orientation set. Optionally, if the smart speaker does not include a display screen or its display screen is inconvenient for the user to operate, the setting interface can be displayed to the user by another device associated with the smart speaker (such as the user's smartphone), and that device sends the information obtained from the setting interface, such as the false wake-up orientations and confidence levels, to the smart speaker.

Since the wake-up orientation records the sound source orientation of a sound that wakes up the smart speaker, it generally corresponds to the orientation, relative to the smart speaker, of the positions in the environment from which the user often wakes up the smart speaker. Based on this, the user can set the wake-up orientation based on the orientation of such positions relative to the smart speaker, and the user or the smart speaker can set an initial confidence level for the wake-up orientation. Taking the home environment shown in FIG. 3 as an example, users often move around the sofa, the dining table, etc. and wake up the smart speaker from there. Therefore, one or more wake-up orientations and corresponding confidence levels can be set based on positions on the sofa relative to the smart speaker, and one or more wake-up orientations and corresponding confidence levels can be set based on positions near the dining table, such as the positions of the dining chairs, relative to the smart speaker. For a specific setting method, reference may be made to the setting method of the false wake-up orientation shown in FIG. 4, which will not be repeated here.

The wake-up voiceprint can be set by the user by recording the voice. Correspondingly, the smart speaker obtains the user's voiceprint according to the sound obtained by recording the voice, and sets it as the wake-up voiceprint. The user or the smart speaker sets the initial confidence level of the wake-up voiceprint. For example, if the confidence is the number of hits, the initial confidence may be 0.

Update of the sets: during the use of the smart speaker, when the smart speaker is woken up, the wake-up orientation set can be updated according to the sound source orientation of the sound that woke up the smart speaker, and the voiceprint set can be updated according to the user voiceprint extracted from that sound; when the smart speaker is not woken up, the false wake-up orientation set is updated according to the sound source orientation of the sound that did not wake up the smart speaker.

Based on the sound that woke up the smart speaker, the smart speaker calculates the sound source orientation of the sound and determines whether the wake-up orientation set includes that sound source orientation. If it does, the confidence of the wake-up orientation corresponding to the sound source orientation is increased; if not, the sound source orientation is added to the wake-up orientation set as a new wake-up orientation, and an initial confidence level is set for the newly added wake-up orientation. When the smart speaker judges whether the wake-up orientation set includes the sound source orientation of the sound, the sound source orientation may be completely consistent with a wake-up orientation, or may deviate from it to a certain extent. For example, when the wake-up orientation and the sound source orientation are each represented by (distance, angle), a distance threshold and an angle threshold can be preset separately. If the distance difference between the sound source orientation and a wake-up orientation 1 is within the distance threshold, and the angle difference is within the angle threshold, it can be determined that the wake-up orientation set includes the sound source orientation. The wake-up orientation 1 may be referred to as the wake-up orientation corresponding to the sound source orientation, or as the wake-up orientation that includes the sound source orientation. Correspondingly, the confidence of the wake-up orientation 1 is increased. It should be noted that the embodiment of the present application does not limit the value of the initial confidence level, nor the amount by which the confidence of the wake-up orientation is increased each time. For example, the amount may be a fixed value, a fixed percentage of the confidence level, or the like. Similarly, the embodiments of the present application do not limit the specific values of the preset distance threshold and angle threshold. The distance threshold and the angle threshold may be determined based on the accuracy required of the wake-up method, the accuracy of the sound source orientation calculation method, and the like. Specifically, the higher the accuracy required of the wake-up method, the smaller the distance threshold and the angle threshold; the higher the accuracy of the sound source orientation calculation method, the smaller the distance threshold and the angle threshold. In addition, the distance threshold and the angle threshold expand each wake-up orientation in the wake-up orientation set from a point to an area, so they can be set based on the size of the desired area. The distance threshold and the angle threshold can also be adjusted by the user of the smart speaker as needed.
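The threshold-based matching and update of the wake-up orientation set described above can be sketched as follows. The threshold values, the initial confidence, the fixed increment, the dictionary layout and the wrap-around handling of angles near 0/360 degrees are assumptions chosen for illustration; the embodiments leave these values open.

```python
def angle_diff(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def find_match(entries, source, dist_thresh, angle_thresh):
    """Return the first entry whose orientation lies within both thresholds."""
    for entry in entries:
        dist, ang = entry["orientation"]
        if (abs(dist - source[0]) <= dist_thresh
                and angle_diff(ang, source[1]) <= angle_thresh):
            return entry
    return None


def update_wakeup_set(entries, source, dist_thresh=0.5, angle_thresh=10.0,
                      initial_conf=0.5, step=0.1):
    """Raise the confidence of a matching wake-up orientation, or add a new one.

    `source` is a (distance, angle) tuple for the sound that woke the speaker.
    """
    entry = find_match(entries, source, dist_thresh, angle_thresh)
    if entry is not None:
        entry["confidence"] += step  # fixed increment (assumed policy)
    else:
        entries.append({"orientation": source, "confidence": initial_conf})
    return entries


wakeup_set = [{"orientation": (2.0, 30.0), "confidence": 0.5}]
update_wakeup_set(wakeup_set, (2.2, 35.0))   # within thresholds: confidence raised
update_wakeup_set(wakeup_set, (4.0, 200.0))  # no match: new entry added
```

Because a match is accepted anywhere inside the two thresholds, each stored orientation effectively covers an area rather than a single point, which is the effect noted in the description.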

Based on the sound that woke up the smart speaker, the smart speaker can extract the user voiceprint of the sound and determine whether the extracted user voiceprint is included among the wake-up voiceprints of the voiceprint set. If it is, the confidence of the corresponding wake-up voiceprint is increased; otherwise, the user voiceprint is added to the voiceprint set as a new wake-up voiceprint, and a confidence is set for the newly added wake-up voiceprint. Similar to the judgment for the wake-up orientation set, when judging whether the voiceprint set includes the extracted user voiceprint, a certain error between the user voiceprint and the wake-up voiceprint can be allowed. For example, a threshold may be set for each voiceprint feature included in the voiceprint. As long as the difference between the value of each voiceprint feature of the user voiceprint and the value of the corresponding voiceprint feature of a wake-up voiceprint is smaller than the threshold for that voiceprint feature, it can be considered that the voiceprint set includes the user voiceprint, and that wake-up voiceprint is the wake-up voiceprint corresponding to the user voiceprint.
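The per-feature threshold comparison can be illustrated with the following sketch. The feature names, threshold values, and the choice of 0 as the confidence of a newly added voiceprint are assumptions for this example; real voiceprint features would typically be higher-dimensional.

```python
def voiceprint_matches(user_vp, wake_vp, thresholds):
    """True if every feature differs by less than its per-feature threshold."""
    return all(
        abs(user_vp[name] - wake_vp[name]) < thresholds[name]
        for name in thresholds
    )


def update_voiceprint_set(voiceprints, user_vp, thresholds):
    """Count a hit on the matching wake-up voiceprint, or add a new one."""
    for entry in voiceprints:
        if voiceprint_matches(user_vp, entry["features"], thresholds):
            entry["hits"] += 1  # confidence represented as a number of hits
            return entry
    new_entry = {"features": dict(user_vp), "hits": 0}  # initial value assumed
    voiceprints.append(new_entry)
    return new_entry


# Hypothetical features and thresholds.
thresholds = {"intensity": 5.0, "frequency": 20.0}
voiceprints = [{"features": {"intensity": 60.0, "frequency": 180.0}, "hits": 0}]

# Close to the stored voiceprint in every feature: counted as a hit.
update_voiceprint_set(voiceprints, {"intensity": 62.0, "frequency": 190.0}, thresholds)
# Frequency differs by more than its threshold: stored as a new voiceprint.
update_voiceprint_set(voiceprints, {"intensity": 61.0, "frequency": 260.0}, thresholds)
```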

Based on the sound that did not wake up the smart speaker, the smart speaker calculates the sound source orientation of the sound and judges whether the false wake-up orientation set includes that sound source orientation. If it does, the confidence level of the false wake-up orientation corresponding to the sound source orientation is reduced; if not, the sound source orientation is added to the false wake-up orientation set as a new false wake-up orientation, and an initial confidence level is set for the newly added false wake-up orientation. For the implementation of the update of the false wake-up orientation set, reference may be made to the description of the update of the wake-up orientation set, which will not be repeated here.

The wake-up method of the embodiment of the present application determines whether to execute the wake-up process based on the wake-up word confidence output by the wake-up word model, the false wake-up location set and/or the wake-up location set, and the voiceprint set, thereby reducing the probability of false wake-up. The wake-up method will be described in detail below.

In one embodiment, the smart speaker includes a pickup and a speaker. Wherein, the pickup includes a microphone array, and the microphone array includes a plurality of microphones.

As shown in FIG. 5, the smart speaker is preset with a wake-up orientation set (also referred to as a first orientation set), a false wake-up orientation set (also referred to as a second orientation set) and a voiceprint set (also referred to as a first voiceprint set). The wake-up method in this embodiment of the present application may include:

Step 501: The smart speaker picks up the sound in the environment to obtain the sound.

Since smart speakers generally pick up sounds in the environment continuously, smart speakers generally divide the continuously picked up sounds into audio segments for a certain duration. The sound in the embodiment of the present application generally refers to the divided audio segment. The specific duration of the audio segment is not limited in this embodiment of the present application.

In order to reduce the influence of noise on subsequent processing, the smart speaker generally performs noise reduction processing on the sound before executing step 502, so as to suppress noise signals in the sound and obtain a relatively clean sound. In this way, the sound used in step 502 is generally the sound after noise reduction processing.

Since the smart speaker continuously picks up sound, in order to reduce the data processing amount and power consumption of the smart speaker, preset conditions, such as a sound intensity threshold, can be set for the picked-up sound. Only for a sound that meets the preset conditions is the wake-up word confidence calculated based on the wake-up word model, which triggers the subsequent processing. The specific preset conditions are not limited in this embodiment of the present application.
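The segmentation and the intensity pre-check described above might look like the following sketch. The fixed segment length, the use of root-mean-square amplitude as the intensity measure, and all function names are assumptions made for illustration; the present application does not limit the segment duration or the preset condition.

```python
import math
from typing import Iterator, List, Sequence


def split_segments(samples: Sequence[float], segment_len: int) -> Iterator[List[float]]:
    """Yield consecutive fixed-length segments; an incomplete tail is dropped."""
    for start in range(0, len(samples) - segment_len + 1, segment_len):
        yield list(samples[start:start + segment_len])


def rms(segment: Sequence[float]) -> float:
    """Root-mean-square amplitude, used here as the intensity measure."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))


def segments_to_score(samples: Sequence[float], segment_len: int,
                      intensity_threshold: float) -> List[List[float]]:
    """Keep only the segments loud enough to be passed to the wake-up word model."""
    return [seg for seg in split_segments(samples, segment_len)
            if rms(seg) >= intensity_threshold]
```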

Step 502: The smart speaker calculates the wake-up word confidence of the sound based on the wake-up word model.

The wake word confidence is used to describe the probability that the sound includes the wake word sound.

Step 503 : the smart speaker determines whether the confidence level of the wake-up word is less than the first threshold; if the confidence level of the wake-up word is not less than the first threshold, step 504 is executed.

Further, step 503 further includes: if it is less than the first threshold, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

It should be noted that the judgment in step 503 can also be performed by a wake-up word model, so that the wake-up word model can output two parameters, a judgment result of whether to wake up and a wake-up word confidence, which are not limited in this embodiment of the present application.

The wake word confidence is used to describe the probability that the sound includes the wake word sound. The higher the wake word confidence, the greater the probability that the sound includes the wake word sound. The wake-up method of the embodiment of the present application further performs the following steps 504 to 511 to further determine whether to execute the wake-up process, thereby realizing the screening of false wake-ups and reducing the probability of false wake-ups.

Step 504: The smart speaker determines whether the confidence level of the wake-up word is less than the second threshold, and the second threshold is greater than the first threshold. If it is not less than the second threshold, execute the wake-up process, and execute step 511; if it is less than the second threshold, execute step 505 .

In the embodiment of the present application, the case in which the wake-up word confidence is not less than the first threshold is further divided into two cases by the second threshold. If the wake-up word confidence is not less than the second threshold, the sound has a high probability of including the wake-up word sound and the probability of false wake-up is low, so the wake-up process is executed directly to wake up the smart speaker. If the wake-up word confidence is less than the second threshold and not less than the first threshold, the probability that the sound includes the wake-up word sound is relatively low and the probability of false wake-up is relatively high, so the following steps 506 to 509 are performed, in which the wake-up orientation set, the false wake-up orientation set, or the voiceprint set is further used to determine whether to execute the wake-up process. For example, the value range of the wake-up word confidence is (0, 100), the first threshold is 30, and the second threshold is 80. Correspondingly, if the wake-up word confidence is less than 30, the wake-up process is not executed; if the wake-up word confidence is not less than 80, the wake-up process is executed directly; if the wake-up word confidence is less than 80 and not less than 30, the following steps 505 to 509 are executed to further screen out possible false wake-ups.
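The three-way split produced by the two thresholds can be illustrated with a short sketch using the example values 30 and 80 from the preceding paragraph. The `Triage` enumeration and the function name are hypothetical.

```python
from enum import Enum


class Triage(Enum):
    REJECT = "do not execute the wake-up process"
    WAKE = "execute the wake-up process directly"
    SCREEN = "screen further using the sets"


def triage(wake_word_conf: float,
           first_threshold: float = 30,
           second_threshold: float = 80) -> Triage:
    """Classify a wake-up word confidence as in steps 503 and 504."""
    if second_threshold <= first_threshold:
        raise ValueError("the second threshold must be greater than the first")
    if wake_word_conf < first_threshold:
        return Triage.REJECT
    if wake_word_conf >= second_threshold:
        return Triage.WAKE
    return Triage.SCREEN
```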

The wake-up orientation set, the false wake-up orientation set, or the voiceprint set may include at least one set element, and each set element includes at least two units. For example, a set element of the wake-up orientation set includes a wake-up orientation and the confidence corresponding to the wake-up orientation; a set element of the false wake-up orientation set includes a false wake-up orientation and the confidence corresponding to the false wake-up orientation; a set element of the voiceprint set includes a voiceprint and the confidence corresponding to the voiceprint.
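One possible in-memory layout for these set elements is sketched below. The class and field names, and the choice of an azimuth angle and a feature tuple as the stored units, are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OrientationEntry:
    azimuth_deg: float   # the recorded orientation
    confidence: float    # the confidence corresponding to that orientation


@dataclass
class VoiceprintEntry:
    features: Tuple[float, ...]  # the recorded voiceprint
    confidence: float            # the confidence corresponding to that voiceprint


@dataclass
class WakeSets:
    wake_orientations: List[OrientationEntry] = field(default_factory=list)
    false_wake_orientations: List[OrientationEntry] = field(default_factory=list)
    voiceprints: List[VoiceprintEntry] = field(default_factory=list)
```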

It should be noted that there is no order of execution between the steps of executing the wake-up process in step 504 and step 511 . In FIG. 5 , the wake-up process is executed first and then step 511 is executed as an example.

Both the first threshold and the second threshold may be preset.

Step 505: The smart speaker calculates the sound source orientation of the sound.

The calculation method of the sound source azimuth has been described in the foregoing description, and will not be repeated here.

Step 506: The smart speaker judges whether the false wake-up orientation set includes the sound source orientation of the sound, and judges whether the wake-up orientation set includes the sound source orientation of the sound; if only the false wake-up orientation set includes the sound source orientation, go to step 507; if only the wake-up orientation set includes the sound source orientation, go to step 508; in any other case, go to step 509.

Step 507: The smart speaker judges whether to execute the wake-up process according to the confidence of the false wake-up orientation corresponding to the sound source orientation; if so, execute the wake-up process, and execute step 511; if not, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

Among them, the smart speaker determines whether to execute the wake-up process according to the confidence of the false wake-up position corresponding to the sound source position, which may include:

If the confidence of the false wake-up position is less than the threshold a, it is judged that the wake-up process is not executed;

If the confidence of the false wake-up position is not less than the threshold a, judge to execute the wake-up process;

If the confidence of the false wake-up orientation is less than the threshold a, the probability that the sound is a noise signal is relatively high, so it is judged that the wake-up process is not to be executed, that is, the smart speaker is not woken up, thereby reducing the probability of false wake-up.

Step 508: The smart speaker judges whether to execute the wake-up process according to the confidence of the wake-up orientation corresponding to the sound source orientation; if so, execute the wake-up process, and execute step 511; if not, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

Among them, the smart speaker determines whether to execute the wake-up process according to the confidence of the wake-up position corresponding to the sound source position, which may include:

If the confidence of the wake-up orientation is less than the threshold b, it is judged that the wake-up process is not executed;

If the confidence of the wake-up orientation is not less than the threshold b, judge to execute the wake-up process;

If the confidence of the wake-up orientation is less than the threshold b, the probability that the sound is a noise signal is relatively high, so it is judged that the wake-up process is not to be executed, that is, the smart speaker is not woken up, thereby reducing the probability of false wake-up.

Step 509: The smart speaker extracts the user voiceprint of the sound, and determines whether the wake-up voiceprints of the voiceprint set include the extracted user voiceprint; if so, go to step 510; if not, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

Step 510: The smart speaker judges whether to execute the wake-up process according to the confidence of the wake-up voiceprint corresponding to the user's voiceprint; if so, execute the wake-up process, and execute step 511; if not, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

Wherein, after the determination result is yes, the execution sequence between the wake-up process and step 511 is not limited.

Among them, the smart speaker determines whether to execute the wake-up process according to the confidence of the wake-up voiceprint corresponding to the user's voiceprint, which may include:

If the confidence of the wake-up voiceprint is less than the threshold c, it is judged that the wake-up process is not executed;

If the confidence of the wake-up voiceprint is not less than the threshold c, judge to execute the wake-up process;

If the confidence of the wake-up voiceprint is less than the threshold c, the sound is more likely to have been made by a person who does not often appear in the environment, so it is judged that the wake-up process is not to be executed, that is, the smart speaker is not woken up, thereby reducing the probability of false wake-up.
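The branching of steps 506 to 510 can be summarized in a short sketch. Here `None` is used, as an assumption of this illustration, to mean that the corresponding set has no entry matching the sound; the default thresholds a = 0.5, b = 0.6 and c = 5 are the example values used later in this description.

```python
from typing import Optional


def screen_wake(false_conf: Optional[float],
                wake_conf: Optional[float],
                voiceprint_conf: Optional[float],
                a: float = 0.5, b: float = 0.6, c: float = 5) -> bool:
    """Return True if the wake-up process should be executed.

    false_conf / wake_conf: confidence of the matching false wake-up or
    wake-up orientation, or None if that set has no matching orientation.
    voiceprint_conf: confidence of the matching wake-up voiceprint, or None.
    """
    in_false = false_conf is not None
    in_wake = wake_conf is not None
    if in_false and not in_wake:       # step 507
        return false_conf >= a
    if in_wake and not in_false:       # step 508
        return wake_conf >= b
    # Step 509: the orientation is in both sets or in neither.
    if voiceprint_conf is None:
        return False
    return voiceprint_conf >= c        # step 510
```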

Step 511: The smart speaker updates the wake-up orientation set according to the sound source orientation of the sound, and updates the voiceprint set according to the user voiceprint of the sound, and this branch process ends.

For the implementation of this step, reference may be made to the foregoing description about the set update, which will not be repeated here.

It should be noted that, in the embodiment shown in FIG. 5, the second threshold may not be set, that is, step 504 is not executed and step 505 is executed directly. Or, in another possible implementation, the judgment in step 504 can be moved into steps 507, 508, and 510 in FIG. 5, where the smart speaker additionally uses the wake-up word confidence to determine whether to execute the wake-up process. For the specific judgment criteria, reference may be made to the judgment criteria shown in FIG. 5. Taking step 507 as an example, step 507 is replaced by: the smart speaker judges whether to execute the wake-up process according to the wake-up word confidence and the confidence of the false wake-up orientation corresponding to the sound source orientation; if the judgment result is yes, the wake-up process is executed, and step 511 is executed; if the judgment result is no, the wake-up process is not executed, and step 512 is executed. In this case, in step 507, the smart speaker judging whether to execute the wake-up process according to the wake-up word confidence and the confidence of the false wake-up orientation corresponding to the sound source orientation may include:

If the confidence level of the wake-up word is not less than the second threshold, it is judged to execute the wake-up process;

If the confidence of the wake-up word is less than the second threshold, and the confidence of the false wake-up orientation is less than the first orientation confidence threshold, it is judged that the wake-up process is not executed;

If the confidence of the wake-up word is less than the second threshold, and the confidence of the false wake-up orientation is not less than the first orientation confidence threshold, it is judged that the wake-up process is executed.
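The three conditions of this variant of step 507 reduce to a single function, sketched below with hypothetical parameter names.

```python
def step_507_combined(wake_word_conf: float,
                      false_orientation_conf: float,
                      second_threshold: float,
                      orientation_conf_threshold: float) -> bool:
    """Variant of step 507 with the step 504 judgment folded in.

    Returns True if the wake-up process should be executed.
    """
    if wake_word_conf >= second_threshold:
        # High wake-up word confidence: wake up regardless of the orientation.
        return True
    # Otherwise the false wake-up orientation confidence decides.
    return false_orientation_conf >= orientation_conf_threshold
```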

It should be noted that in FIG. 5, the smart speaker updates the wake-up orientation set and the voiceprint set every time it judges to execute the wake-up process, and updates the false wake-up orientation set every time it judges not to execute the wake-up process. To reduce the data processing amount and power consumption of the smart speaker, the above sets need not be updated after every judgment; instead, they may be updated only after certain judgments selected according to a certain rule, which is not limited in the embodiments of the present application.

After judging to execute the wake-up process, the smart speaker updates the wake-up orientation set and the voiceprint set; after judging not to execute the wake-up process, it updates the false wake-up orientation set. In this way, as the smart speaker is used over time, the wake-up orientations recorded in the wake-up orientation set can correspond to the positions in the environment from which the user often utters the wake-up word, the false wake-up orientations recorded in the false wake-up orientation set can correspond to the positions of other sound-emitting devices in the environment, and the wake-up voiceprints with high confidence recorded in the voiceprint set can correspond to the voiceprints of users who frequently wake up the smart speaker, so that the wake-up method of the embodiment of the present application can better achieve the effect of reducing the probability of false wake-up.

For example, assume that the smart speaker is placed in a home environment for use after leaving the factory, that the smart speaker is preset with a wake-up orientation set, a false wake-up orientation set and a voiceprint set, and that the three sets are each empty; the first threshold is 0.4, the second threshold is 0.7, the threshold a is 0.5, the threshold b is 0.6, and the threshold c is 5; then,

Assume that the TV placed in the environment is loud, and the smart speaker picks up the sound of the TV to obtain sound 1. The wake-up word confidence of sound 1, calculated based on the wake-up word model, is 0.1, which is less than the preset first threshold of 0.4. The smart speaker therefore takes the branch of step 503 in which the judgment result is yes: no wake-up is performed, the sound source orientation 1 of sound 1 is added to the false wake-up orientation set as false wake-up orientation 1, and an initial confidence, for example 0.8, is set for it. The probability that the TV sound includes the wake-up word sound is very small. Therefore, when the smart speaker picks up the sound of the TV again to obtain sound 2 and calculates the wake-up word confidence of sound 2 to be 0.2, it again takes the branch of step 503 in which the judgment result is yes and reduces the confidence of false wake-up orientation 1 in the false wake-up orientation set. As the smart speaker performs the above process many times, the confidence of false wake-up orientation 1 keeps decreasing. Once the confidence has fallen below the threshold a of 0.5, for example to 0.45, then even if the smart speaker occasionally picks up a TV sound, sound n, whose wake-up word confidence lies between the first threshold of 0.4 and the second threshold of 0.7, for example 0.55, the smart speaker executes steps 503, 504, and 505 in sequence and, judging that the confidence of false wake-up orientation 1 (for example, 0.48) is less than the threshold a of 0.5, does not execute the wake-up process. A possible false wake-up that would have triggered the wake-up process in the prior art is thus screened out, thereby reducing the probability of false wake-up;

Even if a false wake-up caused by the TV sound occurs before the confidence of false wake-up orientation 1 falls below the threshold a of 0.5 (for example, the smart speaker picks up the TV sound to obtain sound m, calculates a wake-up word confidence of 0.65, executes steps 503 to 505 in sequence, determines that the confidence of false wake-up orientation 1, such as 0.55, is not less than the threshold a of 0.5, and executes the wake-up process), the wake-up orientation set and the voiceprint set are updated at that time, so that the wake-up orientation set also includes sound source orientation 1. However, during the above use, the smart speaker also updates the wake-up orientation set and the voiceprint set according to the sounds with which the user wakes up the smart speaker, so that the confidence of the user's wake-up voiceprint gradually increases. When the smart speaker subsequently obtains, from the sound source orientation where the TV is located, a sound with a wake-up word confidence between 0.4 and 0.7, it can further judge whether to execute the wake-up process according to the voiceprint set, so as to screen out possible false wake-ups and reduce the probability of false wake-up.
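The arithmetic behind the TV example can be checked with a small sketch. The application does not state by how much the confidence is reduced on each rejected sound; the fixed decrement of 0.05 used here is purely an assumption, chosen so that the confidence passes through the example value 0.45. `Decimal` is used so that the repeated subtraction is exact.

```python
from decimal import Decimal
from typing import Tuple


def rejections_until_suppressed(initial_conf: str = "0.8",
                                threshold_a: str = "0.5",
                                step: str = "0.05") -> Tuple[int, Decimal]:
    """Count the rejected sounds needed before the confidence of a false
    wake-up orientation drops below threshold a.

    Returns the count and the confidence reached.
    """
    conf, a, delta = Decimal(initial_conf), Decimal(threshold_a), Decimal(step)
    if delta <= 0:
        raise ValueError("step must be positive")
    count = 0
    while conf >= a:
        conf -= delta
        count += 1
    return count, conf
```

Under these assumed values, any TV sound with a wake-up word confidence between 0.4 and 0.7 that arrives after the returned number of rejections would no longer wake up the smart speaker in step 507.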

In the wake-up method of the embodiment of the present application shown in FIG. 5, after the wake-up word confidence of the sound is calculated based on the wake-up word model, the wake-up orientation set, the false wake-up orientation set and the voiceprint set are further combined to determine whether to wake up the smart speaker. In this way, when the judgment result output by the wake-up word model is to wake up the smart speaker, possible false wake-ups are further screened out, thereby reducing the probability of false wake-up of the smart speaker and improving the user experience.

Alternatively, the wake-up location set, false wake-up location set, and voiceprint set may also be non-preset, but created along with machine learning; after creation, continue to enrich and adjust based on machine learning.

Unlike FIG. 5, which takes a smart speaker preset with a wake-up orientation set, a false wake-up orientation set and a voiceprint set as an example, the embodiment shown in FIG. 6 takes a smart speaker preset with a false wake-up orientation set and a voiceprint set as an example. In this case, steps 506 to 511 are replaced by the following steps 601 to 605, specifically:

Step 601: the smart speaker determines whether the set of mis-awakened orientations includes the sound source orientation of the sound; if it does, go to step 602; if not, go to step 603;

For the implementation of this step, reference may be made to the above-mentioned related judgment method in the update of the false-awakened orientation set, which will not be repeated here.

Step 602: the smart speaker judges whether to execute the wake-up process according to the confidence of the false wake-up orientation corresponding to the sound source orientation; if the judgment result is yes, the wake-up process is executed, and step 605 is executed; if the judgment result is no, the wake-up process is not executed , and update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

For the implementation of this step, please refer to the description in step 507, which is not repeated here.

Step 603: The smart speaker extracts the user's voiceprint of the sound, and determines whether the wake-up voiceprints of the voiceprint set include the extracted user's voiceprint; if so, go to step 604; if not, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

Step 604: The smart speaker judges whether to execute the wake-up process according to the confidence of the wake-up voiceprint corresponding to the user's voiceprint; if the judgment result is yes, execute the wake-up process, and execute step 605; if the judgment result is no, do not execute the wake-up process, update the false wake-up orientation set according to the sound source orientation of the sound, and this branch process ends.

Step 605: The smart speaker updates the voiceprint set according to the user's voiceprint of the voice.

For the implementation of this step, please refer to the description in step 511, which is not repeated here.
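Steps 601 to 605 can be sketched as a function that returns both the decision and the set to be updated afterwards. As an assumption of this illustration, `None` means that the corresponding set has no entry matching the sound, and the returned set names are hypothetical labels.

```python
from typing import Optional, Tuple


def fig6_screen(false_conf: Optional[float],
                voiceprint_conf: Optional[float],
                threshold_a: float,
                threshold_c: float) -> Tuple[bool, str]:
    """Return (wake, set_to_update) following steps 601 to 605."""
    if false_conf is not None:            # step 601 -> step 602
        wake = false_conf >= threshold_a
    elif voiceprint_conf is not None:     # step 603 -> step 604
        wake = voiceprint_conf >= threshold_c
    else:                                 # step 603: voiceprint not in the set
        wake = False
    # Step 605 on wake-up; otherwise the false wake-up orientation set is updated.
    return wake, ("voiceprint_set" if wake else "false_wake_orientation_set")
```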

The wake-up method of the embodiment of the present application shown in FIG. 6 calculates the wake-up word confidence of the sound based on the wake-up word model, and further combines the false wake-up orientation set and the voiceprint set to determine whether to wake up the smart speaker. In this way, when the judgment result output by the wake-up word model is to wake up the smart speaker, possible false wake-ups are further screened out, thereby reducing the probability of false wake-up of the smart speaker and improving the user experience.

Unlike the wake-up method shown in FIG. 6, which takes a smart speaker preset with a false wake-up orientation set and a voiceprint set as an example, the wake-up method shown in FIG. 7 takes a smart speaker preset with a wake-up orientation set and a voiceprint set as an example. The main differences from FIG. 6 are: the false wake-up orientation set is replaced by the wake-up orientation set; the step of updating the false wake-up orientation set is omitted; and in step 705, the wake-up orientation set is updated according to the sound source orientation of the sound, and the voiceprint set is updated according to the user voiceprint of the sound.

The implementation of judging whether to execute the wake-up process according to the confidence of the wake-up orientation in step 702 may refer to the description in step 508, which will not be repeated here.

When step 705 is performed for the first time, updating the first orientation set and the first voiceprint set includes: creating the first orientation set and the first voiceprint set; incorporating the orientation from which the electronic device was woken up into the first orientation set, and assigning an initial first orientation confidence to the orientation incorporated into the first orientation set; and incorporating the voiceprint that woke up the electronic device into the first voiceprint set, and assigning an initial first voiceprint confidence to the voiceprint incorporated into the first voiceprint set.

When step 705 is performed subsequently, updating the first orientation set and the first voiceprint set includes at least one of the following: creating a new first orientation in the first orientation set, and assigning an initial first orientation confidence to the newly created first orientation; creating a new first voiceprint in the first voiceprint set, and assigning an initial first voiceprint confidence to the newly created first voiceprint; for a matched existing first orientation in the first orientation set, increasing the first orientation confidence corresponding to the existing first orientation; for a matched existing first voiceprint in the first voiceprint set, increasing the first voiceprint confidence corresponding to the existing first voiceprint.
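The create-then-update behaviour of step 705 can be sketched as follows. For brevity this illustration matches entries by exact key (for example a quantized azimuth bucket and a speaker identifier) rather than by a tolerance; the initial confidences, the increments, and all names are assumptions and are not specified by the present application.

```python
def bump(entries: dict, key, initial, step) -> None:
    """Add key with an initial confidence, or raise an existing key's confidence."""
    if key in entries:
        entries[key] += step
    else:
        entries[key] = initial


def step_705(state: dict, azimuth_bucket: int, voiceprint_id: str,
             initial_orientation_conf: float = 0.5,
             orientation_step: float = 0.1,
             initial_voiceprint_conf: int = 1,
             voiceprint_step: int = 1) -> dict:
    """Update the first orientation set and the first voiceprint set after a wake-up."""
    # On the first execution the sets do not exist yet, so they are created here.
    orientations = state.setdefault("first_orientation_set", {})
    voiceprints = state.setdefault("first_voiceprint_set", {})
    bump(orientations, azimuth_bucket, initial_orientation_conf, orientation_step)
    bump(voiceprints, voiceprint_id, initial_voiceprint_conf, voiceprint_step)
    return state
```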

Although step 705 in FIG. 7 is used as an example above, the updates of the false wake-up orientation set, the voiceprint set, etc. in steps 602 to 605 in FIG. 6, and the updates of the false wake-up orientation set, the voiceprint set, etc. in steps 507 to 511 in FIG. 5, are similar and are not described one by one here.

The wake-up method of the embodiment of the present application shown in FIG. 7 calculates the wake-up word confidence of the sound based on the wake-up word model, and further combines the wake-up orientation set and the voiceprint set to determine whether to wake up the smart speaker. In this way, when the judgment result output by the wake-up word model is to wake up the smart speaker, possible false wake-ups are further screened out, thereby reducing the probability of false wake-up of the smart speaker and improving the user experience.

FIG. 8 is a schematic flowchart of another embodiment of the wake-up method provided by the embodiment of the present application. The method can be applied to electronic devices such as the above-mentioned smart speakers. The method can include:

Step 801: Receive the sound, and calculate the wake-up word confidence level of the sound; the wake-up word confidence level is used to describe the probability that the sound includes the wake-up word sound;

Step 802: If the wake-up word confidence is greater than or equal to the first threshold, calculate the sound source orientation of the sound;

Step 803: Determine whether the sound source orientation is in the first orientation set or the second orientation set; wherein, the first orientation set includes several first orientations, and the first orientation is used to record the sound source orientation of the sound that does not wake up the smart speaker ; The second orientation set includes several second orientations, and the second orientation is used to record the sound source orientation of the sound that wakes up the smart speaker;

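Step 804: If the sound source orientation is only in the first orientation set, determine whether to wake up the smart speaker according to the confidence of the first orientation corresponding to the sound source orientation; the confidence of the first orientation is used to describe the probability that a sound from the first orientation includes the wake-up word sound;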

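Step 805: If the sound source orientation is only in the second orientation set, determine whether to wake up the smart speaker according to the confidence of the second orientation corresponding to the sound source orientation; the confidence of the second orientation is used to describe the probability that a sound from the second orientation includes the wake-up word sound.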

The wake-up word confidence here may correspond to the wake-up word confidence in the embodiment shown in FIG. 5, the first orientation may correspond to the false wake-up orientation, and the second orientation may correspond to the wake-up orientation.

In a possible implementation, it can also include:

If the sound source azimuth is in the first azimuth set and the second azimuth set, or the sound source azimuth is not in the first azimuth set and the second azimuth set, extract the user voiceprint according to the sound;

Determine whether the first voiceprint set includes a user voiceprint; the first voiceprint set includes a first voiceprint, and the first voiceprint is used to record the user voiceprint of the sound that wakes up the electronic device;

Whether to wake up the electronic device is determined according to the confidence of the first voiceprint corresponding to the user's voiceprint.

In a possible implementation manner, before calculating the sound source azimuth of the sound, the method may further include: judging that the confidence level of the wake-up word is less than a second threshold; and the second threshold is greater than the first threshold.

In a possible implementation manner, judging whether to wake up the electronic device according to the confidence of the first azimuth corresponding to the sound source azimuth may include:

Determine whether the confidence of the first azimuth corresponding to the sound source azimuth is less than the threshold a;

If it is less than the threshold value a, the judgment result is not to wake up the electronic device;

If it is not less than the threshold value a, the judgment result is to wake up the electronic device.

In a possible implementation manner, judging whether to wake up the electronic device according to the confidence of the second orientation corresponding to the sound source orientation may include:

Determine whether the confidence of the second azimuth corresponding to the sound source azimuth is less than the threshold b;

If it is less than the threshold b, the judgment result is not to wake up the electronic device;

If it is not less than the threshold value b, the judgment result is to wake up the electronic device.

In a possible implementation manner, judging whether to wake up the electronic device according to the confidence of the first voiceprint corresponding to the user's voiceprint may include:

Determine whether the confidence level of the first voiceprint corresponding to the user's voiceprint is less than a threshold c;

If it is less than the threshold value c, the judgment result is not to wake up the electronic device;

If it is not less than the threshold value c, the judgment result is to wake up the electronic device.

In a possible implementation, it can also include:

If the judgment result is to wake up the electronic device, and the second orientation set includes the sound source orientation of the sound, improve the confidence of the second orientation corresponding to the sound source orientation;

If the determination result is to wake up the electronic device and the sound source orientation of the sound is not included in the second orientation set, the sound source orientation is stored as the second orientation in the second orientation set, and an initial confidence level is set for the second orientation.

In a possible implementation, it can also include:

If the judgment result is to wake up the electronic device, and the user voiceprint of the voice is included in the first voiceprint set, improve the confidence of the first voiceprint corresponding to the user voiceprint;

If the judgment result is to wake up the electronic device, and the first voiceprint set does not include the user voiceprint of the voice, store the user voiceprint as the first voiceprint in the first voiceprint set, and set the initial voiceprint for the first voiceprint Confidence.

In a possible implementation, it can also include:

If the determination result is that the electronic device is not to be woken up, and the sound source orientation of the sound is included in the first orientation set, reducing the confidence level of the first orientation including the sound source orientation;

If the determination result is that the electronic device is not to be woken up and the sound source orientation of the sound is not included in the first orientation set, the sound source orientation is stored as the first orientation in the first orientation set, and an initial confidence level is set for the first orientation.
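Putting the judgment and the update rules of FIG. 8 together, a compact end-to-end sketch might look as follows. It makes several simplifying assumptions that are not part of the present application: orientations and voiceprints are matched by exact key, the orientation is supplied even for sounds below the first threshold so that the first orientation set can be updated, and the initial confidences and step sizes are arbitrary. The threshold defaults reuse the example values given for FIG. 5.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass
class Device:
    first_set: Dict[Hashable, float] = field(default_factory=dict)    # non-waking orientations
    second_set: Dict[Hashable, float] = field(default_factory=dict)   # waking orientations
    voiceprints: Dict[Hashable, float] = field(default_factory=dict)  # waking voiceprints
    t1: float = 0.4   # first threshold
    t2: float = 0.7   # second threshold
    a: float = 0.5
    b: float = 0.6
    c: float = 5
    init_orientation: float = 0.8   # assumed initial confidence
    init_voiceprint: float = 1      # assumed initial confidence
    step_orientation: float = 0.05  # assumed step
    step_voiceprint: float = 1      # assumed step

    def decide(self, wake_word_conf: float, azimuth, voiceprint) -> bool:
        if wake_word_conf < self.t1:
            return False
        if wake_word_conf >= self.t2:
            return True
        in_first = azimuth in self.first_set
        in_second = azimuth in self.second_set
        if in_first and not in_second:                 # step 804
            return self.first_set[azimuth] >= self.a
        if in_second and not in_first:                 # step 805
            return self.second_set[azimuth] >= self.b
        # In both sets or in neither: fall back to the voiceprint.
        return self.voiceprints.get(voiceprint, float("-inf")) >= self.c

    def handle(self, wake_word_conf: float, azimuth, voiceprint) -> bool:
        wake = self.decide(wake_word_conf, azimuth, voiceprint)
        if wake:
            self._raise(self.second_set, azimuth, self.init_orientation, self.step_orientation)
            self._raise(self.voiceprints, voiceprint, self.init_voiceprint, self.step_voiceprint)
        elif azimuth in self.first_set:
            self.first_set[azimuth] = max(0.0, self.first_set[azimuth] - self.step_orientation)
        else:
            self.first_set[azimuth] = self.init_orientation
        return wake

    @staticmethod
    def _raise(entries: dict, key, initial: float, step: float) -> None:
        entries[key] = entries[key] + step if key in entries else initial
```

For instance, after `Device().handle(0.1, "tv", "tv-voice")` has been called eight times, the entry for `"tv"` in `first_set` has dropped below `a`, so a later sound from `"tv"` with a wake-up word confidence of 0.55 is rejected.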

For the specific implementation of FIG. 8 , reference may be made to the embodiment shown in FIG. 5 , which will not be repeated here.

FIG. 9 is a flowchart of another embodiment of the wake-up method of the present application. The method can be applied to electronic devices such as the above-mentioned smart speakers. The method can include:

Step 901: Receive the sound, and calculate the wake-up word confidence level of the sound; the wake-up word confidence level is used to describe the probability that the sound includes the wake-up word sound;

Step 902: If the wake-up word confidence is greater than or equal to the first threshold, calculate the sound source orientation of the sound;

Step 903: Determine whether the sound source orientation is in the first orientation set; wherein the first orientation set includes the first orientation, and the first orientation is used to record the sound source orientation of the sound that does not wake up the electronic device;

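Step 904: If the sound source orientation is in the first orientation set, determine whether to wake up the electronic device according to the confidence of the first orientation corresponding to the sound source orientation; the confidence of the first orientation is used to describe the probability that the voice to wake up the electronic device is issued at the first orientation.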

In a possible implementation, it can also include:

If the sound source azimuth is not in the first azimuth set, extract the user's voiceprint according to the sound;

In a possible implementation manner, before calculating the sound source orientation of the sound, the method may further include:

Determining that the wake-up word confidence is less than a second threshold, where the second threshold is greater than the first threshold.

In a possible implementation manner, judging whether to wake up the electronic device according to the confidence of the first azimuth corresponding to the sound source azimuth includes:

In a possible implementation, it can also include:

For the specific implementation of FIG. 9 , reference may be made to the embodiment shown in FIG. 6 , which will not be repeated here.
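
The flow of FIG. 9, including the optional second-threshold check, might look like the sketch below. The function name `fig9_decision`, the `Outcome` enum, the callables `locate` and `first_set_lookup`, and every numeric threshold are hypothetical. Treating a wake-up word confidence below the first threshold as "do not wake" is also an assumption of this sketch, as is waking directly once the second threshold is reached (the latter follows the wording of the claims).

```python
from enum import Enum


class Outcome(Enum):
    WAKE = "wake"
    NO_WAKE = "no_wake"
    CHECK_VOICEPRINT = "check_voiceprint"


# Illustrative values only; the application does not fix these numbers.
FIRST_THRESHOLD = 0.5
SECOND_THRESHOLD = 0.9   # must be greater than FIRST_THRESHOLD
THRESHOLD_A = 0.5        # threshold applied to the first orientation confidence


def fig9_decision(wake_word_confidence, locate, first_set_lookup):
    """Decide the outcome for one received sound.

    locate: callable returning the sound source orientation (step 902).
    first_set_lookup: callable mapping an orientation to the confidence of the
        matching first orientation, or None when the set has no match (step 903).
    """
    if wake_word_confidence < FIRST_THRESHOLD:
        return Outcome.NO_WAKE   # assumption: too unlikely to be the wake-up word
    if wake_word_confidence >= SECOND_THRESHOLD:
        return Outcome.WAKE      # confident enough, orientation is not needed
    orientation = locate()       # computed only when the confidence is in between
    confidence = first_set_lookup(orientation)
    if confidence is None:
        return Outcome.CHECK_VOICEPRINT   # fall back to the user voiceprint
    # Step 904: compare the first orientation confidence with threshold a.
    return Outcome.WAKE if confidence >= THRESHOLD_A else Outcome.NO_WAKE
```

Passing `locate` as a callable keeps the sound source localization out of the path taken by sounds that are rejected or accepted on the wake-up word confidence alone.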

FIG. 10 is a flowchart of another embodiment of a wake-up method provided by an embodiment of the present application. The method can be applied to electronic devices such as the above-mentioned smart speakers. The method can include:

Step 1001: Receive the sound, calculate the wake-up word confidence level of the sound; the wake-up word confidence level is used to describe the probability that the sound includes the wake-up word sound;

Step 1002: If the wake-up word confidence is greater than or equal to the first threshold, calculate the sound source orientation of the sound;

Step 1003: determine whether the sound source orientation is in the second orientation set; wherein, the second orientation set includes the second orientation, and the second orientation is used to record the sound source orientation of the sound that wakes up the electronic device;

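Step 1004: If the sound source orientation is in the second orientation set, determine whether to wake up the electronic device according to the confidence of the second orientation corresponding to the sound source orientation; the confidence of the second orientation is used to describe the probability that the voice to wake up the electronic device is issued at the second orientation.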

In a possible implementation, it can also include:

If the sound source azimuth is not in the second azimuth set, extract the user's voiceprint according to the sound;

In a possible implementation, it can also include:

If the judgment result is to wake up the electronic device, and the sound source orientation of the sound is included in the second orientation set, improve the confidence of the second orientation corresponding to the sound source orientation;

In a possible implementation, it can also include:

If the judgment result is to wake up the electronic device, and the first voiceprint set includes the user's voiceprint of the voice, improve the confidence level of the first voiceprint corresponding to the user's voiceprint;

For the specific implementation of FIG. 10 , reference may be made to the embodiment shown in FIG. 7 , which will not be repeated here.

It can be understood that some or all of the steps or operations in the foregoing embodiments are merely examples, and other operations or variations of the operations may also be performed in the embodiments of the present application. Furthermore, the steps may be performed in an order different from that presented in the above-described embodiments, and not all of the operations in the above-described embodiments need to be performed.

FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 11 , the electronic device 1100 may include: a calculation unit 1110 and a judgment unit 1120 .

In one embodiment:

The calculation unit 1110 is used to receive the sound and calculate the wake-up word confidence of the sound, where the wake-up word confidence is used to describe the probability that the sound includes the wake-up word; and, if the wake-up word confidence is greater than or equal to the first threshold, to calculate the sound source orientation of the sound;

The judgment unit 1120 is used to judge whether the sound source orientation is in the first orientation set or the second orientation set, where the first orientation set includes a first orientation, which records the sound source orientation of a sound that did not wake up the electronic device, and the second orientation set includes a second orientation, which records the sound source orientation of a sound that woke up the electronic device; if the sound source orientation is only in the first orientation set, to judge whether to wake up the electronic device according to the confidence of the first orientation corresponding to the sound source orientation, where the confidence of the first orientation is used to describe the probability that the voice to wake up the electronic device is issued at the first orientation; and if the sound source orientation is only in the second orientation set, to judge whether to wake up the electronic device according to the confidence of the second orientation corresponding to the sound source orientation, where the confidence of the second orientation is used to describe the probability that the voice to wake up the electronic device is issued at the second orientation.

In a possible implementation manner, the judgment unit 1120 may also be configured to: if the sound source orientation is in both the first orientation set and the second orientation set, or in neither of them, extract the user voiceprint from the sound; judge whether the first voiceprint set includes the user voiceprint, where the first voiceprint set includes a first voiceprint, which records the user voiceprint of a sound that woke up the electronic device; and judge whether to wake up the electronic device according to the confidence level of the first voiceprint corresponding to the user voiceprint.
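
The choice of which evidence the judgment unit relies on can be summarized in a small routing function. This is only a sketch of the case distinction described above; the function name `select_evidence` and the string labels are hypothetical.

```python
def select_evidence(in_first_set: bool, in_second_set: bool) -> str:
    """Pick the evidence used to decide whether to wake up the device."""
    if in_first_set and not in_second_set:
        return "first_orientation"    # only in the first orientation set
    if in_second_set and not in_first_set:
        return "second_orientation"   # only in the second orientation set
    # In both sets or in neither: the orientation is ambiguous or unknown,
    # so the user voiceprint is extracted and checked instead.
    return "voiceprint"
```

For example, `select_evidence(True, True)` and `select_evidence(False, False)` both return `"voiceprint"`.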

In a possible implementation manner, the judgment unit 1120 may also be configured to: before the sound source orientation of the sound is calculated, judge that the wake-up word confidence is less than a second threshold, where the second threshold is greater than the first threshold.

In a possible implementation manner, the judgment unit 1120 may be specifically configured to: judge whether the confidence of the first orientation corresponding to the sound source orientation is less than threshold a; if it is less than threshold a, the judgment result is not to wake up the electronic device; if it is not less than threshold a, the judgment result is to wake up the electronic device.

In a possible implementation manner, the judgment unit 1120 may be specifically configured to: judge whether the confidence level of the second orientation corresponding to the sound source orientation is less than threshold b; if it is less than threshold b, the judgment result is not to wake up the electronic device; if it is not less than threshold b, the judgment result is to wake up the electronic device.

In a possible implementation manner, the judgment unit 1120 may be specifically configured to: judge whether the confidence level of the first voiceprint corresponding to the user voiceprint is less than threshold c; if it is less than threshold c, the judgment result is not to wake up the electronic device; if it is not less than threshold c, the judgment result is to wake up the electronic device.
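
A compact way to express the three comparisons is shown below, assuming that all of them follow the pattern stated for threshold a (below the threshold: do not wake; otherwise: wake). The dictionary keys, the function name `judge` and the numeric values are hypothetical.

```python
# Illustrative values for thresholds a, b and c; the text does not fix them.
THRESHOLDS = {
    "first_orientation": 0.5,    # threshold a
    "second_orientation": 0.5,   # threshold b
    "voiceprint": 0.6,           # threshold c
}


def judge(kind: str, confidence: float) -> bool:
    """Return True when the judgment result is to wake up the device."""
    # "Not less than the threshold" means the boundary value itself wakes the device.
    return confidence >= THRESHOLDS[kind]
```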

In a possible implementation, the electronic device may further include an update unit, configured to: if the judgment result is to wake up the electronic device and the second orientation set includes the sound source orientation of the sound, increase the confidence of the second orientation corresponding to the sound source orientation; and if the judgment result is to wake up the electronic device and the second orientation set does not include the sound source orientation of the sound, store the sound source orientation as a second orientation in the second orientation set and set an initial confidence level for that second orientation.

In a possible implementation manner, the update unit may also be configured to: if the judgment result is to wake up the electronic device and the first voiceprint set includes the user voiceprint of the sound, increase the confidence of the first voiceprint corresponding to the user voiceprint; and if the judgment result is to wake up the electronic device and the first voiceprint set does not include the user voiceprint of the sound, store the user voiceprint as a first voiceprint in the first voiceprint set and set an initial confidence level for that first voiceprint.

In a possible implementation manner, the update unit may be further configured to: if the judgment result is not to wake up the electronic device and the first orientation set includes the sound source orientation of the sound, reduce the confidence level of the first orientation corresponding to the sound source orientation; and if the judgment result is not to wake up the electronic device and the first orientation set does not include the sound source orientation of the sound, store the sound source orientation as a first orientation in the first orientation set and set an initial confidence level for that first orientation.

In another embodiment:

The calculation unit 1110 is used to receive the sound; calculate the wake-up word confidence of the sound, where the wake-up word confidence is used to describe the probability that the sound includes the wake-up word; and, if the wake-up word confidence is greater than or equal to the first threshold, calculate the sound source orientation of the sound;

The judgment unit 1120 is used to judge whether the sound source orientation is in the first orientation set, where the first orientation set includes a first orientation, which records the sound source orientation of a sound that did not wake up the electronic device; and, if the sound source orientation is in the first orientation set, to judge whether to wake up the electronic device according to the confidence level of the first orientation corresponding to the sound source orientation, where the confidence level of the first orientation is used to describe the probability that the voice to wake up the electronic device is issued at the first orientation.

In a possible implementation manner, the judgment unit 1120 may also be configured to: if the sound source orientation is not in the first orientation set, extract the user voiceprint from the sound; judge whether the first voiceprint set includes the user voiceprint, where the first voiceprint set includes a first voiceprint, which records the user voiceprint of a sound that woke up the electronic device; and judge whether to wake up the electronic device according to the confidence of the first voiceprint corresponding to the user voiceprint.

In a possible implementation, it can also include:

The update unit is used to: if the judgment result is to wake up the electronic device and the first voiceprint set includes the user voiceprint of the sound, increase the confidence of the first voiceprint corresponding to the user voiceprint; and if the judgment result is to wake up the electronic device and the first voiceprint set does not include the user voiceprint of the sound, store the user voiceprint as a first voiceprint in the first voiceprint set and set an initial confidence level for that first voiceprint.

In yet another embodiment:

The judgment unit 1120 is used to judge whether the sound source orientation is in the second orientation set, where the second orientation set includes a second orientation, which records the sound source orientation of a sound that woke up the electronic device; and, if the sound source orientation is in the second orientation set, to judge whether to wake up the electronic device according to the confidence of the second orientation corresponding to the sound source orientation, where the confidence of the second orientation is used to describe the probability that the voice to wake up the electronic device is issued at the second orientation.

In a possible implementation manner, the judgment unit 1120 may also be configured to: if the sound source orientation is not in the second orientation set, extract the user voiceprint from the sound; judge whether the first voiceprint set includes the user voiceprint, where the first voiceprint set includes a first voiceprint, which records the user voiceprint of a sound that woke up the electronic device; and judge whether to wake up the electronic device according to the confidence of the first voiceprint corresponding to the user voiceprint.

In a possible implementation, it can also include:

The update unit is used to: if the judgment result is to wake up the electronic device and the second orientation set includes the sound source orientation of the sound, increase the confidence of the second orientation corresponding to the sound source orientation; and if the judgment result is to wake up the electronic device and the second orientation set does not include the sound source orientation of the sound, store the sound source orientation as a second orientation in the second orientation set and set an initial confidence level for that second orientation.

The electronic device provided by the embodiment shown in FIG. 11 can be used to implement the technical solutions of the method embodiments shown in FIG. 5 to FIG. 7 of the present application. For the implementation principle and technical effect, reference may be made to the related descriptions in the method embodiments.

It should be understood that the division of the apparatus shown in FIG. 11 into units is only a division of logical functions; in actual implementation the units may be fully or partially integrated into one physical entity, or may be physically separated. These units may all be implemented as software invoked by a processing element, may all be implemented in hardware, or some may be implemented as software invoked by a processing element and others in hardware. For example, the acquisition unit may be a separately established processing element, or may be integrated in a certain chip of the electronic device. The implementation of the other units is similar. In addition, all or part of these units can be integrated together, or can be implemented independently. In the implementation process, each step of the above-mentioned method, or each of the above-mentioned units, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.

Embodiments of the present application further provide an electronic device, including: a processor; a memory; and a computer program, where the computer program is stored in the memory and includes instructions that, when executed by the device, cause the device to execute the methods shown in FIG. 5 to FIG. 7 .

Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium and, when run on a computer, causes the computer to execute the methods provided by the embodiments shown in FIG. 5 to FIG. 7 of the present application.

Embodiments of the present application further provide a computer program product, where the computer program product includes a computer program that, when run on a computer, enables the computer to execute the methods provided by the embodiments shown in FIGS. 5 to 7 of the present application.

Those of ordinary skill in the art can realize that the units and algorithm steps described in the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided in this application, if any function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (hereinafter referred to as ROM), a Random Access Memory (hereinafter referred to as RAM), a magnetic disk, an optical disk, or any other medium on which program code can be stored.

The above are only specific embodiments of the present application. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in the present application, which should be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims

A wake-up method, applied to an electronic device comprising a pickup and a speaker, wherein the pickup includes a plurality of microphones, wherein the method includes:

Receive the sound;

Calculate the wake-up word confidence of the sound; the wake-up word confidence is used to represent the probability that the sound includes a wake-up word;

After the wake-up word confidence is greater than or equal to a first threshold, calculate the sound source orientation of the sound;

After the sound source orientation matches a first orientation in the first orientation set, and,

Wake up the electronic device after the first orientation confidence corresponding to the matched first orientation is greater than or equal to the third threshold; or,

Do not wake up the electronic device after the first orientation confidence corresponding to the matched first orientation is less than the third threshold;

The wake-up word is used to wake up the electronic device; the sound source orientation is the direction and position of the sound source relative to the electronic device; the first orientation set includes M first orientation elements, and each first orientation element includes a first orientation and a first orientation confidence; the first orientation is the direction and position, relative to the electronic device, of a sound source that woke up the electronic device, and is used to indicate that the electronic device has been woken up from the first orientation; the first orientation confidence is used to represent the probability of waking up the electronic device from the first orientation; M is a positive integer greater than or equal to 1.
The method of claim 1, wherein:

After the wake-up word confidence is greater than or equal to the first threshold, calculating the sound source orientation of the sound; including:

After the wake-up word confidence is greater than or equal to the first threshold and less than the second threshold, calculating the sound source orientation of the sound.
The method according to claim 1 or 2, wherein the sound source orientation matching a first orientation in the first orientation set comprises:

The angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of a first orientation in the first orientation set relative to the electronic device is within a preset fourth threshold; and,

The position deviation between the position of the sound source orientation relative to the electronic device and the position of that first orientation relative to the electronic device is within a preset fifth threshold.
The method according to any one of claims 1-3, wherein the method further comprises:

After the sound source azimuth does not match any first azimuth in the first azimuth set, then

extracting a voiceprint from the voice;

After the voiceprint is matched with a first voiceprint in the first set of voiceprints, and,

Wake up the electronic device after the confidence level of the first voiceprint corresponding to the first voiceprint is greater than or equal to a preset sixth threshold; or,

Do not wake up the electronic device after the confidence level of the first voiceprint corresponding to the first voiceprint is less than a preset sixth threshold;

Wherein, the first voiceprint set includes L voiceprint elements, each voiceprint element includes a first voiceprint and a first voiceprint confidence, the first voiceprint is used to represent the voiceprint of the sound that wakes up the electronic device, and the first voiceprint confidence is used to represent the probability that the first voiceprint wakes up the electronic device; L is a positive integer greater than or equal to 1.
The method according to any one of claims 1-4, wherein after waking up the electronic device, the method further comprises: updating the first orientation set and the first voiceprint set.
The method according to any one of claims 2-5, wherein,

After the wake-up word confidence is greater than or equal to a second threshold, the electronic device is woken up, and the first orientation set and the first voiceprint set are updated.
A wake-up method, applied to an electronic device comprising a pickup and a speaker, wherein the pickup includes a plurality of microphones, wherein the method includes:

Receive the sound;

Calculate the wake-up word confidence of the sound; the wake-up word confidence is used to represent the probability that the sound includes a wake-up word;

After the wake-up word confidence is greater than or equal to a first threshold, calculate the sound source orientation of the sound;

After the sound source orientation matches a second orientation in the second orientation set, and,

After the second orientation confidence corresponding to the matched second orientation is greater than or equal to the seventh threshold, wake up the electronic device; or,

After the second orientation confidence corresponding to the matched second orientation is less than the seventh threshold, do not wake up the electronic device;

The wake-up word is used to wake up the electronic device; the sound source orientation is the direction and position of the sound source relative to the electronic device; the second orientation set includes N second orientation elements, and each second orientation element includes a second orientation and a second orientation confidence; the second orientation is the direction and position, relative to the electronic device, of a sound source that did not wake up the electronic device, and is used to indicate that the electronic device was not woken up from the second orientation; the second orientation confidence is used to represent the probability that the electronic device is not woken up from the second orientation; N is a positive integer greater than or equal to 1.
The method according to claim 7, wherein the sound source orientation matching a second orientation in the second orientation set comprises:

The angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of a second orientation in the second orientation set relative to the electronic device is within a preset eighth threshold; and,

The position deviation between the position of the sound source orientation relative to the electronic device and the position of that second orientation relative to the electronic device is within a preset ninth threshold.
The method according to claim 7 or 8, wherein the method further comprises:

After the sound source azimuth does not match any second azimuth in the second azimuth set, then

extracting a voiceprint from the voice;

After the voiceprint does not match any one of the first voiceprints in the first voiceprint set, update the second orientation set;

Wherein, the first voiceprint set includes L voiceprint elements, each voiceprint element includes a first voiceprint and a first voiceprint confidence, the first voiceprint is used to represent the voiceprint of the sound that wakes up the electronic device, and the first voiceprint confidence is used to represent the probability that the first voiceprint wakes up the electronic device; L is a positive integer greater than or equal to 1.
The method according to claim 7 or 8, wherein the method further comprises:

After the sound source azimuth does not match any second azimuth in the second azimuth set, then

extracting a voiceprint from the voice;

After the voiceprint is matched with a first voiceprint in the first set of voiceprints, and,

After the first voiceprint confidence level corresponding to the first voiceprint is greater than or equal to a preset tenth threshold, wake up the electronic device; or,

After the first voiceprint confidence level corresponding to the first voiceprint is less than a preset tenth threshold, the electronic device is not woken up, and the second orientation set is updated;

Wherein, the first voiceprint set includes L voiceprint elements, each voiceprint element includes a first voiceprint and a first voiceprint confidence, and the first voiceprint confidence is used to represent the first voiceprint The probability that a voiceprint wakes up the electronic device, the first voiceprint is used to represent the voiceprint for waking up the electronic device; L is a positive integer greater than or equal to 1.
The method according to any one of claims 7-10, wherein,

After waking up the electronic device, the method further includes: updating the first voiceprint set;

After not waking up the electronic device, the method further includes updating the second set of orientations.
The method according to any one of claims 9-11, wherein,

After the wake-up word confidence is greater than or equal to a second threshold, the electronic device is woken up, and the first voiceprint set is updated.
A wake-up method, applied to an electronic device comprising a pickup and a speaker, wherein the pickup includes a plurality of microphones, wherein the method includes:

Receive the sound;

Calculate the wake-up word confidence of the sound; the wake-up word confidence is used to represent the probability that the sound includes a wake-up word;

After the wake-up word confidence is greater than or equal to a first threshold, calculate the sound source orientation of the sound;

after the sound source orientation matches one of the second orientations in the second set of orientations, and after the sound source orientation does not match any of the first orientations in the first set of orientations, and

After the second orientation confidence corresponding to the matched second orientation is greater than or equal to the eleventh threshold, wake up the electronic device; or,

After the second orientation confidence corresponding to the matched second orientation is less than the eleventh threshold, do not wake up the electronic device;

The wake-up word is used to wake up the electronic device; the sound source orientation is the direction and position of the sound source relative to the electronic device; the first orientation set includes M first orientation elements, and each first orientation element includes a first orientation and a first orientation confidence; the first orientation is the direction and position, relative to the electronic device, of a sound source that woke up the electronic device, and is used to indicate that the electronic device has been woken up from the first orientation; the first orientation confidence is used to represent the probability of waking up the electronic device from the first orientation; the second orientation set includes N second orientation elements, and each second orientation element includes a second orientation and a second orientation confidence; the second orientation is the direction and position, relative to the electronic device, of a sound source that did not wake up the electronic device, and is used to indicate that the electronic device was not woken up from the second orientation; the second orientation confidence is used to represent the confidence that the electronic device is not woken up from the second orientation; M and N are both positive integers greater than or equal to 1.
The method of claim 13, wherein:

The sound source azimuth is matched with a second azimuth in the second azimuth set; including:

The angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of a second orientation in the second orientation set relative to the electronic device is within a preset twelfth threshold; and,

The position deviation between the position of the sound source orientation relative to the electronic device and the position of that second orientation relative to the electronic device is within a preset thirteenth threshold;

The sound source orientation does not match any first orientation in the first orientation set; including:

The angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of any first orientation in the first orientation set relative to the electronic device is not within the preset fourteenth threshold; and,

The position deviation between the position of the sound source orientation relative to the electronic device and the position of any first orientation in the first orientation set relative to the electronic device is not within the preset fifteenth threshold.
The method of claim 13, wherein the method further comprises:

after the sound source orientation matches one of the first orientations in the first set of orientations, and after the sound source orientation does not match any second orientation in the second set of orientations, and

After the first orientation confidence corresponding to the matched first orientation is greater than or equal to the sixteenth threshold, wake up the electronic device; or,

After the first orientation confidence corresponding to the matched first orientation is less than the sixteenth threshold, do not wake up the electronic device.
The method of claim 15, wherein:

The sound source azimuth is matched with a first azimuth in the first azimuth set; including:

The angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of a first orientation in the first orientation set relative to the electronic device is within a preset fourteenth threshold; and,

The position deviation between the position of the sound source orientation relative to the electronic device and the position of that first orientation relative to the electronic device is within a preset fifteenth threshold;

The sound source orientation does not match any second orientation in the second orientation set; including:

The angular deviation between the direction of the sound source orientation relative to the electronic device and the direction of any second orientation in the second orientation set relative to the electronic device is not within the preset twelfth threshold; and,

The position deviation between the position of the sound source orientation relative to the electronic device and the position of any second orientation in the second orientation set relative to the electronic device is not within the preset thirteenth threshold.
The method of claim 13, wherein the method further comprises:

if the sound source azimuth does not match any second azimuth in the second azimuth set and does not match any first azimuth in the first azimuth set, then

extracting a voiceprint from the sound;

if the voiceprint matches a first voiceprint in the first voiceprint set, and,

if the first voiceprint confidence corresponding to the first voiceprint is greater than or equal to a preset sixteenth threshold, waking up the electronic device, and updating the first orientation set and the first voiceprint set; or,

if the first voiceprint confidence corresponding to the first voiceprint is less than the preset sixteenth threshold, not waking up the electronic device, and updating the second orientation set;

wherein the first voiceprint set includes L voiceprint elements, each voiceprint element includes a first voiceprint and a first voiceprint confidence, the first voiceprint confidence is used to represent the probability that the first voiceprint wakes up the electronic device, and the first voiceprint is used to represent a voiceprint for waking up the electronic device; L is a positive integer greater than or equal to 1.
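The voiceprint fallback can be sketched as below. The claim does not say how a voiceprint is represented or how two voiceprints are compared, so this example assumes fixed-length feature vectors compared by cosine similarity against a hypothetical `similarity_threshold`; only the comparison of the stored confidence against a threshold comes from the claim. The case where no stored voiceprint matches is reported separately and left to the caller.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VoiceprintElement:
    voiceprint: Sequence[float]  # assumed: a fixed-length feature vector
    confidence: float            # probability that this voiceprint wakes the device


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decide_by_voiceprint(voiceprint: Sequence[float],
                         first_voiceprint_set: Sequence[VoiceprintElement],
                         confidence_threshold: float,
                         similarity_threshold: float = 0.8) -> str:
    """Return "wake", "no_wake", or "no_match" for the extracted voiceprint."""
    best = None
    best_similarity = similarity_threshold
    for element in first_voiceprint_set:
        similarity = cosine_similarity(voiceprint, element.voiceprint)
        if similarity >= best_similarity:  # keep the most similar stored voiceprint
            best, best_similarity = element, similarity
    if best is None:
        return "no_match"
    return "wake" if best.confidence >= confidence_threshold else "no_wake"
```

In this sketch, "wake" would be followed by updating the first orientation set and the first voiceprint set, and "no_wake" by updating the second orientation set; the update rules themselves are not modeled here.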
The method of claim 17, wherein the method further comprises:

if the voiceprint does not match any first voiceprint in the first voiceprint set, updating the second orientation set.
The method according to any one of claims 13-16, wherein the method further comprises:

After waking up the electronic device, updating the first set of orientations;

After not waking up the electronic device, the second set of orientations is updated.
An electronic device comprising a pickup and a speaker, wherein the pickup includes a plurality of microphones, wherein the electronic device further includes:

a processor;

a memory;

and a computer program, wherein the computer program is stored in the memory and, when executed by the processor, causes the electronic device to perform the method of any one of claims 1-19.
A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a computer program which, when run on an electronic device, causes the electronic device to execute the method of any one of claims 1-19, wherein the electronic device includes a pickup and a speaker, the pickup including a plurality of microphones.