WO2020244257A1

WO2020244257A1 - Method and system for voice wake-up, electronic device, and computer-readable storage medium

Info

Publication number: WO2020244257A1
Application number: PCT/CN2020/076473
Authority: WO
Inventors: 李波; 夏波; 詹昌寿
Original assignee: 湖南国声声学科技股份有限公司
Priority date: 2019-06-06
Filing date: 2020-02-24
Publication date: 2020-12-10
Also published as: CN110265036A

Abstract

Provided are a method and system for voice wake-up, an electronic device, and a computer-readable storage medium, relating to the technical field of smart devices. The method comprises: if a voice activity detection process detects that a voice signal collected by a microphone satisfies a first preset criterion, triggering a processor to read a data signal collected by a gravity sensor (S201); determining, on the basis of the data signal, whether the voice signal is generated by the speech of a user wearing an electronic device (S202); if not, ignoring the voice signal (S203); or, if so, activating a keyword recognition process and, if the voice signal is recognized as comprising a preset voice command keyword, controlling the electronic device to execute a corresponding function (S204). The method reduces the power consumption of the electronic device and prevents the electronic device from being woken up by mistake. In addition, it obviates the need for a dedicated bone conduction microphone or other contact microphone, is inexpensive, uses a simple and practical algorithm, is highly accurate, and consumes fewer resources.

Description

Voice wake-up method, system, electronic equipment and computer readable storage medium

Technical field

The invention of this application belongs to the technical field of voice processing, and in particular relates to a voice wake-up method, system, electronic device, and computer-readable storage medium.

Background

With the development of science and technology, many electronic devices now have a voice wake-up function: a wake-up word is preset in the device or its software, and when the user speaks the voice command, the device is woken from the sleep state.

The traditional voice wake-up solution applies voice activity detection (VAD) to the audio signal collected by the microphone and calculates the voice energy of that signal. When the voice energy is greater than a preset threshold, the processor is triggered to start keyword recognition to determine whether the audio signal is a voice command issued by the user. This solution does not consider whether the audio signal collected by the microphone is caused by the wearer's speech, which leads to false wake-ups: when people nearby happen to say a keyword, the device is also woken up. Moreover, in a noisy environment, VAD keeps triggering and causes the digital signal processor to perform keyword recognition, resulting in large power consumption.

In view of the above shortcomings of the traditional voice wake-up solution, existing voice wake-up solutions generally first determine whether the audio signal collected by the microphone is caused by the wearer's speech before triggering the processor to perform keyword recognition. However, the prior art generally uses a dedicated bone conduction microphone or other contact microphone to pick up the audio signal used for this determination, and because bone conduction and other contact microphones are expensive, the overall cost of the device is high. The prior art also uses software algorithms to determine whether the microphone audio signal is caused by the wearer's speech, but these algorithms are generally complex, so the determination itself consumes considerable resources.

In summary, traditional and existing voice wake-up solutions may suffer from false wake-up of the device, large power consumption, high device cost, and complex determination algorithms that cause the determination itself to consume considerable resources.

Technical problem

In view of this, the embodiments of the present invention provide a voice wake-up method, system, electronic device, and computer-readable storage medium to solve the problems of the above voice wake-up solutions: false wake-up of the device, large power consumption, high equipment cost, and complex determination algorithms that cause the determination itself to consume considerable resources.

Technical solutions

A first aspect of the embodiments of the present invention provides a voice wake-up method applied to an electronic device. The electronic device includes a processor, a gravity sensor, and a microphone, and the gravity sensor and the microphone are each electrically connected to the processor. The voice wake-up method includes using the processor to perform the following steps:

If the voice activity detection process detects that the voice signal collected by the microphone meets the first preset condition, trigger the processor to read the data signal collected by the gravity sensor;

Determining, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device;

If the voice signal is not generated by the user's speech, ignore the voice signal; or,

If the voice signal is generated by the user's speech, a keyword recognition process is initiated, and if it is recognized that the voice signal contains a preset voice command keyword, the electronic device is controlled to perform a corresponding function.

A second aspect of the embodiments of the present invention provides a voice wake-up system applied to an electronic device. The electronic device includes a processor, a gravity sensor, and a microphone, and the gravity sensor and the microphone are each electrically connected to the processor. The voice wake-up system includes:

A voice activity detection unit, configured to trigger the processor to read the data signal collected by the gravity sensor if the voice activity detection process detects that the voice signal collected by the microphone meets the first preset condition;

The first determining unit is configured to determine, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device;

The execution unit is configured to ignore the voice signal if the voice signal is not generated by the user's speech; or, if the voice signal is generated by the user's speech, to start a keyword recognition process and, if the voice signal is recognized as containing a preset voice command keyword, control the electronic device to perform a corresponding function.

A third aspect of the embodiments of the present invention provides an electronic device, including a gravity sensor, a microphone, a memory, a processor, and a computer program stored in the memory and executable on the processor. The gravity sensor, the microphone, and the memory are each electrically connected to the processor, and when the processor executes the computer program, it implements the steps of the voice wake-up method according to any one of the embodiments of the first aspect.

A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the voice wake-up method according to any one of the embodiments of the first aspect.

Beneficial effects

According to the voice wake-up method, system, electronic device, and computer-readable storage medium provided by the embodiments of the present invention, when the voice activity detection process detects that the voice signal collected by the microphone satisfies the first preset condition, it is further determined whether the voice signal is generated by the speech of a user wearing the electronic device, and the keyword recognition process is started only when it is, thereby saving energy in the electronic device and avoiding false wake-up. In addition, because the data signal collected by the gravity sensor of the electronic device is used for this determination, no dedicated bone conduction microphone or other contact microphone is needed; the cost is low, and the algorithm is simple, practical, accurate, and light on resources.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a structural block diagram of an electronic device provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a specific implementation of the voice wake-up method provided by Embodiment 1 of the present invention;

FIG. 3 is a schematic flowchart of a specific implementation of the voice wake-up method provided by Embodiment 2 of the present invention;

FIG. 4 is a schematic structural diagram of a voice wake-up system provided by Embodiment 3 of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device provided by Embodiment 4 of the present invention.

Embodiments of the invention

In the following description, specific details such as particular system structures and techniques are set out for illustration rather than limitation, to provide a thorough understanding of the embodiments of the present invention. However, it should be clear to those skilled in the art that the present invention can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.

In order to illustrate the technical solution of the present invention, specific embodiments are used for description below.

Fig. 1 is a structural block diagram of an electronic device provided by an embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown.

As shown in FIG. 1, an electronic device 100 provided by an embodiment of the present invention includes a processor 103, a gravity sensor 101, and a microphone 102, and the gravity sensor 101 and the microphone 102 are electrically connected to the processor 103, respectively.

The electronic device 100 includes, but is not limited to, smart wearable devices such as earphones. The microphone 102 is an ordinary, low-cost microphone built into the electronic device 100. The gravity sensor 101 is the sensor the electronic device 100 uses to detect whether it is being worn and to implement the double-click function.

Based on the structure of the above electronic device 100, the following embodiments of the present invention are proposed.

Embodiment 1

FIG. 2 is a schematic diagram of a specific implementation flow of the voice wake-up method provided in the first embodiment of the present invention. The method is applied to the electronic device 100 shown in FIG. 1, and its execution body is the processor 103 in the electronic device 100 shown in FIG. 1. As shown in FIG. 2, the voice wake-up method provided in this embodiment may include the following steps:

In step S201, if the voice activity detection process detects that the voice signal collected by the microphone 102 meets the first preset condition, the processor 103 is triggered to read the data signal collected by the gravity sensor 101.

In a specific implementation, the first preset condition is that the voice energy of the voice signal is greater than a second preset energy threshold. The voice activity detection process and the microphone 102 remain on while the electronic device 100 is in the standby state. When the microphone 102 collects a voice signal, the signal is passed to the voice activity detection process, which computes its voice energy; when the voice energy exceeds the second preset energy threshold, the processor 103 is triggered to read the data signal collected by the gravity sensor 101.
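As a minimal sketch of the step S201 gate, assume the voice energy is the mean-square value of one microphone frame (the patent does not define the energy measure) and that `read_gravity_sensor` is a hypothetical callback standing in for the processor reading the gravity sensor. The point it illustrates is that the sensor is read only when the frame clears the second preset energy threshold.

```python
from typing import Callable, Optional, Sequence


def frame_energy(samples: Sequence[float]) -> float:
    """Mean-square energy of one audio frame (an assumed energy measure)."""
    if not samples:
        return 0.0
    return sum(x * x for x in samples) / len(samples)


def vad_gate(
    mic_frame: Sequence[float],
    second_energy_threshold: float,
    read_gravity_sensor: Callable[[], Sequence[float]],
) -> Optional[Sequence[float]]:
    """Step S201 sketch: read the gravity sensor only for loud enough frames.

    Returns the sensor data when the first preset condition holds, else None,
    so quiet frames never reach the sensor-reading path.
    """
    if frame_energy(mic_frame) > second_energy_threshold:
        return read_gravity_sensor()
    return None
```

In a real device the threshold would be tuned per product; the value used when calling `vad_gate` is a placeholder.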

In this embodiment, because the processor 103 is triggered to read the data signal collected by the gravity sensor 101 only when the voice energy of the detected voice signal exceeds the second preset energy threshold, the processor 103 is not repeatedly triggered in a noisy environment to determine whether the voice signal is generated by the speech of the user wearing the electronic device 100, which further reduces the power consumption of the device.

Step S202: Determine, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device 100. If it is not, go to step S203; if it is, go to step S204.

In this embodiment, the electronic device 100 is worn on the user's head. The vibration frequency and amplitude of the upper and lower jaw bones differ depending on whether the user is speaking, so the data signal collected by the gravity sensor 101 differs as well. Therefore, by analyzing the data signal collected by the gravity sensor 101, it can be determined whether the voice signal is generated by the speech of the user wearing the electronic device 100.

In a specific implementation, determining, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device 100 includes:

Performing a time-frequency transform on the data signal and extracting the components whose frequencies lie within a preset frequency range;

Computing, in the frequency domain, the signal energy of the components within the preset frequency range;

Determining whether the signal energy is greater than a first preset energy threshold;

If the signal energy is greater than the first preset energy threshold, determining that the voice signal is generated by the speech of the user wearing the electronic device 100;

If the signal energy is less than or equal to the first preset energy threshold, determining that the voice signal is not generated by the speech of the user wearing the electronic device 100.
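A minimal sketch of the frequency-domain check above, under stated assumptions: the gravity sensor delivers one axis of samples at a rate `fs` high enough to cover the 300 to 3000 Hz band given later in this section (that is, above 6 kHz by the Nyquist criterion, for example 8 kHz), the "time-frequency transform" is a DFT of one frame, and `first_energy_threshold` is a placeholder because the patent obtains it by training. To stay dependency-free, the DFT is evaluated directly only at the in-band bins rather than through an FFT library.

```python
import math
from typing import Sequence


def band_power_dft(samples: Sequence[float], fs: float,
                   f_lo: float = 300.0, f_hi: float = 3000.0) -> float:
    """Mean-square power of `samples` inside [f_lo, f_hi] Hz.

    Evaluates a direct DFT only at the in-band bins (DC and Nyquist are
    excluded) and applies one-sided Parseval scaling, so a sine of
    amplitude A sitting exactly on an in-band bin yields A**2 / 2.
    """
    n = len(samples)
    if n == 0:
        return 0.0
    bin_hz = fs / n
    k_lo = max(1, math.ceil(f_lo / bin_hz))
    k_hi = min((n - 1) // 2, math.floor(f_hi / bin_hz))  # stay below Nyquist
    total = 0.0
    for k in range(k_lo, k_hi + 1):
        re = im = 0.0
        for i, x in enumerate(samples):
            angle = 2.0 * math.pi * k * i / n
            re += x * math.cos(angle)
            im -= x * math.sin(angle)
        total += re * re + im * im
    return 2.0 * total / (n * n)


def wearer_is_speaking_freq(accel_frame: Sequence[float], fs: float,
                            first_energy_threshold: float) -> bool:
    """Frequency-domain decision: in-band energy above the trained threshold."""
    return band_power_dft(accel_frame, fs) > first_energy_threshold
```

A production implementation would more likely use an FFT; the direct in-band DFT is chosen here only so the sketch needs nothing beyond the standard library.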

In another specific implementation, determining, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device 100 includes:

Performing band-pass filtering on the data signal to extract the components whose frequencies lie within the preset frequency range;

Computing, in the time domain, the signal energy of the components within the preset frequency range;

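If the signal energy is greater than the first preset energy threshold, determining that the voice signal is generated by the speech of the user wearing the electronic device 100;

If the signal energy is less than or equal to the first preset energy threshold, determining that the voice signal is not generated by the speech of the user wearing the electronic device 100.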
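The time-domain variant can be sketched the same way. Assumptions: the band-pass filter is a second-order high-pass at 300 Hz cascaded with a second-order low-pass at 3000 Hz, built from the widely used RBJ "Audio EQ Cookbook" biquad formulas (the patent does not specify a filter design); the signal energy is the mean square of the filtered frame; and `fs` exceeds 6 kHz so the upper band edge lies below Nyquist. The threshold is again a placeholder for the trained value.

```python
import math
from typing import List, Sequence


def _biquad(samples: Sequence[float], b0: float, b1: float, b2: float,
            a0: float, a1: float, a2: float) -> List[float]:
    """Direct Form I biquad:
    y[n] = (b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]) / a0
    """
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def bandpass(samples: Sequence[float], fs: float, f_lo: float = 300.0,
             f_hi: float = 3000.0, q: float = 1.0 / math.sqrt(2.0)) -> List[float]:
    """High-pass at f_lo cascaded with low-pass at f_hi (RBJ cookbook
    coefficients; Q = 1/sqrt(2) makes each section Butterworth)."""
    if not 0.0 < f_lo < f_hi < fs / 2.0:
        raise ValueError("need 0 < f_lo < f_hi < fs/2")
    w = 2.0 * math.pi * f_lo / fs
    c, alpha = math.cos(w), math.sin(w) / (2.0 * q)
    high = _biquad(samples, (1 + c) / 2, -(1 + c), (1 + c) / 2,
                   1 + alpha, -2 * c, 1 - alpha)
    w = 2.0 * math.pi * f_hi / fs
    c, alpha = math.cos(w), math.sin(w) / (2.0 * q)
    return _biquad(high, (1 - c) / 2, 1 - c, (1 - c) / 2,
                   1 + alpha, -2 * c, 1 - alpha)


def time_domain_band_energy(accel_frame: Sequence[float], fs: float) -> float:
    """Mean-square energy of the band-pass-filtered frame."""
    filtered = bandpass(accel_frame, fs)
    return sum(y * y for y in filtered) / len(filtered) if filtered else 0.0


def wearer_is_speaking_time(accel_frame: Sequence[float], fs: float,
                            first_energy_threshold: float) -> bool:
    """Time-domain decision: filtered energy above the trained threshold."""
    return time_domain_band_energy(accel_frame, fs) > first_energy_threshold
```

Compared with the DFT sketch, this version runs sample by sample with constant memory, which suits a streaming sensor path; the trade-off is a filter start-up transient at the beginning of each frame.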

It should be noted that, in both of the above implementations, the first preset energy threshold is obtained in advance through extensive training and is the energy threshold used to distinguish whether the voice signal is generated by the speech of the user wearing the electronic device. When the user wearing the electronic device speaks, the vibration frequency and amplitude of the upper and lower jaw bones of the user's head are relatively large, so the signal energy of the data signal collected by the gravity sensor 101 is also relatively large. Therefore, if the signal energy of the data signal is greater than the first preset energy threshold, the voice signal is generated by the speech of the user wearing the electronic device; conversely, if the signal energy is less than or equal to the first preset energy threshold, the vibration frequency and amplitude of the upper and lower jaw bones are relatively small, which indicates that the voice signal is not generated by the speech of the user wearing the electronic device. The preset frequency range is the frequency range of the human voice during speech; specifically, it is 300 to 3000 Hz. Because the frequency range of speech differs from that of environmental noise, counting only the signal energy within the preset frequency range, that is, within the speech band, filters out the influence of other noise energy on the result and makes the determination more accurate.

Step S203: Ignore the voice signal.

In this embodiment, if the voice signal is not generated by the speech of the user wearing the electronic device 100, it was produced by ambient noise or by other people speaking rather than by the wearer. The voice signal is therefore ignored and no keyword recognition is performed on it, which saves power in the electronic device 100.

In step S204, a keyword recognition process is started, and if it is recognized that the voice signal contains a preset voice command keyword, the electronic device 100 is controlled to perform a corresponding function.

In this embodiment, if the voice signal is generated by the speech of the user wearing the electronic device 100, it may be a voice control command input by the user. The keyword recognition process is therefore started to identify whether the voice signal contains a preset voice command keyword. If it does, the electronic device 100 is controlled to perform the corresponding voice control function; if it does not, the voice signal was generated by the user's speech but is not a voice control command, so it is ignored and the electronic device 100 is not woken up.
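Putting steps S201 to S204 together, the following sketch shows the control flow of Embodiment 1. The mean-square energy measure, the callback names (`read_gravity_sensor`, `is_wearer_speaking`, `recognize_keyword`, `execute`) and the string return values are hypothetical, chosen only to make each branch observable; `is_wearer_speaking` could be either of the band-energy checks described above.

```python
from typing import Callable, Optional, Sequence


def voice_wake_up(
    mic_frame: Sequence[float],
    read_gravity_sensor: Callable[[], Sequence[float]],
    is_wearer_speaking: Callable[[Sequence[float]], bool],
    recognize_keyword: Callable[[Sequence[float]], Optional[str]],
    execute: Callable[[str], None],
    second_energy_threshold: float,
) -> str:
    """Embodiment 1 control flow (S201 to S204) with hypothetical callbacks."""
    # S201: VAD gate on the microphone frame (mean-square energy is an assumption).
    energy = sum(x * x for x in mic_frame) / len(mic_frame) if mic_frame else 0.0
    if energy <= second_energy_threshold:
        return "idle"  # first preset condition not met; sensor is never read
    accel = read_gravity_sensor()
    # S202/S203: ignore the signal unless the wearer is the one speaking.
    if not is_wearer_speaking(accel):
        return "ignored"
    # S204: keyword recognition runs only for the wearer's own speech.
    keyword = recognize_keyword(mic_frame)
    if keyword is None:
        return "ignored"  # wearer spoke, but not a voice command
    execute(keyword)
    return "executed"
```

The ordering is the whole power argument: the comparatively expensive keyword recognizer is reached only after both cheap checks pass.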

It can be seen from the above that, in the voice wake-up method provided by this embodiment, when the voice activity detection process detects that the voice signal collected by the microphone 102 satisfies the first preset condition, it is further determined whether the voice signal is generated by the speech of the user wearing the electronic device 100, and the keyword recognition process is started only when it is, which saves energy in the electronic device 100 and prevents it from being woken up by mistake. In addition, because the data signal collected by the gravity sensor 101 of the electronic device 100 is used to make this determination, no dedicated bone conduction microphone or other contact microphone is needed; the cost is low, and the algorithm is simple, practical, accurate, and light on resources.


Embodiment 2

FIG. 3 is a schematic diagram of a specific implementation flow of the voice wake-up method provided in the second embodiment of the present invention. The method is applied to the electronic device 100 shown in FIG. 1, and its execution body is the processor 103 in the electronic device 100 shown in FIG. 1. Referring to FIG. 3, the voice wake-up method provided in this embodiment may include the following steps:

Step S301: Determine whether the voice activity detection process detects that the voice signal collected by the microphone 102 satisfies the first preset condition; if it does, proceed to step S302-1 and step S302-2 simultaneously. This step is implemented in the same way as in Embodiment 1 and is not repeated here.

Step S302-1: Determine, according to the degree of word loss allowed by the keyword recognition process and the activation speed of the microphone 102, whether to start the keyword recognition process in advance; if the allowed degree of word loss and the activation speed of the microphone 102 satisfy a second preset condition, proceed to step S303-1.

Step S303-1: Start the keyword recognition process, and recognize whether the voice signal contains preset voice command keywords.

Here, the first preset condition is that the voice energy of the voice signal is greater than a second preset energy threshold. The voice activity detection process and the microphone 102 remain on while the electronic device 100 is in the standby state. When the microphone 102 collects a voice signal, the signal is passed to the voice activity detection process, which computes its voice energy; when the voice energy exceeds the second preset energy threshold, the processor 103 is triggered to read the data signal collected by the gravity sensor 101.

The second preset energy threshold is set in advance to prevent the processor from being repeatedly triggered to determine whether the voice signal is generated by the speech of the user wearing the electronic device. Because the processor 103 is triggered to read the data signal collected by the gravity sensor 101 only when the voice energy of the voice signal exceeds the second preset energy threshold, the processor 103 is not repeatedly triggered in a noisy environment to perform this determination, which further reduces the power consumption of the device.

The second preset condition is that the degree of word loss allowed by the keyword recognition process is less than a preset threshold of word loss and the activation speed of the microphone 102 is less than the preset activation speed threshold.

In this embodiment, when the degree of word loss allowed by the keyword recognition process is less than the preset word-loss threshold and the activation speed of the microphone 102 is less than the preset activation speed threshold, the method proceeds to step S303-1 and starts the keyword recognition process in advance. This avoids the situation in which, because the microphone 102 starts slowly, waiting to start keyword recognition would lose too many words: the voice signal from the user could not be captured in time and the complete voice control command could not be recognized. Conversely, if the allowed degree of word loss is greater than or equal to the preset word-loss threshold, or the activation speed of the microphone 102 is greater than or equal to the preset activation speed threshold, the voice control command is unlikely to be lost, so the keyword recognition process is not started in advance; the wake-up process in that case is the same as in Embodiment 1 and is not repeated here.

Step S302-2: Trigger the processor 103 to read the data signal collected by the gravity sensor 101;

Step S303-2: Determine, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device 100. Steps S302-2 and S303-2 are implemented in the same way as the corresponding steps in Embodiment 1 and are not repeated here.

Step S304: If the voice signal includes preset voice command keywords and the voice signal is generated by the user's speech, the electronic device 100 is controlled to perform a corresponding function.

Step S305: If the voice signal does not include the preset voice command keywords or the voice signal is not generated by the user's speech, the voice signal is ignored.

In this embodiment, the electronic device 100 is controlled to perform the corresponding voice control function only when the voice signal both contains a preset voice command keyword and is generated by the speech of the user wearing the electronic device 100. If either condition is not met, the voice signal is ignored, which avoids false wake-up of the electronic device 100.
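Embodiment 2 differs from Embodiment 1 mainly in that keyword recognition may start before the wearer check finishes. A minimal sketch, assuming Python threads stand in for processes running concurrently on the processor 103 and that all callbacks and threshold values are hypothetical placeholders: the second preset condition decides whether to run the two checks concurrently, and the device wakes only if both succeed (S304); otherwise the signal is ignored (S305).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def should_start_kws_early(allowed_word_loss: float, mic_activation_speed: float,
                           word_loss_threshold: float,
                           activation_speed_threshold: float) -> bool:
    """Second preset condition: both values fall below their preset thresholds."""
    return (allowed_word_loss < word_loss_threshold
            and mic_activation_speed < activation_speed_threshold)


def wake_up_embodiment2(
    mic_frame: Sequence[float],
    read_gravity_sensor: Callable[[], Sequence[float]],
    is_wearer_speaking: Callable[[Sequence[float]], bool],
    recognize_keyword: Callable[[Sequence[float]], Optional[str]],
    execute: Callable[[str], None],
    start_early: bool,
) -> bool:
    """Runs after S301 has passed; returns True if the device was woken."""
    if start_early:
        # S303-1 runs concurrently with S302-2/S303-2.
        with ThreadPoolExecutor(max_workers=2) as pool:
            keyword_future = pool.submit(recognize_keyword, mic_frame)
            wearer_future = pool.submit(
                lambda: is_wearer_speaking(read_gravity_sensor()))
            keyword = keyword_future.result()
            wearer = wearer_future.result()
    else:
        # Same ordering as Embodiment 1: keyword recognition only after the wearer check.
        wearer = is_wearer_speaking(read_gravity_sensor())
        keyword = recognize_keyword(mic_frame) if wearer else None
    # S304: both conditions hold, so perform the function; S305: otherwise ignore.
    if wearer and keyword is not None:
        execute(keyword)
        return True
    return False
```

Running both checks concurrently spends some recognizer effort on signals that later turn out not to be the wearer's, in exchange for not losing the first words of a command while a slow microphone comes up.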

It can be seen from the above that the voice wake-up method provided by this embodiment also avoids false wake-up of the electronic device 100. Because the data signal collected by the gravity sensor 101 of the electronic device 100 is used to determine whether the voice signal is generated by the speech of the user wearing the electronic device 100, no dedicated bone conduction microphone or other contact microphone is needed; the cost is low, and the algorithm is simple, practical, accurate, and light on resources. In addition, compared with the previous embodiment, the method of this embodiment starts the keyword recognition process in advance when the degree of word loss allowed by the keyword recognition process is less than the preset word-loss threshold and the activation speed of the microphone 102 is less than the preset activation speed threshold. This avoids the situation in which, because the microphone 102 starts too slowly and keyword recognition is not started in advance, too many words are lost and the complete voice control command cannot be recognized.


Embodiment 3

FIG. 4 is a schematic structural diagram of a voice wake-up system provided by Embodiment 3 of the present invention. The system is applied to the electronic device 100 described in FIG. 1 and runs in the processor 103 of the electronic device 100 described in FIG. 1. For convenience of description, only the parts related to this embodiment are shown.

Referring to FIG. 4, the voice wake-up system 4 provided in this embodiment includes:

The voice activity detection unit 41 is configured to trigger the processor 103 to read the data signal collected by the gravity sensor 101 if the voice activity detection process detects that the voice signal collected by the microphone 102 satisfies the first preset condition, the first preset condition being that the voice energy of the voice signal is greater than a second preset energy threshold;

The first determining unit 42 is configured to determine, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device 100;

The execution unit 43 is configured to ignore the voice signal if it is not generated by the user's speech; or, if it is generated by the user's speech, to start the keyword recognition process and, if the voice signal is recognized as containing a preset voice command keyword, control the electronic device 100 to perform the corresponding function.

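Optionally, the first determining unit 42 is specifically configured to:

Perform a time-frequency transform on the data signal and extract the components whose frequencies lie within a preset frequency range;

Compute, in the frequency domain, the signal energy of the components within the preset frequency range;

Determine whether the signal energy is greater than a first preset energy threshold;

If the signal energy is greater than the first preset energy threshold, determine that the voice signal is generated by the speech of the user wearing the electronic device 100;

If the signal energy is less than or equal to the first preset energy threshold, determine that the voice signal is not generated by the speech of the user wearing the electronic device 100;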

The first preset energy threshold is obtained in advance through extensive training and is the energy threshold used to distinguish whether the voice signal is generated by the speech of the user wearing the electronic device. When the user wearing the electronic device speaks, the vibration frequency and amplitude of the upper and lower jaw bones of the user's head are relatively large, so the signal energy of the data signal collected by the gravity sensor 101 is also relatively large. Therefore, if the signal energy of the data signal is greater than the first preset energy threshold, the voice signal is generated by the speech of the user wearing the electronic device; conversely, if the signal energy is less than or equal to the first preset energy threshold, the vibration frequency and amplitude of the upper and lower jaw bones are relatively small, which indicates that the voice signal is not generated by the speech of the user wearing the electronic device.

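Alternatively, the first determining unit 42 is specifically configured to:

Perform band-pass filtering on the data signal to extract the components whose frequencies lie within the preset frequency range;

Compute, in the time domain, the signal energy of the components within the preset frequency range;

If the signal energy is greater than the first preset energy threshold, determine that the voice signal is generated by the speech of the user wearing the electronic device 100; otherwise, determine that it is not.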

Optionally, the voice wake-up system further includes:

The second determining unit 44 is configured to determine whether to start the keyword recognition process in advance according to the degree of word loss allowed by the keyword recognition process and the activation speed of the microphone 102;

The execution unit 43 is further configured to:

If the degree of word loss allowed by the keyword recognition process and the activation speed of the microphone 102 satisfy the second preset condition, start the keyword recognition process in advance, in which case keyword recognition runs concurrently with the determination of whether the voice signal is generated by the speech of the user wearing the electronic device 100;

If the voice signal contains preset voice command keywords and the voice signal is generated by the user's speech, control the electronic device 100 to perform the corresponding function; or,

If the voice signal does not include preset voice command keywords or the voice signal is not generated by the user's speech, then the voice signal is ignored.

Optionally, the second preset condition is that the degree of word loss allowed by the keyword recognition process is less than a preset degree of word loss threshold and the activation speed of the microphone 102 is less than a preset activation speed threshold.

Optionally, the first preset condition is that the voice energy of the voice signal is greater than a second preset energy threshold. The second preset energy threshold is set in advance to prevent the processor from being repeatedly triggered to determine whether the voice signal is generated by the speech of the user wearing the electronic device. Because the processor 103 is triggered to read the data signal collected by the gravity sensor 101 only when the voice energy of the voice signal exceeds the second preset energy threshold, the processor 103 is not repeatedly triggered in a noisy environment to perform this determination, which further reduces the power consumption of the device.

It should be noted that the units of the above system provided by the embodiments of the present invention are based on the same concept as the method embodiments of the present invention and have the same technical effects; for details, refer to the description of the method embodiments, which is not repeated here.

A person of ordinary skill in the art can understand that all or some of the steps in the method disclosed in this embodiment can be implemented as software, firmware, hardware, and appropriate combinations thereof.

Example four

FIG. 5 is a schematic structural diagram of an electronic device 100 according to Embodiment 4 of the present invention. For convenience of description, only the parts related to this embodiment are shown.

As shown in FIG. 5, the electronic device 100 provided in this embodiment includes a gravity sensor 101, a microphone 102, a memory 104, a processor 103, and a computer program 105 that is stored in the memory 104 and can run on the processor 103. The gravity sensor 101, the microphone 102, and the memory 104 are all electrically connected to the processor 103. When the processor 103 executes the computer program 105, it implements the steps of the voice wake-up method of the first or second embodiment described above. The electronic device 100 includes, but is not limited to, smart wearable devices such as earphones.

The electronic device 100 of this embodiment is based on the same concept as the voice wake-up method of the first or second embodiment. Its specific implementation process is described in detail in the method embodiments, and the technical features of the method embodiments apply correspondingly to this device embodiment, so they are not repeated here.

Example five

The fifth embodiment of the present invention provides a computer-readable storage medium. The medium stores a computer program which, when executed by a processor, implements the steps of the voice wake-up method described in the first or second embodiment above.

The computer-readable storage medium of this embodiment is based on the same concept as the voice wake-up method of embodiment 1 or embodiment 2 above. Its specific implementation process is described in detail in the method embodiments, and the technical features of the method embodiments apply correspondingly to this embodiment, so they are not repeated here.

In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components. For example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all of the physical components may be implemented in one of three ways: as software executed by the processor 103, such as a central processing unit, a digital signal processor, or a microprocessor; as hardware; or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on a computer-readable medium, which may include a computer storage medium (or non-transitory medium) and a communication medium (or transitory medium). As is well known to those of ordinary skill in the art, the term computer storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but they do not limit the scope of rights of the present invention. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and essence of the present invention shall fall within the scope of rights of the present invention.

Industrial applicability

The voice wake-up method, system, electronic device, and computer-readable storage medium provided by the embodiments of the present invention work as follows. When the voice activity detection process detects that the voice signal collected by the microphone meets the first preset condition, the device further determines whether the voice signal is generated by the speech of the user wearing the electronic device. The keyword recognition process is started only when the voice signal is recognized as being generated by the user's speech. This saves energy on the electronic device and avoids false wake-ups. In addition, the data signal collected by the gravity sensor of the electronic device is used to determine whether the voice signal is generated by the user wearing the electronic device, so no dedicated bone conduction microphone or other contact microphone is needed. The cost is low, and the algorithm is simple and practical, with high accuracy and low resource consumption. Therefore, the invention has industrial applicability.
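The overall control flow summarized above can be sketched as follows. Each stage is passed in as a callable, so the sketch does not commit to any particular VAD, wearer-detection, or keyword-recognition algorithm. The function name `wake_pipeline`, its string return values, and the keyword-to-action dictionary are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Frame = Sequence[float]


def wake_pipeline(mic_frame: Frame,
                  vad_passes: Callable[[Frame], bool],
                  read_gravity_sensor: Callable[[], Frame],
                  is_wearer_speech: Callable[[Frame], bool],
                  recognize_keyword: Callable[[Frame], Optional[str]],
                  actions: Dict[str, Callable[[], None]]) -> str:
    """Return "idle", "ignored", or the keyword whose action was executed."""
    # Stage 1: voice activity detection (first preset condition).
    if not vad_passes(mic_frame):
        return "idle"
    # Stage 2: read the gravity sensor only after VAD fires.
    accel = read_gravity_sensor()
    # Stage 3: ignore speech that did not come from the wearer.
    if not is_wearer_speech(accel):
        return "ignored"
    # Stage 4: keyword recognition runs only for the wearer's speech.
    keyword = recognize_keyword(mic_frame)
    if keyword is None or keyword not in actions:
        return "ignored"
    actions[keyword]()
    return keyword
```

The stage ordering is what saves power and prevents false wake-ups. The comparatively expensive keyword recognizer is never invoked for speech from bystanders.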

Claims

1. A voice wake-up method, applied to an electronic device, wherein the electronic device includes a processor, a gravity sensor, and a microphone, the gravity sensor and the microphone being electrically connected to the processor, and the voice wake-up method includes the following steps performed by the processor:

If the voice activity detection process detects that the voice signal collected by the microphone meets the first preset condition, trigger the processor to read the data signal collected by the gravity sensor;

Judging, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device;

If the voice signal is not generated by the user's speech, ignore the voice signal; or,

If the voice signal is generated by the user's speech, a keyword recognition process is initiated, and if it is recognized that the voice signal contains a preset voice command keyword, the electronic device is controlled to perform a corresponding function.
2. The voice wake-up method according to claim 1, wherein judging, according to the data signal, whether the voice signal is generated by the speech of a user wearing the electronic device comprises:

Performing time-frequency conversion on the data signal to extract the data signal components whose frequencies lie within a preset frequency range;

Calculating, in the frequency domain, the signal energy of the data signal components within the preset frequency range;

Judging whether the signal energy is greater than a first preset energy threshold;

If the signal energy is greater than the first preset energy threshold, it means that the voice signal is generated by the speech of the user wearing the electronic device;

If the signal energy is less than or equal to the first preset energy threshold, it means that the voice signal is not generated by the speech of the user wearing the electronic device.
3. The voice wake-up method according to claim 1, wherein judging, according to the data signal, whether the voice signal is generated by the speech of a user wearing the electronic device comprises:

Performing band-pass filtering on the data signal to extract the data signal components whose frequencies lie within a preset frequency range;

Calculating, in the time domain, the signal energy of the data signal components within the preset frequency range;

Judging whether the signal energy is greater than a first preset energy threshold;

If the signal energy is greater than the first preset energy threshold, it means that the voice signal is generated by the speech of the user wearing the electronic device;

If the signal energy is less than or equal to the first preset energy threshold, it means that the voice signal is not generated by the speech of the user wearing the electronic device.
4. The voice wake-up method according to claim 1, wherein, if the voice activity detection process running on the processor detects that the voice signal collected by the microphone meets the first preset condition, the method further comprises:

Judging whether to start the keyword recognition process in advance according to the degree of word loss allowed by the keyword recognition process and the microphone activation speed;

If the degree of word loss allowed by the keyword recognition process and the microphone activation speed meet a second preset condition, the keyword recognition process is started in advance; in this case, the keyword recognition process runs synchronously with the process of detecting whether the voice signal is generated by the speech of the user wearing the electronic device;

If the voice signal contains a preset voice command keyword and is generated by the user's speech, controlling the electronic device to perform the corresponding function; or,

If the voice signal does not contain a preset voice command keyword, or is not generated by the user's speech, ignoring the voice signal.
5. The voice wake-up method according to claim 4, wherein the second preset condition is that the degree of word loss allowed by the keyword recognition process is less than a preset word-loss threshold and the microphone activation speed is less than a preset activation speed threshold.
6. The voice wake-up method according to claim 1, wherein the first preset condition is that the voice energy of the voice signal is greater than a second preset energy threshold.
7. A voice wake-up system, applied to an electronic device, wherein the electronic device includes a processor, a gravity sensor, and a microphone, the gravity sensor and the microphone being respectively electrically connected to the processor, and the voice wake-up system includes:

A voice activity detection unit, configured to trigger the processor to read the data signal collected by the gravity sensor if the voice activity detection process detects that the voice signal collected by the microphone meets the first preset condition;

A first judgment unit, configured to determine, according to the data signal, whether the voice signal is generated by the speech of the user wearing the electronic device;

An execution unit, configured to ignore the voice signal if the voice signal is not generated by the user's speech; or, if the voice signal is generated by the user's speech, to start a keyword recognition process and, if the voice signal is recognized as containing a preset voice command keyword, to control the electronic device to perform a corresponding function.
8. The voice wake-up system according to claim 7, wherein the first judgment unit is specifically configured to:

Performing time-frequency conversion on the data signal to extract the data signal components whose frequencies lie within a preset frequency range;

Calculating, in the frequency domain, the signal energy of the data signal components within the preset frequency range;

Judging whether the signal energy is greater than a first preset energy threshold;

If the signal energy is greater than the first preset energy threshold, it means that the voice signal is generated by the speech of the user wearing the electronic device;

If the signal energy is less than or equal to the first preset energy threshold, it means that the voice signal is not generated by the speech of the user wearing the electronic device;

Or, the first judgment unit is specifically configured to:

Performing band-pass filtering on the data signal to extract the data signal components whose frequencies lie within a preset frequency range;

Calculating, in the time domain, the signal energy of the data signal components within the preset frequency range;

Judging whether the signal energy is greater than a first preset energy threshold;

If the signal energy is greater than the first preset energy threshold, it means that the voice signal is generated by the speech of the user wearing the electronic device;

If the signal energy is less than or equal to the first preset energy threshold, it means that the voice signal is not generated by the speech of the user wearing the electronic device.
9. An electronic device, including a gravity sensor, a microphone, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the gravity sensor, the microphone, and the memory are all electrically connected to the processor, and when the processor executes the computer program, the steps of the voice wake-up method according to any one of claims 1 to 6 are implemented.
10. A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the voice wake-up method according to any one of claims 1 to 6.
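Claims 2 and 3, and the corresponding options in claim 8, describe two ways of measuring in-band accelerometer energy. Claim 2 measures it in the frequency domain after a time-frequency conversion. Claim 3 measures it in the time domain after band-pass filtering. The Python sketch below illustrates both under several assumptions:

- A direct DFT stands in for the time-frequency conversion. A real device would more likely use an FFT.
- An RBJ-cookbook biquad serves as one possible band-pass filter.
- The sample rate, the 100 to 300 Hz band, and the threshold used in the usage checks are hypothetical values, not taken from the patent.

The two energies are normalized differently, so each method would need its own calibrated first preset energy threshold.

```python
import cmath
import math
from typing import List, Sequence


def band_energy_freq(x: Sequence[float], fs: float,
                     f_lo: float, f_hi: float) -> float:
    """Claim 2 style: DFT, then sum |X[k]|^2 over bins inside [f_lo, f_hi]."""
    n = len(x)
    if n == 0:
        return 0.0
    energy = 0.0
    for k in range(n // 2 + 1):          # non-negative frequencies only
        if f_lo <= k * fs / n <= f_hi:
            xk = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                     for t in range(n))
            energy += abs(xk) ** 2
    return energy / n


def bandpass_biquad(x: Sequence[float], fs: float,
                    f_lo: float, f_hi: float) -> List[float]:
    """Second-order band-pass (RBJ cookbook form, 0 dB peak gain)."""
    f0 = math.sqrt(f_lo * f_hi)          # geometric center frequency
    q = f0 / (f_hi - f_lo)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0     # b1 is zero for this filter
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    y: List[float] = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b0 * xn + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def band_energy_time(x: Sequence[float], fs: float,
                     f_lo: float, f_hi: float) -> float:
    """Claim 3 style: band-pass filter, then mean-square energy."""
    y = bandpass_biquad(x, fs, f_lo, f_hi)
    return sum(v * v for v in y) / len(y) if y else 0.0


def is_wearer_speech(signal_energy: float,
                     first_energy_threshold: float) -> bool:
    """Energy above the first preset threshold => the wearer is speaking."""
    return signal_energy > first_energy_threshold
```

On a real earbud, the band edges, the accelerometer sample rate, and the threshold would need tuning. The goal is for vibration from the wearer's own voice to exceed the threshold while airborne speech from bystanders does not.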