CN111192590A - Voice wake-up method, device, equipment and storage medium - Google Patents

Voice wake-up method, device, equipment and storage medium

Info

Publication number: CN111192590A (application CN202010072558.6A; granted as CN111192590B)
Authority: CN (China)
Legal status: Granted; Active
Prior art keywords: wake-up, voice information, voice, model, wake
Other languages: Chinese (zh)
Inventor: 杨程
Original and current assignee: AI Speech Ltd
Application filed by AI Speech Ltd

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L17/06: Decision making techniques; Pattern matching strategies

Abstract

The embodiments of the invention disclose a voice wake-up method, apparatus, device, and storage medium. The method comprises the following steps: acquiring voice information in real time and identifying the environment type corresponding to the voice information; determining a target wake-up model from a preset wake-up model group according to that environment type, wherein the wake-up model group comprises at least two wake-up models and different wake-up models correspond to different environment types; and inputting the voice information into the target wake-up model to detect a wake-up word, and executing a device wake-up operation when the wake-up word is detected. With this technical scheme, a higher wake-up rate and a lower false wake-up rate can be achieved simultaneously in different scenes.

Description

Voice wake-up method, device, equipment and storage medium
Technical Field
Embodiments of the present invention relate to information processing technologies, and in particular, to a voice wake-up method, apparatus, device, and storage medium.
Background
With the popularization of intelligent electronic devices, many electronic devices have a voice wake-up function.
At present, voice wake-up methods mainly use a single wake-up model based on a deep neural network to detect whether the voice contains a wake-up word: if the wake-up word is detected, the device is woken up; otherwise, the device stays silent. This scheme performs well in quiet scenes, with a low false wake-up rate and a high wake-up rate, but its performance degrades in noisy scenes, where a high false wake-up rate or a low wake-up rate easily occurs. Such noisy scenes are common in daily life (a loud shop, an office with people talking, a room playing TV or music, and so on) and pose a challenge to voice wake-up technology.
In addition, to pursue a lower false wake-up rate and a higher wake-up rate, the prior art must improve the adaptability of the neural network model to different scenes, which is usually done by changing the model structure and increasing the model size.
Disclosure of Invention
Embodiments of the present invention provide a voice wake-up method, apparatus, device, and storage medium that achieve both a higher wake-up rate and a lower false wake-up rate across different scenarios.
In a first aspect, an embodiment of the present invention provides a voice wake-up method, including:
acquiring voice information in real time, and identifying an environment type corresponding to the voice information;
determining a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types; and
inputting the voice information into the target wake-up model to detect a wake-up word, and executing a device wake-up operation when the wake-up word is detected.
In a second aspect, an embodiment of the present invention further provides a voice wake-up apparatus, where the apparatus includes:
an environment recognition module, configured to acquire voice information in real time and recognize an environment type corresponding to the voice information;
a model determining module, configured to determine a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types; and
a voice detection module, configured to input the voice information into the target wake-up model to detect the wake-up word, and to execute a device wake-up operation when the wake-up word is detected.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the voice wake-up method according to any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the voice wake-up method according to any one of the embodiments of the present invention.
In the embodiments of the invention, voice information is acquired in real time and the environment type corresponding to it is recognized; a target wake-up model is determined from a preset wake-up model group according to that environment type; the voice information is then input into the target wake-up model to detect the wake-up word; and the device wake-up operation is executed when the wake-up word is detected. By selecting different wake-up models for different environments to detect the wake-up word in the voice information, this solves the prior-art problem that a single model adapted to many scenes cannot achieve both a higher wake-up rate and a lower false wake-up rate across scenes, and realizes both at the same time in different scenes.
Drawings
Fig. 1 is a flowchart illustrating a voice wake-up method according to an embodiment of the present invention;
fig. 2a is a schematic flowchart of a voice wake-up method according to a second embodiment of the present invention;
fig. 2b is a schematic flow chart of a device wake-up method according to a second embodiment of the present invention;
fig. 3a is a schematic flowchart of a voice wake-up method according to a third embodiment of the present invention;
fig. 3b is a schematic flow chart of a device wake-up method applicable to the third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a voice wake-up apparatus according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart illustrating a voice wake-up method according to an embodiment of the present invention. The method is applicable to voice wake-up of electronic devices and can be executed by a voice wake-up apparatus, which may be implemented in hardware and/or software and is typically integrated in mobile phones, tablets, and other electronic devices with a voice wake-up function. The method specifically comprises the following steps:
and S110, acquiring the voice information in real time, and identifying the environment type corresponding to the voice information.
In this embodiment, the voice information may be information obtained by collecting the voice of the user through a self-contained or external sound collector (e.g., a microphone) on the device. Since the user may be in different scenes, the voice information acquired in real time carries an environmental sound, and the corresponding environmental type can be identified through the voice information. The environment type includes, but is not limited to, a street, an office, a room, a market, etc., and is not limited thereto. Specifically, several typical environment types can be preset according to actual needs, and then the environment type corresponding to the voice information acquired in real time is identified according to the characteristics of each environment type.
Optionally, acquiring the voice information in real time includes: performing voice endpoint detection on the external environment sound; and, if a voice signal is detected in the external environment sound, acquiring the voice information corresponding to the voice signal in real time.
For example, voice endpoint detection may be performed on the external environment sound by a VAD (Voice Activity Detection) module, whose purpose is to identify and eliminate long silence periods from the sound signal stream and thereby determine whether a voice signal is present. When no voice signal is detected, no subsequent recognition is performed; when a voice signal is detected, the voice information corresponding to it is acquired in real time. Specifically, acquiring the voice information may mean converting the voice signal in the external environment sound into voltage or current information that the electronic device can process.
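The patent does not specify a particular VAD algorithm. As a minimal illustration of the gating described above, the sketch below uses a simple frame-energy threshold; the function names, threshold, and frame-count parameter are all illustrative assumptions, not part of the patent.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (a list of samples)."""
    return sum(s * s for s in frame) / len(frame)

def detect_speech(frames, energy_threshold=0.01, min_speech_frames=3):
    """Very rough VAD: report speech once enough consecutive frames
    exceed the energy threshold. Returns the index of the first frame
    of the detected speech run, or None if only silence was seen."""
    run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) > energy_threshold:
            run += 1
            if run >= min_speech_frames:
                return i - min_speech_frames + 1
        else:
            run = 0  # silence resets the run, eliminating long silence periods
    return None
```

When `detect_speech` returns None, no subsequent recognition would run, mirroring the silent path described in the text.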
Optionally, acquiring the voice information corresponding to the voice signal in real time includes: acquiring the voice information in real time from a preset moment before the voice signal is detected, until the voice signal disappears or the device is woken up.
To ensure that all voice information is captured, acquisition can trace back a short period once the voice signal is detected, i.e., start at a preset moment before the voice signal appears and end either when the voice signal is detected to disappear or, earlier, when the device has been woken up.
For example, the voice information is acquired in real time starting 1 s before the time point at which the voice signal is detected, until the voice signal disappears or the device has been woken up.
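Tracing back to a preset moment before detection is usually done with a small ring buffer of recent audio frames. The sketch below (class and method names are illustrative assumptions) shows the idea with a bounded deque:

```python
from collections import deque

class PreRollBuffer:
    """Keeps the most recent `preroll_frames` frames so that, when a
    voice signal is detected, capture can start from audio that arrived
    *before* the detection time point (e.g. roughly 1 s earlier)."""

    def __init__(self, preroll_frames):
        self._buf = deque(maxlen=preroll_frames)

    def push(self, frame):
        # Old frames fall off automatically once maxlen is reached.
        self._buf.append(frame)

    def start_capture(self):
        """Return the buffered frames, i.e. the audio from the preset
        moment before detection onward."""
        return list(self._buf)
```

On detection, the buffered frames are emitted first and live frames are appended until the signal disappears or the device wakes.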
S120, determining a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types.
In this embodiment, a corresponding wake-up model may be set for each preset environment type. Once the environment type corresponding to the voice information is determined, the wake-up model corresponding to that type can be selected from a wake-up model group containing multiple wake-up models, so that wake-word detection is performed on the voice information acquired in an actual scene of that type. Specifically, the wake-up models in the group may be obtained by training a preset neural network, such as a Feed-forward Sequential Memory Network (FSMN), on voice samples containing wake-up words collected in different scenes.
Because each wake-up model is customized for a specific scene, a higher wake-up rate and a lower false wake-up rate can be guaranteed in that scene, greatly improving user experience. Moreover, since each model only has to handle one application scene, each model is smaller than a single wake-up model trained on a mixture of scenes, saving running space on the chip.
As a practical example, if the environment type corresponding to the voice information is an office, i.e., the user was most likely in an office when issuing the voice command, the wake-up model corresponding to the office type is selected from the preset wake-up model group as the target wake-up model, and that target model performs wake-word detection in the subsequent steps.
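Step S120 amounts to a lookup from environment type to wake-up model. A minimal sketch of that mapping, where the environment names, model names, and fallback behavior are illustrative assumptions:

```python
# Hypothetical wake-model group: one model per preset environment type.
wake_model_group = {
    "office": "wake_model_office",
    "street": "wake_model_street",
    "room":   "wake_model_room",
    "market": "wake_model_market",
}

def select_target_model(environment_type, model_group, default="office"):
    """Pick the wake-up model matching the recognized environment type,
    falling back to a default model when the type is unrecognized."""
    return model_group.get(environment_type, model_group[default])
```

In a real system the dictionary values would be loaded model objects rather than strings; the selection logic is the same.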
S130, inputting the voice information into the target wake-up model to detect the wake-up word, and executing a device wake-up operation when the wake-up word is detected.
In this embodiment, the wake-up word may be a preset keyword for waking up the device, such as the device name and/or a combination of the device name and a specific word; this is not limited here. The wake-up models may, for example, be stored on a high-power chip to improve operating efficiency: multiple wake-up models corresponding to different scenes are stored on the high-power chip, the target wake-up model is invoked according to the recognized environment type, and the other wake-up models stay silent and occupy no running space. The incoming voice information is checked for the wake-up word by the target wake-up model; if the wake-up word is detected, the device is woken up, otherwise it is not. The device wake-up operation includes, but is not limited to, lighting up the display screen, starting an unlock recognition module, and so on.
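The patent leaves the wake decision itself unspecified. A common pattern, sketched here purely as an assumption, is to threshold the per-frame confidence scores the target model emits and trigger the wake-up operation on a crossing:

```python
def detect_wake_word(scores, threshold=0.8):
    """Assume the wake-up model emits a per-frame confidence for the
    wake word; report a detection when any score crosses the threshold."""
    return any(s >= threshold for s in scores)

def run_wakeup(scores, wake_device, threshold=0.8):
    """Execute the device wake-up operation (e.g. light the screen,
    start the unlock module) only when the wake word is detected."""
    if detect_wake_word(scores, threshold):
        wake_device()
        return True
    return False
```

The threshold value 0.8 is illustrative; a deployed system would tune it per scene against wake-rate and false-wake targets.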
In this technical scheme, voice information is acquired in real time and the corresponding environment type is recognized; a target wake-up model is determined from a preset wake-up model group according to that environment type; the voice information is input into the target wake-up model to detect the wake-up word; and the device wake-up operation is executed when the wake-up word is detected. Selecting different wake-up models for different environments solves the prior-art problem that a single model adapted to many scenes cannot achieve both a higher wake-up rate and a lower false wake-up rate, and achieves both simultaneously in different scenes.
Example two
Fig. 2a is a flowchart illustrating a voice wake-up method according to a second embodiment of the present invention. This embodiment is optimized on the basis of the above embodiment and provides a preferred voice wake-up method. Specifically, identifying the environment type corresponding to the voice information is refined to: inputting the voice information into a preliminary detection model in real time for preliminary wake-word detection while simultaneously inputting it into an environment recognition model in real time, which outputs the environment type corresponding to the voice information. Correspondingly, inputting the voice information into the target wake-up model to detect the wake-up word and executing the device wake-up operation when the wake-up word is detected is refined to: if the wake-up word is preliminarily detected, acquiring the wake-word segment corresponding to the voice information; then inputting that wake-word segment into the target wake-up model for wake-word detection, and executing the device wake-up operation when the wake-up word is detected. The method specifically comprises the following steps:
S210, acquiring voice information in real time, inputting it into the preliminary detection model in real time for preliminary wake-word detection, and simultaneously inputting it into the environment recognition model in real time, which outputs the environment type corresponding to the voice information.
In this embodiment, a preliminary wake-word detection step is added to the above embodiment: when the voice information is acquired, it is first input to the preliminary detection model. The preliminary detection is a coarse-grained check of the voice information; for example, every suspected wake-up word is treated as a wake-up word, which improves the device's wake-up rate. While the voice information is being input to the preliminary detection model, it is simultaneously input to the environment recognition model; running the two in parallel reduces wake-up latency and keeps wake-word detection real-time. The environment recognition model can be a small trained neural network placed on a low-power chip and used to determine the usage scene of the electronic device.
For example, since the preliminary detection model has lower computational requirements than a wake-up model, it can be placed on a low-power chip as a primary model, while the wake-up models in the wake-up model group are placed on a high-power chip as secondary models. Specifically, the preliminary detection model may be obtained by training a preset neural network, such as a DNN (Deep Neural Network) or an RNN (Recurrent Neural Network), on voice samples containing wake-up words; this is not limited here.
In a practical example, the VAD module continuously monitors the external environment sound; when a voice is detected, the acquired audio, starting a fixed length before the detection time point, is streamed to the primary model for preliminary wake-word detection and simultaneously to the environment recognition model to determine the device's current environment type. The primary model and the environment recognition model are connected in parallel and run almost simultaneously, which is very time-efficient.
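The parallel wiring of the primary model and the environment recognition model can be sketched with two threads over the same audio; the function names and the thread-based concurrency are illustrative assumptions (on real hardware the two models run on separate chips rather than threads):

```python
import threading

def run_parallel(audio, primary_detect, recognize_environment):
    """Run the coarse primary wake-word check and the environment
    classifier on the same audio at the same time, as in the
    parallel arrangement of the second embodiment."""
    results = {}
    t1 = threading.Thread(
        target=lambda: results.update(wake=primary_detect(audio)))
    t2 = threading.Thread(
        target=lambda: results.update(env=recognize_environment(audio)))
    t1.start(); t2.start()
    t1.join(); t2.join()   # both finish before the secondary stage
    return results["wake"], results["env"]
```

Because the two writers touch distinct keys, the shared dict needs no extra locking here; the point is only that neither model waits on the other, which is what reduces the wake-up delay.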
S220, determining a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types.
S230, if the wake-up word is preliminarily detected, acquiring the wake-word segment corresponding to the voice information.
For example, if the preliminary detection model finds that the voice information contains a wake-up word, the wake-word segment corresponding to the voice information is acquired first; the wake-word segment is the portion of the voice information that contains the wake-up word.
Optionally, acquiring the wake-word segment corresponding to the voice information includes: determining the time point at which the wake-up word was preliminarily detected; and taking, as the wake-word segment, the portion of the voice information between a first preset duration before that time point and a second preset duration after it.
The first and second preset durations may be the same or different; this is not limited here. As a practical example, during continuous voice detection, if the primary model detects a wake-up word, the voice segment around the detection time point (for example, 1.5 seconds before and after, 3 seconds in total) is intercepted as the wake-word segment. Note that one piece of voice information may contain one wake-word segment or several.
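The interception described above is a slice of the sample buffer around the detection time point, clamped to the signal boundaries. A sketch, with parameter names as illustrative assumptions and the 1.5 s defaults taken from the example in the text:

```python
def wakeword_segment(samples, sample_rate, detect_time,
                     before_s=1.5, after_s=1.5):
    """Cut the slice of audio around the time point at which the
    primary model reported the wake word: `before_s` seconds before
    and `after_s` seconds after (default 3 s total), clamped so the
    slice never runs past the start or end of the signal."""
    start = max(0, int((detect_time - before_s) * sample_rate))
    end = min(len(samples), int((detect_time + after_s) * sample_rate))
    return samples[start:end]
```

Only this short segment, not the whole stream, is forwarded to the secondary model, which keeps the fine-grained check cheap.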
S240, inputting the wake-word segment corresponding to the voice information into the target wake-up model to detect the wake-up word, and executing the device wake-up operation when the wake-up word is detected.
For example, the wake-word segment obtained during preliminary detection can be subjected to targeted secondary wake-word recognition. The secondary detection is a fine-grained check of the wake-word segment; for example, all suspected wake-up words that are not actual wake-up words are excluded, which reduces the device's false wake-up rate.
In a practical example, the real-time audio stream is fed into the primary model for preliminary wake-word detection and simultaneously into the environment type detection model, whose result is used to select the appropriate secondary model. If the primary model detects a wake-up word, the wake-word segment of the audio stream and the environment type result are sent on, the matching secondary model is selected, and it re-recognizes the wake-up word in that segment. If the primary model does not detect a wake-up word, nothing is transmitted to the high-power chip, which stays silent, saving the chip's running resources.
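The cascade just described can be summarized in one gating function; all parameter names are illustrative assumptions, with the models passed in as callables:

```python
def two_stage_wakeup(audio, primary_detect, recognize_environment,
                     model_group, get_segment, wake_device):
    """Cascade sketch: the coarse primary model gates the fine-grained
    secondary model, which stays silent (saving high-power-chip
    resources) unless a suspected wake word is found."""
    suspected = primary_detect(audio)
    env_type = recognize_environment(audio)   # runs alongside the primary check
    if not suspected:
        return False                          # nothing sent to the high-power chip
    segment = get_segment(audio)              # wake-word segment only
    secondary = model_group.get(env_type, next(iter(model_group.values())))
    if secondary(segment):                    # fine-grained re-check
        wake_device()
        return True
    return False
```

The fallback to an arbitrary model when the environment type is unknown is an assumption; the patent only requires that each environment type map to a model.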
On the basis of this embodiment, refer to the schematic diagram of the device wake-up process shown in Fig. 2b: the process of inputting the voice information to the primary model in real time for preliminary wake-word detection and the process of inputting it to the environment recognition model in real time for environment type recognition run simultaneously, i.e., the primary model and the environment recognition model are connected in parallel.
In this technical scheme, voice information is acquired in real time and input into the preliminary detection model for preliminary wake-word detection while simultaneously being input into the environment recognition model, which outputs the corresponding environment type; a target wake-up model is determined from the preset wake-up model group according to that environment type; if the wake-up word is preliminarily detected, the wake-word segment corresponding to the voice information is acquired and input into the target wake-up model for wake-word detection; and the device wake-up operation is executed when the wake-up word is detected. By selecting different wake-up models for different environments and recognizing the environment type during (rather than after) preliminary detection, this achieves both a higher wake-up rate and a lower false wake-up rate in different scenes while reducing wake-up latency and keeping device wake-up real-time.
Example three
Fig. 3a is a flowchart illustrating a voice wake-up method according to a third embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments and provides a preferred voice wake-up method. Specifically, identifying the environment type corresponding to the voice information is refined to: inputting the voice information into the preliminary detection model in real time for preliminary wake-word detection; if the wake-up word is preliminarily detected, acquiring the wake-word segment corresponding to the voice information; then inputting that wake-word segment into the environment recognition model, which outputs the environment type corresponding to the voice information. Correspondingly, inputting the voice information into the target wake-up model to detect the wake-up word and executing the device wake-up operation when the wake-up word is detected is refined to: inputting the wake-word segment corresponding to the voice information into the target wake-up model for wake-word detection, and executing the device wake-up operation when the wake-up word is detected. The method specifically comprises the following steps:
and S310, acquiring voice information in real time, and inputting the voice information to the preliminary detection model in real time to perform preliminary detection on the awakening words.
In this embodiment, on the basis of the first embodiment, a process of primarily detecting the wake-up word is added, that is, when the voice information is acquired, the voice information is first input to the primary detection model to perform primary detection on the wake-up word, where the primary detection is coarse-grained detection on the voice information, for example, all suspected wake-up words are determined as wake-up words, so as to improve the wake-up rate of the device.
For an exemplary description of the preliminary detection model, reference may be made to the second embodiment described above, and details are not repeated herein.
In an actual example, the VAD module is configured to continuously detect external environment sound, and if a voice is detected, transmit audio to the primary model from a fixed length before a voice detection time point for detecting a wakeup word, and store a section of audio with a specific length for subsequently determining an environment type of a current scene of the device.
S320, if the wake-up word is preliminarily detected, acquiring the wake-word segment corresponding to the voice information.
In this embodiment, the wake-word segment is acquired in the same way as in the second embodiment and is not described again here.
S330, inputting the wake-word segment corresponding to the voice information into the environment recognition model, which outputs the environment type corresponding to the voice information.
The description of the environment recognition model is the same as in the second embodiment and is not repeated here. Because environment type recognition happens only after the preliminary wake-word detection, and because the low-power chip must run continuously, this embodiment runs only the preliminary detection model on the low-power chip, which reduces the chip's power consumption and saves its running space.
In a practical example, if the primary model detects a wake-up word, the wake-word segment of the voice information is sent to the high-power chip for further processing: first, the environment type detection model classifies the wake-word segment to decide which secondary model to use; then the chosen secondary model performs wake-word detection on the segment. If a wake-up word is detected, the device is woken up; otherwise it is not.
On the basis of this embodiment, refer to the schematic diagram of the device wake-up process shown in Fig. 3b: the voice information is input to the primary model for preliminary wake-word detection, and only when the wake-up word is detected is the wake-word segment of the voice information input to the environment recognition model for environment type recognition. These two processes run sequentially, i.e., the primary model and the environment recognition model are connected in series.
S340, determining a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types.
S350, inputting the wake-word segment corresponding to the voice information into the target wake-up model to detect the wake-up word, and executing the device wake-up operation when the wake-up word is detected.
In this technical scheme, voice information is acquired in real time and input into the preliminary detection model for preliminary wake-word detection; if the wake-up word is preliminarily detected, the wake-word segment corresponding to the voice information is acquired and input into the environment recognition model, which outputs the corresponding environment type; a target wake-up model is determined from the preset wake-up model group according to that environment type; the wake-word segment is then input into the target wake-up model for wake-word detection; and the device wake-up operation is executed when the wake-up word is detected. By selecting different wake-up models for different environments and recognizing the environment type only after a wake-up word is preliminarily detected, this achieves both a higher wake-up rate and a lower false wake-up rate in different scenes while reducing chip power consumption and saving the chip's running space.
On the basis of the foregoing embodiments, optionally, the method further includes: obtaining an environmental sound sample with an environment type label; and training a set neural network model according to the environmental sound sample and the corresponding environment type label to obtain the environment recognition model.
For example, different environment types can be set for different usage scenarios. For each environment type, a plurality of environmental sound samples are collected and tagged with the corresponding environment type label. When training the environment recognition model, the environmental sound samples and their environment type labels are used to train the set neural network model; once the model parameters reach their optimum, the trained model can be used as the environment recognition model. The set neural network model may be, for example, an RNN model or a DNN model, which is not limited here.
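As a deliberately simplified illustration of this training step (the patent's set neural network model would be an RNN or DNN; here a class-centroid classifier stands in just to show the data flow, and all names such as `train_env_recognizer` are hypothetical):

```python
def train_env_recognizer(samples, labels):
    """Toy stand-in for training the environment recognition model.
    `samples` are feature vectors (plain lists of floats) extracted
    from environmental sound; `labels` are the environment type tags.
    A real system would train the RNN/DNN mentioned in the text."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        members = [s for s, l in zip(samples, labels) if l == c]
        dim = len(members[0])
        # mean feature vector per environment type
        centroids[c] = [sum(m[d] for m in members) / len(members)
                        for d in range(dim)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def recognize(feature_vec):
        # classify a new sound by its nearest environment-type centroid
        return min(centroids, key=lambda c: dist(feature_vec, centroids[c]))

    return recognize
```

The returned `recognize` callable plays the role of the trained environment recognition model: given a feature vector of ambient sound, it outputs an environment type label.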
Example four
Fig. 4 is a schematic structural diagram of a voice wake-up apparatus according to a fourth embodiment of the present invention. Referring to fig. 4, the voice wake-up apparatus includes an environment recognition module 410, a model determination module 420, and a voice detection module 430, each of which is described in detail below.
The environment recognition module 410 is configured to obtain voice information in real time and recognize an environment type corresponding to the voice information;
a model determining module 420, configured to determine a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, where the wake-up model group includes at least two wake-up models, and different wake-up models correspond to different environment types;
a voice detection module 430, configured to input the voice information into the target wake-up model to detect a wake-up word, and execute a device wake-up operation when it is determined that the wake-up word is detected.
The voice wake-up device provided by this embodiment acquires voice information in real time, recognizes the environment type corresponding to the voice information, determines a target wake-up model from a preset wake-up model group according to that environment type, inputs the voice information to the target wake-up model for wake-up word detection, and executes a device wake-up operation when the wake-up word is determined to be detected. By selecting different wake-up models for different environments to detect the wake-up word in the voice information, it solves the prior-art problem that using one model to adapt to multiple scenes cannot simultaneously achieve a higher wake-up rate and a lower false wake-up rate in different scenes, and thus achieves both a higher wake-up rate and a lower false wake-up rate in different scenes.
Optionally, the environment recognition module 410 may specifically include:
the endpoint detection submodule is used for carrying out voice endpoint detection on external environment sound;
and the information acquisition submodule is used for acquiring the voice information corresponding to the voice signal in real time if the voice signal is detected from the external environment sound.
Optionally, the information obtaining sub-module may be specifically configured to:
and acquiring voice information in real time from a preset moment before the voice signal is detected to appear until the voice signal disappears or the equipment is awakened.
Optionally, the environment recognition module 410 may be further specifically configured to:
input the voice information into a preliminary detection model in real time for preliminary wake-up word detection, simultaneously input the voice information into an environment recognition model in real time, and output the environment type corresponding to the voice information;
correspondingly, the voice detection module 430 may be specifically configured to:
if the wake-up word is preliminarily determined to be detected, acquire the wake-up word segment corresponding to the voice information;
and input the wake-up word segment corresponding to the voice information into the target wake-up model for wake-up word detection, and execute a device wake-up operation when the wake-up word is determined to be detected.
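As a hedged sketch of this simultaneous variant (all names hypothetical), the environment type is updated on the same real-time stream as the preliminary detection, so it is already available the moment a preliminary hit occurs:

```python
def parallel_wake(voice_frames, primary_model, env_model, wake_models,
                  extract_segment):
    """Sketch of the parallel variant: environment recognition runs on
    the live stream alongside the preliminary detection, instead of
    waiting for a preliminary hit as in the serial variant."""
    env_type = None
    for i, frame in enumerate(voice_frames):
        env_type = env_model(frame)        # kept up to date "in parallel"
        if primary_model(frame):           # preliminary wake-word hit
            segment = extract_segment(voice_frames, i)
            return wake_models[env_type](segment)
    return False
```

Compared with the serial variant, this trades extra always-on computation (the environment model runs on every frame) for zero added latency once the preliminary hit arrives.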
Optionally, the environment recognition module 410 may be further specifically configured to:
input the voice information to a preliminary detection model in real time for preliminary wake-up word detection;
if the wake-up word is preliminarily determined to be detected, acquire the wake-up word segment corresponding to the voice information;
input the wake-up word segment corresponding to the voice information into an environment recognition model, and output the environment type corresponding to the voice information;
correspondingly, inputting the voice information into the target wake-up model for wake-up word detection and executing a device wake-up operation when the wake-up word is determined to be detected includes:
inputting the wake-up word segment corresponding to the voice information into the target wake-up model for wake-up word detection, and executing the device wake-up operation when the wake-up word is determined to be detected.
Optionally, acquiring the wake-up word segment corresponding to the voice information may specifically further include:
determining a time point when the wake-up word is preliminarily detected;
and acquiring an information segment between a first preset time length before the time point and a second preset time length after the time point in the voice information, and taking the information segment as a wake-up word segment corresponding to the voice information.
Optionally, the apparatus may further include:
the sample acquisition module is used for acquiring an environmental sound sample with an environmental type label;
and the model training module is used for training a set neural network model according to the environmental sound sample and the corresponding environment type label to obtain the environment recognition model.
The above product can execute the method provided by any embodiment of the present invention, and has the functional modules for executing the method and the corresponding beneficial effects.
EXAMPLE five
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. As shown in fig. 5, the electronic device of this embodiment includes a processor 51 and a memory 52. The number of processors in the electronic device may be one or more; one processor 51 is taken as an example in fig. 5. The processor 51 and the memory 52 in the electronic device may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 5.
In this embodiment, the processor 51 of the electronic device is integrated with the voice wake-up apparatus provided in the above embodiments. In addition, the memory 52 of the electronic device serves as a computer-readable storage medium for storing one or more programs, which may be software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice wake-up method in the embodiments of the present invention (for example, the environment recognition module 410, the model determination module 420, and the voice detection module 430 in the voice wake-up apparatus shown in fig. 4). The processor 51 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 52, thereby implementing the voice wake-up method in the above method embodiment.
The memory 52 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to use of the device, and the like. Further, the memory 52 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 52 may further include memory located remotely from the processor 51, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
And, when the one or more programs included in the above electronic device are executed by the one or more processors 51, the programs perform the following operations:
acquiring voice information in real time, and identifying an environment type corresponding to the voice information; determining a target awakening model from a preset awakening model group according to the environment type corresponding to the voice information, wherein the awakening model group comprises at least two awakening models, and different awakening models correspond to different environment types; and inputting the voice information into the target awakening model to detect awakening words, and executing equipment awakening operation when the awakening words are determined to be detected.
EXAMPLE six
The sixth embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a voice wake-up apparatus, implements a voice wake-up method according to the first embodiment of the present invention, where the method includes: acquiring voice information in real time, and identifying an environment type corresponding to the voice information; determining a target awakening model from a preset awakening model group according to the environment type corresponding to the voice information, wherein the awakening model group comprises at least two awakening models, and different awakening models correspond to different environment types; and inputting the voice information into the target awakening model to detect awakening words, and executing equipment awakening operation when the awakening words are determined to be detected.
Of course, the computer program stored on the computer-readable storage medium provided in the embodiments of the present invention is not limited, when executed, to the method operations described above, and may also perform the relevant operations in the voice wake-up method provided in any embodiment of the present invention.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention can be implemented by software together with necessary general-purpose hardware, and certainly can also be implemented by hardware alone, although the former is the better implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the voice wake-up apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A voice wake-up method, comprising:
acquiring voice information in real time, and identifying an environment type corresponding to the voice information;
determining a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types;
and inputting the voice information into the target wake-up model to detect a wake-up word, and executing a device wake-up operation when the wake-up word is determined to be detected.
2. The method of claim 1, wherein obtaining voice information in real-time comprises:
carrying out voice endpoint detection on external environment sound;
and if the voice signal is detected from the external environment sound, acquiring the voice information corresponding to the voice signal in real time.
3. The method of claim 2, wherein obtaining the voice information corresponding to the voice signal in real time comprises:
and acquiring voice information in real time from a preset moment before the voice signal is detected to appear until the voice signal disappears or the equipment is awakened.
4. The method of claim 1, wherein identifying the environment type corresponding to the voice information comprises:
inputting the voice information into a preliminary detection model in real time for preliminary wake-up word detection, simultaneously inputting the voice information into an environment recognition model in real time, and outputting the environment type corresponding to the voice information;
correspondingly, inputting the voice information into the target wake-up model for wake-up word detection and executing a device wake-up operation when the wake-up word is determined to be detected comprises:
if the wake-up word is preliminarily determined to be detected, acquiring a wake-up word segment corresponding to the voice information;
and inputting the wake-up word segment corresponding to the voice information into the target wake-up model for wake-up word detection, and executing the device wake-up operation when the wake-up word is determined to be detected.
5. The method of claim 1, wherein identifying the environment type corresponding to the voice information comprises:
inputting the voice information into a preliminary detection model in real time for preliminary wake-up word detection;
if the wake-up word is preliminarily determined to be detected, acquiring a wake-up word segment corresponding to the voice information;
inputting the wake-up word segment corresponding to the voice information into an environment recognition model, and outputting the environment type corresponding to the voice information;
correspondingly, inputting the voice information into the target wake-up model for wake-up word detection and executing a device wake-up operation when the wake-up word is determined to be detected comprises:
and inputting the wake-up word segment corresponding to the voice information into the target wake-up model for wake-up word detection, and executing the device wake-up operation when the wake-up word is determined to be detected.
6. The method according to claim 4 or 5, wherein acquiring the wake-up word segment corresponding to the voice information comprises:
determining a time point when the wake-up word is preliminarily detected;
and acquiring an information segment between a first preset time length before the time point and a second preset time length after the time point in the voice information, and taking the information segment as a wake-up word segment corresponding to the voice information.
7. The method of claim 4 or 5, further comprising:
obtaining an environmental sound sample with an environmental type label;
and training a set neural network model according to the environmental sound sample and the corresponding environmental type label to obtain the environmental recognition model.
8. A voice wake-up apparatus, comprising:
the environment recognition module is used for acquiring voice information in real time and recognizing an environment type corresponding to the voice information;
the model determining module is used for determining a target wake-up model from a preset wake-up model group according to the environment type corresponding to the voice information, wherein the wake-up model group comprises at least two wake-up models, and different wake-up models correspond to different environment types;
and the voice detection module is used for inputting the voice information into the target wake-up model to detect the wake-up word and executing a device wake-up operation when the wake-up word is determined to be detected.
9. An electronic device, characterized in that the device comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voice wake-up method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the voice wake-up method according to any one of claims 1 to 7.
CN202010072558.6A 2020-01-21 2020-01-21 Voice wake-up method, device, equipment and storage medium Active CN111192590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010072558.6A CN111192590B (en) 2020-01-21 2020-01-21 Voice wake-up method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111192590A true CN111192590A (en) 2020-05-22
CN111192590B CN111192590B (en) 2022-09-23

Family

ID=70710785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010072558.6A Active CN111192590B (en) 2020-01-21 2020-01-21 Voice wake-up method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111192590B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358605A1 (en) * 2013-03-12 2016-12-08 Nuance Communications, Inc. Methods and apparatus for detecting a voice command
CN108932944A (en) * 2017-10-23 2018-12-04 北京猎户星空科技有限公司 Coding/decoding method and device
CN109065044A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Wake up word recognition method, device, electronic equipment and computer readable storage medium
CN110047485A (en) * 2019-05-16 2019-07-23 北京地平线机器人技术研发有限公司 Identification wakes up method and apparatus, medium and the equipment of word
CN110364143A (en) * 2019-08-14 2019-10-22 腾讯科技(深圳)有限公司 Voice awakening method, device and its intelligent electronic device


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111722696A (en) * 2020-06-17 2020-09-29 苏州思必驰信息科技有限公司 Voice data processing method and device for low-power-consumption equipment
CN111722696B (en) * 2020-06-17 2021-11-05 思必驰科技股份有限公司 Voice data processing method and device for low-power-consumption equipment
CN112365899A (en) * 2020-10-30 2021-02-12 北京小米松果电子有限公司 Voice processing method, device, storage medium and terminal equipment
CN112365899B (en) * 2020-10-30 2024-07-16 北京小米松果电子有限公司 Voice processing method, device, storage medium and terminal equipment
CN112164398A (en) * 2020-11-05 2021-01-01 佛山市顺德区美的电子科技有限公司 Voice equipment and awakening method and device thereof and storage medium
CN112164398B (en) * 2020-11-05 2023-08-15 佛山市顺德区美的电子科技有限公司 Voice equipment, wake-up method and device thereof and storage medium
CN114187909A (en) * 2021-12-14 2022-03-15 思必驰科技股份有限公司 Voice wake-up method and system for medical scene
CN116074150A (en) * 2023-03-02 2023-05-05 广东浩博特科技股份有限公司 Switch control method and device for intelligent home and intelligent home
CN116074150B (en) * 2023-03-02 2023-06-09 广东浩博特科技股份有限公司 Switch control method and device for intelligent home and intelligent home

Also Published As

Publication number Publication date
CN111192590B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN111192590B (en) Voice wake-up method, device, equipment and storage medium
CN111223497B (en) Nearby wake-up method and device for terminal, computing equipment and storage medium
CN108182943B (en) Intelligent device control method and device and intelligent device
CN105139858B (en) A kind of information processing method and electronic equipment
CN107147618A (en) A kind of user registering method, device and electronic equipment
CN105210146A (en) Method and apparatus for controlling voice activation
CN105190746A (en) Method and apparatus for detecting a target keyword
KR20160005050A (en) Adaptive audio frame processing for keyword detection
CN103971681A (en) Voice recognition method and system
CN108055617B (en) Microphone awakening method and device, terminal equipment and storage medium
CN110047485A (en) Identification wakes up method and apparatus, medium and the equipment of word
CN112233676B (en) Intelligent device awakening method and device, electronic device and storage medium
CN110570840A (en) Intelligent device awakening method and device based on artificial intelligence
CN111599371A (en) Voice adding method, system, device and storage medium
US11810593B2 (en) Low power mode for speech capture devices
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN111312233A (en) Voice data identification method, device and system
CN113157240A (en) Voice processing method, device, equipment, storage medium and computer program product
CN113963695A (en) Awakening method, awakening device, equipment and storage medium of intelligent equipment
CN113330513B (en) Voice information processing method and equipment
CN112420043A (en) Intelligent awakening method and device based on voice, electronic equipment and storage medium
CN116566760A (en) Smart home equipment control method and device, storage medium and electronic equipment
CN116129942A (en) Voice interaction device and voice interaction method
CN111028830B (en) Local hot word bank updating method, device and equipment
CN115862604A (en) Voice wakeup model training and voice wakeup method, device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215021 building 14, Tengfei Science Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215021 building 14, Tengfei Science Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant