CN112652306A - Voice wake-up method and device, computer equipment and storage medium - Google Patents

Voice wake-up method and device, computer equipment and storage medium

Info

Publication number
CN112652306A
CN112652306A (application CN202011599330.9A)
Authority
CN
China
Prior art keywords
modeling unit
decoding
decoding path
word
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011599330.9A
Other languages
Chinese (zh)
Other versions
CN112652306B (en)
Inventor
匡勇建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Jieli Technology Co Ltd filed Critical Zhuhai Jieli Technology Co Ltd
Priority to CN202011599330.9A priority Critical patent/CN112652306B/en
Publication of CN112652306A publication Critical patent/CN112652306A/en
Application granted granted Critical
Publication of CN112652306B publication Critical patent/CN112652306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application relates to a voice wake-up method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring wake-up word speech; obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model; searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to the penalty function of each modeling unit to obtain the decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval; and waking up the electronic device if the decoding path score is greater than a set threshold. By penalizing modeling units in the decoding path whose durations fall outside the regular duration interval, the decoding path score is lowered and wake-up word speech from abnormal conditions is filtered out, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.

Description

Voice wake-up method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech wake-up method and apparatus, a computer device, and a storage medium.
Background
With the development of speech recognition technology, voice wake-up is widely applied in intelligent products, such as smart home devices, vehicle-mounted electronics, smart speakers, mobile phones, Bluetooth headsets, and the like. Voice wake-up mainly involves three parts: audio preprocessing, the acoustic model, and keyword decoding. Keyword decoding is relatively independent, but it has a significant influence on the voice wake-up performance of the whole system.
Keyword decoding is the process of locating the start and end time points of a given keyword in a speech stream, where a keyword is a word that expresses some substantial meaning, typically a noun or phrase. Keyword decoding is mostly performed with a designed decoding network, so the design of the decoding network is particularly critical. An existing decoding network can achieve good results through reasonable design, but in some special situations, for example when the wake-up word contains reduplicated syllables or the speaking rate of the keyword changes, the accuracy of voice wake-up drops sharply. For example, in voice wake-up decoding with reduplicated keywords such as "xiao du xiao du" and similar names, the decoding network easily wakes up when the user has uttered only half of the keyword; in addition, when the speech rate is too fast or too slow, the performance of the decoding network also degrades.
That is, existing voice wake-up methods suffer from low wake-up accuracy.
Disclosure of Invention
In view of the above, it is desirable to provide a voice wake-up method, apparatus, computer device and storage medium capable of improving the wake-up accuracy.
A voice wake-up method, the method comprising:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
In one embodiment, the method further comprises: performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In one embodiment, performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result includes:
aligning the wake-up word data set through a second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In one embodiment, aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result includes:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In one embodiment, the method further comprises:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In one embodiment, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score includes:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
A voice wake-up apparatus, the apparatus comprising:
the voice acquisition module is used for acquiring wake-up word speech;
the recognition module is used for obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
the decoding module is used for searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and the wake-up module is used for waking up the electronic device if the decoding path score is greater than a set threshold.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
According to the voice wake-up method and apparatus, the computer device, and the storage medium, the posterior probability of each frame of the wake-up word obtained through the acoustic model is used as the input of the decoding network, a decoding path is searched in the decoding network, and modeling units whose durations are not within the regular duration interval are penalized. Penalizing such modeling units in the decoding path lowers the decoding path score, so wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, is filtered out, thereby improving wake-up accuracy.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a voice wake-up method;
FIG. 2 is a flow chart illustrating a voice wake-up method according to an embodiment;
FIG. 3 is a schematic diagram of a decoding network in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating the steps for constructing a penalty function in one embodiment;
FIG. 5 is a schematic illustration of a gamma curve fit to the alignment results in one embodiment;
FIG. 6 is a block diagram of a voice wake-up apparatus in one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice wake-up method provided by the present application can be applied in the application environment shown in fig. 1, in which a user inputs wake-up speech to the terminal 102. The terminal 102 acquires the wake-up word speech; obtains the posterior probability of each frame of the wake-up word through a pre-trained acoustic model; searches for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizes the decoding path according to the penalty function of each modeling unit to obtain the decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval; and wakes up the electronic device if the decoding path score is greater than a set threshold.
The terminal 102 may be, but is not limited to, a smart home device, a vehicle-mounted electronic device, a smart speaker, a mobile phone, and a bluetooth headset.
In one embodiment, as shown in fig. 2, a voice wake-up method is provided, which is described by taking the method as an example applied to the terminal in fig. 1, and includes the following steps:
Step 202: acquiring wake-up word speech.
Specifically, the electronic device is provided with a microphone for collecting the wake-up word speech. The wake-up word is a voice instruction for waking the electronic device from a low-power-consumption state.
Step 204: obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model.
The first acoustic model is obtained by pre-training; it is not limited to a traditional acoustic-model algorithm or a neural-network-based acoustic model, and it is the acoustic model used for voice wake-up in this application. Through the first acoustic model, the probability that the features of each frame of the wake-up word belong to a given modeling unit can be obtained. A modeling unit corresponds to a phoneme, a word, or a phrase (keyword). Taking phonemes as an example, the probability that the features of each speech frame of the wake-up word belong to a given phoneme can be obtained through the first acoustic model. Specifically, during recognition with the first acoustic model, the wake-up word speech is first divided into frames, and the probability that the features of each frame belong to each modeling unit is predicted, yielding the posterior probability of each frame of the wake-up word.
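As a rough illustration of these frame-level posteriors (a minimal Python sketch, not the patent's actual model), the per-frame outputs of an acoustic model can be normalized into posterior probabilities over the modeling units with a softmax; the logits and the small unit inventory below are made up for the example.

```python
import numpy as np

def frame_posteriors(logits):
    """Convert per-frame acoustic-model logits (T x U) into posterior
    probabilities over the U modeling units via a softmax, one row per frame."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Example: 3 frames, 4 modeling units (sil, x, iao, d) -- illustrative only.
post = frame_posteriors([[2.0, 0.1, 0.1, 0.1],
                         [0.1, 3.0, 0.2, 0.1],
                         [0.1, 0.3, 2.5, 0.2]])
print(post.sum(axis=1))   # each frame's posteriors sum to 1
```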
Step 206: searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to the penalty function of each modeling unit to obtain the decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
The decoding network structure is constructed from the specific wake-up keyword and its specific modeling units (for example, its phonemes). For a specific wake-up word, the decoding network is constructed from the corresponding modeling units, of which there is at least one. One modeling unit corresponds to one phoneme, one word, or one phrase (keyword). Specifically, the modeling units of the wake-up word are determined, a decoding network is constructed from these modeling units, and tracking information is added to each modeling unit. The modeling unit is not limited to a phoneme; it may also be a word, the wake-up phrase, or the like. The decoding network may be built with each phoneme of the wake-up word as a modeling unit, or with each word of the wake-up word as a modeling unit, with tracking information added to the modeling units.
Taking the wake-up word "xiao du xiao du" as an example and using the phonemes "x, iao, d, u" as modeling units, the decoding network shown in fig. 3 is constructed. The vertical axis of the figure is the structure of the decoding network, which contains the state corresponding to each modeling unit together with a start state (s) and an end state (e); gbg and sil correspond to the non-keyword phoneme and the silence phoneme, respectively, and x, iao, d, u, x, iao, d, u correspond to the phonemes of the keyword "xiao du xiao du". The unfolding of the keyword phonemes along the time dimension is indicated by the arrows in fig. 3, which point in the direction in which the detected speech-signal decoding path develops over time.
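The sketch below builds such a linear decoding network in Python for the phoneme sequence of "xiao du xiao du". The state dictionaries, the gbg/sil filler states, and the tracking field are illustrative assumptions about one possible data structure, not the patent's exact graph.

```python
# A minimal sketch of a linear decoding network for the wake-up word
# "xiao du xiao du" (phonemes x, iao, d, u repeated). Each keyword state
# records which modeling unit it tracks (the "tracking information").
WAKE_PHONES = ["x", "iao", "d", "u", "x", "iao", "d", "u"]

def build_decoding_network(phones):
    """Return a list of states: start, filler/silence, the keyword phones, end."""
    states = [{"name": "s", "unit": None}]             # start state
    states += [{"name": "gbg", "unit": None},          # non-keyword filler state
               {"name": "sil", "unit": None}]          # silence state
    for i, p in enumerate(phones):
        states.append({"name": f"{p}_{i}", "unit": p})  # keyword phone state
    states.append({"name": "e", "unit": None})          # end state
    return states

network = build_decoding_network(WAKE_PHONES)
print([st["name"] for st in network])
```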
Specifically, forward inference is performed in the decoding network to carry out the path search. The forward inference of the decoding network takes the per-frame posterior probabilities of the first acoustic model as input data; these posteriors may be output by a conventional traditional acoustic-model algorithm or by a neural-network-based acoustic model.
During forward inference, the decoding network performs keyword detection on the input speech. When matching keyword information is detected, the current decoding path length is computed and the duration of the corresponding phoneme is obtained; this continues until all keywords have been detected, so that the length of the decoding path and the duration of each phoneme are obtained.
During the decoding path search, modeling units whose durations are not within the regular duration interval are penalized. For example, if the regular duration interval of a certain phoneme is 2 to 5 frames and the duration of that phoneme determined by the decoding path for the current wake-up word speech is 6 frames, which is outside the regular duration interval, the phoneme is penalized. Penalizing modeling units in the decoding path whose durations fall outside the regular duration interval lowers the decoding path score and filters out wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.
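A minimal sketch of this duration check is shown below; the interval values and the flat penalty size are invented for the example (the actual penalty in this application is the penalty-function term described later), but the idea of lowering the score of paths with out-of-interval durations is the same.

```python
# Toy duration check in the spirit of the example above: durations outside each
# phoneme's regular interval lower the path score. All numbers are illustrative.
REGULAR_INTERVALS = {"x": (2, 5), "iao": (3, 8), "d": (2, 5), "u": (3, 8)}
PENALTY = 4.0   # log-domain score subtracted per out-of-interval unit (assumed value)

def penalize_path(path_score, phone_durations):
    """phone_durations: list of (phoneme, duration_in_frames) along the path."""
    for phone, dur in phone_durations:
        lo, hi = REGULAR_INTERVALS[phone]
        if not (lo <= dur <= hi):
            path_score -= PENALTY        # push abnormal paths below the wake threshold
    return path_score

print(penalize_path(10.0, [("x", 3), ("iao", 6), ("d", 6), ("u", 4)]))  # "d" is penalized
```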
Step 208: waking up the electronic device if the decoding path score is greater than the set threshold.
For the wake-up word speech, the decoding path is penalized according to the penalty function of each modeling unit, that is, modeling units in the decoding path whose durations are not within the regular duration interval are penalized, which lowers the decoding path score. If the penalized decoding path score is still greater than the set threshold, the electronic device is woken up.
Specifically, the score of each decoding path is computed with the decoding network and compared with a set threshold T; if the score is greater than T, voice wake-up is executed, otherwise wake-up is not executed.
In the voice wake-up method above, the posterior probability of each frame of the wake-up word obtained by the first acoustic model is used as the input of the decoding network, a decoding path is searched in the decoding network, and modeling units whose durations are not within the regular duration interval are penalized. This lowers the decoding path score and filters out wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.
In another embodiment, the voice wake-up method further comprises: performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result, the penalty function penalizing modeling units whose durations are not within the regular duration interval.
Specifically, the regular duration of a modeling unit refers to the number of frames that the modeling unit, for example a phoneme, lasts under normal conditions. The statistics of the regular duration of each modeling unit are obtained by counting over a wake-up word data set. For example, if for a phoneme X one utterance yields 3 consecutive frames and another yields 5 consecutive frames, statistics may show that the regular duration is 3 to 5 frames; a penalty function is then constructed, and during recognition of the wake-up word, if the number of consecutive frames of this phoneme is less than 3 or greater than 5, the penalty function of the phoneme is applied. Analyzing the frequency statistics of the regular duration of each modeling unit gives the number of consecutive frames that each modeling unit actually lasts under normal conditions.
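One possible way to collect these duration statistics from frame-level alignments is sketched below; the run-length counting over per-frame labels is an assumption about how the alignment result is represented.

```python
from collections import defaultdict
from itertools import groupby

def duration_counts(aligned_paths):
    """aligned_paths: iterable of frame-level label sequences, e.g.
    ["sil","x","x","iao","iao","iao","d","u","u", ...] per utterance.
    Returns {phoneme: {duration_in_frames: count}} for the histogram step."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in aligned_paths:
        for phone, run in groupby(path):           # consecutive identical labels = one unit
            counts[phone][sum(1 for _ in run)] += 1
    return counts

# Two toy aligned utterances (illustrative labels, not real alignment output).
stats = duration_counts([
    ["sil", "x", "x", "iao", "iao", "iao", "d", "d", "u", "u", "u"],
    ["sil", "x", "x", "x", "iao", "iao", "d", "d", "u", "u"],
])
print(dict(stats["x"]))   # e.g. {2: 1, 3: 1}
```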
Specifically, frequency statistics are performed on the regular duration of each modeling unit of the wake-up word, and the penalty function of each modeling unit is constructed according to the statistical result. As shown in fig. 4, this includes the following steps:
s402, the awakening word data sets are aligned through the second acoustic model, and frequency statistics is carried out on the duration of each modeling unit based on the alignment result.
The second acoustic model is a large full-phoneme acoustic model. Alignment puts the wake-up word data set under the same evaluation standard: specifically, the second acoustic model is used for alignment, and the duration of each modeling unit in every utterance of the data set is analyzed by the same standard, ensuring the accuracy of the data.
Specifically, the wake-up word data set is aligned through the second acoustic model to obtain the decoding path corresponding to each wake-up word utterance in the data set; frequency statistics of the duration of each modeling unit are then performed according to the decoding path of each wake-up word utterance.
In this embodiment, the alignment uses a full-phoneme network, which decodes the wake-up word speech of the first acoustic model's training set to obtain its decoding paths; these decoding paths are the alignment result.
Histogram statistics can be performed on the alignment result to obtain the frequency statistics of the duration of each modeling unit. Specifically, the histogram statistics count, in the alignment result, the duration frequencies of each modeling unit of the keyword decoding network, and the statistical result is normalized.
S404: performing curve fitting on the statistical result to obtain the distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve.
Specifically, gamma curve fitting is adopted here, but the fitting is not limited to a gamma curve; Gaussian fitting, polynomial fitting, and the like may also be used. The gamma curve function is:
f(x; α, β) = x^(α-1) · e^(-x/β) / (β^α · Γ(α)), for x > 0
where x is the sample data (duration) input to the gamma function, and the parameters α and β are the shape parameter and the scale parameter, respectively. The graph in fig. 5 is the result of fitting a gamma curve to the histogram of one modeling unit (one phoneme; each phoneme has its own thresholds): the horizontal axis is the number of frames in the decoding path, and the vertical axis is the corresponding gamma value. The dashed lines on both sides of the gamma curve are the two thresholds (the left dashed line is t1, the right dashed line is t2), and the range between the two thresholds is the regular duration interval of the modeling unit.
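A sketch of the fitting step using SciPy is given below; mapping SciPy's (shape, scale) parameters to the α and β above, and fitting directly to the duration samples rather than to the normalized histogram, are simplifying assumptions.

```python
import numpy as np
from scipy.stats import gamma

# Duration samples (in frames) for one modeling unit, e.g. taken from the
# histogram above; the numbers here are illustrative.
durations = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 5, 4, 6, 5])

# Fit a gamma distribution; fixing loc=0 keeps the two-parameter shape/scale
# form used in the text (alpha = shape, beta = scale).
alpha, loc, scale = gamma.fit(durations, floc=0)
print(f"shape alpha = {alpha:.2f}, scale beta = {scale:.2f}")

# The fitted density can then be compared against the normalized histogram.
frames = np.arange(1, 13)
print(np.round(gamma.pdf(frames, alpha, loc=loc, scale=scale), 3))
```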
S406: constructing the penalty function according to the distribution curve, the penalty function penalizing modeling units whose durations are not within the regular duration interval.
Specifically, the two thresholds of the penalty term are determined from the distribution curve by an area ratio: the five percent of the area outside the thresholds is penalized, while the remaining ninety-five percent is left unchanged. The thresholds correspond to the ends of the regular duration interval, and modeling units whose durations are not within the regular duration interval are penalized.
For example, the penalty function is:
f(t), a piecewise function of the duration t defined over the two thresholds t1 and t2 (the exact formula is reproduced only as an image in the original publication),
where t1 and t2 are the two thresholds of the penalty term. Note that the two thresholds are adaptively adjusted by an area ratio, for example penalizing the five percent of the area of the gamma curve that lies outside the thresholds while the remaining ninety-five percent stays unchanged.
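One way to derive t1 and t2 from the fitted curve by an area ratio is sketched below; splitting the penalized five percent evenly between the two tails is an assumption, since the text only specifies the overall 5%/95% split.

```python
from scipy.stats import gamma

def regular_interval(alpha, scale, penalized_area=0.05):
    """Choose t1 and t2 so that the two tails outside them together contain
    `penalized_area` of the fitted gamma's probability mass (split evenly here,
    which is an assumption; the text only fixes the overall 5%/95% ratio)."""
    t1 = gamma.ppf(penalized_area / 2.0, alpha, scale=scale)
    t2 = gamma.ppf(1.0 - penalized_area / 2.0, alpha, scale=scale)
    return t1, t2

t1, t2 = regular_interval(alpha=6.0, scale=0.9)   # illustrative fitted parameters
print(round(t1, 1), round(t2, 1))                 # lower and upper duration thresholds (in frames)
```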
Further, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to the penalty function of each modeling unit to obtain the decoding path score includes: searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score for each decoding path; and obtaining the decoding path score from the penalty function of each modeling unit and the first score, the penalty function penalizing modeling units whose durations are not within the regular duration interval.
Specifically, the start position and boundaries of a decoding path are obtained as follows. The per-frame posterior probabilities of the acoustic model are used as the input data of the decoding network, the score of each decoding path is computed, and during the forward inference of the network it is judged whether the current frame has reached the end state; when the end state is reached, the decoding path search continues with the next phoneme. Tracking is added for each modeling unit, so that the decoding path of each modeling unit at the current frame of the network can be obtained. For the decoding paths of the modeling units at the current frame, the start position and boundary of each modeling unit (each phoneme) are computed to obtain the path length and hence the duration of each modeling unit; from the duration of each phoneme, used as the input of the corresponding penalty function, the length of each decoding path and the start position within the corresponding path are obtained, giving the boundaries of the modeling units. In the forward inference of the decoding network, the score of each modeling unit at the current frame is the score of the corresponding decoding path, and whether to wake up is determined from the decoding path score output by the modeling unit at the end of the keyword; the decoding path scores also include the penalty score designed for each modeling unit.
Taking the decoding path (the dashed-line portion) in fig. 3 as an example, the start position and boundaries of the decoding path are obtained. The decoding path is (sil, x, iao, iao, d, u, u, x, x, iao, iao, iao, d, d, u, u); from the decoded result, the start value and boundary value corresponding to each phoneme can be computed, giving the length of each modeling unit of the decoding path, that is, the duration of each phoneme.
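A small helper that recovers these per-unit start positions, boundaries, and durations from such a frame-level path might look as follows (the tuple representation of the path is taken from the example above).

```python
from itertools import groupby

def unit_boundaries(path):
    """Turn a frame-level decoding path into (unit, start_frame, end_frame, duration)
    tuples, which is the information needed to apply each unit's penalty function."""
    out, frame = [], 0
    for unit, run in groupby(path):
        length = sum(1 for _ in run)
        out.append((unit, frame, frame + length - 1, length))
        frame += length
    return out

# The dashed-line path from the figure, written out as per-frame labels.
path = ("sil", "x", "iao", "iao", "d", "u", "u", "x", "x",
        "iao", "iao", "iao", "d", "d", "u", "u")
for unit, start, end, dur in unit_boundaries(path):
    print(unit, start, end, dur)
```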
The decoding network score is composed of two parts: one part is the score of the decoding path itself, and the other part is the score computed from the decoding path and the penalty function; the final score is the sum of the two parts.
Let the score of the decoding path be st, where st has already been converted to the log domain. The total score of the final decoding path is then:
g(t)=st+λ*log(f(t))
where λ is a penalty factor controlling the size of the penalty term, and λ*log(f(t)) is the penalty term.
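The sketch below combines the path score with the penalty term in this way. Because the exact piecewise form of f(t) is only given as an image in the original publication, the stand-in used here returns 1 inside the regular interval [t1, t2] and decays with the fitted gamma density outside it; that choice, and the value of λ, are assumptions.

```python
import numpy as np
from scipy.stats import gamma

def penalty_f(t, a, scale, t1, t2):
    """Stand-in for the piecewise penalty function f(t): f = 1 (no penalty) inside
    the regular duration interval [t1, t2]; outside it, f falls off with the fitted
    gamma density, so log f(t) becomes negative. The exact form is an assumption."""
    if t1 <= t <= t2:
        return 1.0
    edge = t1 if t < t1 else t2
    return max(gamma.pdf(t, a, scale=scale) / gamma.pdf(edge, a, scale=scale), 1e-10)

def total_path_score(st, durations, fitted, intervals, lam=0.5):
    """g(t) = st + lambda * sum(log f(t)) over the units on the path;
    st is the decoding-path score already converted to the log domain."""
    g = st
    for unit, dur in durations.items():
        a, scale = fitted[unit]
        t1, t2 = intervals[unit]
        g += lam * np.log(penalty_f(dur, a, scale, t1, t2))
    return g

# Illustrative per-unit gamma parameters and regular intervals (in frames).
fitted = {"x": (5.0, 0.8), "iao": (7.0, 0.9), "d": (4.0, 0.7), "u": (6.0, 0.9)}
intervals = {"x": (2, 7), "iao": (3, 11), "d": (1, 6), "u": (2, 9)}
print(total_path_score(12.0, {"x": 3, "iao": 6, "d": 8, "u": 5}, fitted, intervals))
```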
according to the same mode, the phoneme (x) or (iao) is replaced by the constraint punishment of the word (xiao), the word (xiao du) and the keyword (xiao du xiao du), so that the multi-layer score evaluation is realized, the pseudo awakening can be eliminated to a great extent, and the voice awakening accuracy is improved.
In this method, the acoustic model is trained as a large-scale network; during training, the data set containing the keywords is decoded and aligned, histogram statistics are then computed from the alignment result, and a gamma curve is fitted to the histogram statistics to obtain the decoding penalty function. The decoding penalty function is used to penalize the score of the decoding network, reducing the probability of false wake-ups caused by reduplicated keywords or speech-rate variation and improving the accuracy of voice wake-up.
It should be understood that although the steps in the flowcharts of fig. 2 and 4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict restriction on the order in which these steps are executed, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a voice wake-up apparatus, including:
A voice acquisition module 602, configured to acquire wake-up word speech.
A recognition module 604, configured to obtain the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model.
A decoding module 606, configured to search for a decoding path in a decoding network of the wake-up word by using the posterior probabilities and penalize the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
A wake-up module 608, configured to wake up the electronic device if the decoding path score is greater than a set threshold.
With the voice wake-up apparatus above, the posterior probability of each frame of the wake-up word obtained by the first acoustic model is used as the input of the decoding network, a decoding path is searched in the decoding network, and modeling units whose durations are not within the regular duration interval are penalized. This lowers the decoding path score and filters out wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.
In another embodiment, the voice wake-up apparatus further comprises:
a construction module, used for performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
The construction module includes:
an alignment module, used for aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result;
a curve fitting module, used for performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit and determining the regular duration interval of the modeling unit from the distribution curve;
and a penalty function construction module, used for constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, the alignment module is configured to align the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set, and to perform frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In another embodiment, the voice wake-up apparatus further comprises:
a modeling unit determining module, used for determining the modeling units of the wake-up word;
and a decoding network construction module, used for constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In another embodiment, the decoding module is configured to search for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path, and to obtain the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
For specific limitations of the voice wake-up apparatus, reference may be made to the limitations of the voice wake-up method above, which are not repeated here. Each module in the voice wake-up apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, a communication interface, a display screen, and a microphone connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a voice wake-up method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result includes:
aligning the wake-up word data set through the second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result includes:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In another embodiment, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score includes:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result includes:
aligning the wake-up word data set through the second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result includes:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In another embodiment, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score includes:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as there is no contradiction between the combined technical features, the combinations should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that several variations and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A voice wake-up method, the method comprising:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
2. The method of claim 1, further comprising: performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
3. The method according to claim 1, wherein performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result comprises:
aligning the wake-up word data set through the second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
4. The method of claim 3, wherein aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result comprises:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
5. The method of claim 1, further comprising:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
6. The method of claim 1, wherein searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score comprises:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
7. A voice wake-up apparatus, the apparatus comprising:
the voice acquisition module is used for acquiring wake-up word speech;
the recognition module is used for obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
the decoding module is used for searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and the wake-up module is used for waking up the electronic device if the decoding path score is greater than a set threshold.
8. The apparatus of claim 7, further comprising:
a construction module, used for performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011599330.9A 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium Active CN112652306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011599330.9A CN112652306B (en) 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011599330.9A CN112652306B (en) 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112652306A true CN112652306A (en) 2021-04-13
CN112652306B CN112652306B (en) 2023-10-03

Family

ID=75364011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011599330.9A Active CN112652306B (en) 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112652306B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327597A (en) * 2021-06-23 2021-08-31 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN114333799A (en) * 2022-03-09 2022-04-12 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
US20150279351A1 (en) * 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
CN105374352A (en) * 2014-08-22 2016-03-02 中国科学院声学研究所 Voice activation method and system
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN110910885A (en) * 2019-12-12 2020-03-24 苏州思必驰信息科技有限公司 Voice awakening method and device based on decoding network
CN111402895A (en) * 2020-06-08 2020-07-10 腾讯科技(深圳)有限公司 Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN111462751A (en) * 2020-03-27 2020-07-28 京东数字科技控股有限公司 Method, apparatus, computer device and storage medium for decoding voice data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
US20150279351A1 (en) * 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
CN105374352A (en) * 2014-08-22 2016-03-02 中国科学院声学研究所 Voice activation method and system
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN110910885A (en) * 2019-12-12 2020-03-24 苏州思必驰信息科技有限公司 Voice awakening method and device based on decoding network
CN111462751A (en) * 2020-03-27 2020-07-28 京东数字科技控股有限公司 Method, apparatus, computer device and storage medium for decoding voice data
CN111402895A (en) * 2020-06-08 2020-07-10 腾讯科技(深圳)有限公司 Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
C.-H. LEE 等: "A frame-synchronous network search algorithm for connected word recognition", 《IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING 》 *
L. TEN BOSCH 等: "Acoustic Scores and Symbolic Mismatch Penalties in Phone Lattices", 《 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING PROCEEDINGS》 *
LOUIS TEN BOSCH,等: "ASR Decoding in a Computational Model of Human Word Recognition", 《PROCEEDINGS OF INTERSPEECH 2005》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327597A (en) * 2021-06-23 2021-08-31 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113327597B (en) * 2021-06-23 2023-08-22 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN113707132B (en) * 2021-09-08 2024-03-01 北京声智科技有限公司 Awakening method and electronic equipment
CN114333799A (en) * 2022-03-09 2022-04-12 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment
CN114333799B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment

Also Published As

Publication number Publication date
CN112652306B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN112652306B (en) Voice wakeup method, voice wakeup device, computer equipment and storage medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
Myer et al. Efficient keyword spotting using time delay neural networks
CN106940998A (en) A kind of execution method and device of setting operation
CN111833866A (en) Method and system for high accuracy key phrase detection for low resource devices
CN110070859B (en) Voice recognition method and device
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN111312222A (en) Awakening and voice recognition model training method and device
CN113450771B (en) Awakening method, model training method and device
CN112967739B (en) Voice endpoint detection method and system based on long-term and short-term memory network
CN111462756A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN112825250A (en) Voice wake-up method, apparatus, storage medium and program product
CN114360510A (en) Voice recognition method and related device
CN111785302A (en) Speaker separation method and device and electronic equipment
CN116978370A (en) Speech processing method, device, computer equipment and storage medium
CN112509556B (en) Voice awakening method and device
CN113851113A (en) Model training method and device and voice awakening method and device
CN110537223B (en) Voice detection method and device
CN112216286B (en) Voice wakeup recognition method and device, electronic equipment and storage medium
CN113658593B (en) Wake-up realization method and device based on voice recognition
CN112289311B (en) Voice wakeup method and device, electronic equipment and storage medium
CN113129874B (en) Voice awakening method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province

Applicant after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province

Applicant before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant