CN112652306A - Voice wake-up method and device, computer equipment and storage medium - Google Patents

Voice wake-up method and device, computer equipment and storage medium

Info

Publication number
CN112652306A
CN112652306A (application CN202011599330.9A)
Authority
CN
China
Prior art keywords
modeling unit
decoding
decoding path
word
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011599330.9A
Other languages
Chinese (zh)
Other versions
CN112652306B (en)
Inventor
匡勇建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Jieli Technology Co Ltd filed Critical Zhuhai Jieli Technology Co Ltd
Priority to CN202011599330.9A priority Critical patent/CN112652306B/en
Publication of CN112652306A publication Critical patent/CN112652306A/en
Application granted granted Critical
Publication of CN112652306B publication Critical patent/CN112652306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application relates to a voice wake-up method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring wake-up word speech; obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model; searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to the penalty function of each modeling unit to obtain the decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval; and waking up the electronic device if the decoding path score is greater than a set threshold. By penalizing modeling units in the decoding path whose durations fall outside the regular duration interval, the decoding path score is lowered and wake-up word speech from abnormal conditions is filtered out, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.

Description

Voice wake-up method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech wake-up method and apparatus, a computer device, and a storage medium.
Background
With the development of speech recognition technology, voice wake-up is widely applied in intelligent products, such as smart home devices, vehicle-mounted electronics, smart speakers, mobile phones, Bluetooth headsets, and the like. Voice wake-up mainly involves three parts: audio preprocessing, the acoustic model, and keyword decoding. Keyword decoding is relatively independent, but it has a significant influence on the voice wake-up performance of the whole system.
Keyword decoding is the process of locating the start and end time points of a given keyword in a speech stream, where a keyword is a word that expresses some substantial meaning, typically a noun or phrase. Keyword decoding is mostly performed with a designed decoding network, so the design of the decoding network is particularly critical. An existing decoding network can achieve good results through reasonable design, but in some special situations, for example when the wake-up word contains reduplicated syllables or the speaking rate of the keyword changes, the accuracy of voice wake-up drops sharply. For example, in voice wake-up decoding with reduplicated keywords such as "xiao du xiao du" and similar names, the decoding network easily wakes up when the user has uttered only half of the keyword; in addition, when the speech rate is too fast or too slow, the performance of the decoding network also degrades.
That is, existing voice wake-up methods suffer from low wake-up accuracy.
Disclosure of Invention
In view of the above, it is desirable to provide a voice wake-up method, apparatus, computer device and storage medium capable of improving the wake-up accuracy.
A voice wake-up method, the method comprising:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
In one embodiment, the method further comprises: performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In one embodiment, performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result includes:
aligning the wake-up word data set through a second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In one embodiment, aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result includes:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In one embodiment, the method further comprises:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In one embodiment, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score includes:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
A voice wake-up apparatus, the apparatus comprising:
the voice acquisition module is used for acquiring wake-up word speech;
the recognition module is used for obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
the decoding module is used for searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and the wake-up module is used for waking up the electronic device if the decoding path score is greater than a set threshold.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
According to the voice wake-up method and apparatus, the computer device, and the storage medium, the posterior probability of each frame of the wake-up word obtained through the acoustic model is used as the input of the decoding network, a decoding path is searched in the decoding network, and modeling units whose durations are not within the regular duration interval are penalized. Penalizing such modeling units in the decoding path lowers the decoding path score, so wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, is filtered out, thereby improving wake-up accuracy.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a voice wake-up method;
FIG. 2 is a flow chart illustrating a voice wake-up method according to an embodiment;
FIG. 3 is a schematic diagram of a decoding network in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating the steps for constructing a penalty function in one embodiment;
FIG. 5 is a schematic illustration of a gamma curve fit to the alignment results in one embodiment;
FIG. 6 is a block diagram of a voice wake-up apparatus in one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice wake-up method provided by the present application can be applied in the application environment shown in fig. 1, in which a user inputs wake-up speech to the terminal 102. The terminal 102 acquires the wake-up word speech; obtains the posterior probability of each frame of the wake-up word through a pre-trained acoustic model; searches for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizes the decoding path according to the penalty function of each modeling unit to obtain the decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval; and wakes up the electronic device if the decoding path score is greater than a set threshold.
The terminal 102 may be, but is not limited to, a smart home device, a vehicle-mounted electronic device, a smart speaker, a mobile phone, and a bluetooth headset.
In one embodiment, as shown in fig. 2, a voice wake-up method is provided, which is described by taking the method as an example applied to the terminal in fig. 1, and includes the following steps:
Step 202: acquiring wake-up word speech.
Specifically, the electronic device is provided with a microphone for collecting the wake-up word speech. The wake-up word is a voice instruction for waking the electronic device from a low-power-consumption state.
Step 204: obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model.
The first acoustic model is obtained by pre-training; it is not limited to a traditional acoustic-model algorithm or a neural-network-based acoustic model, and it is the acoustic model used for voice wake-up in this application. Through the first acoustic model, the probability that the features of each frame of the wake-up word belong to a given modeling unit can be obtained. A modeling unit corresponds to a phoneme, a word, or a phrase (keyword). Taking phonemes as an example, the probability that the features of each speech frame of the wake-up word belong to a given phoneme can be obtained through the first acoustic model. Specifically, during recognition with the first acoustic model, the wake-up word speech is first divided into frames, and the probability that the features of each frame belong to each modeling unit is predicted, yielding the posterior probability of each frame of the wake-up word.
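As a rough illustration of these frame-level posteriors (a minimal Python sketch, not the patent's actual model), the per-frame outputs of an acoustic model can be normalized into posterior probabilities over the modeling units with a softmax; the logits and the small unit inventory below are made up for the example.

```python
import numpy as np

def frame_posteriors(logits):
    """Convert per-frame acoustic-model logits (T x U) into posterior
    probabilities over the U modeling units via a softmax, one row per frame."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Example: 3 frames, 4 modeling units (sil, x, iao, d) -- illustrative only.
post = frame_posteriors([[2.0, 0.1, 0.1, 0.1],
                         [0.1, 3.0, 0.2, 0.1],
                         [0.1, 0.3, 2.5, 0.2]])
print(post.sum(axis=1))   # each frame's posteriors sum to 1
```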
Step 206: searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to the penalty function of each modeling unit to obtain the decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
The decoding network structure is constructed from the specific wake-up keyword and its specific modeling units (for example, its phonemes). For a specific wake-up word, the decoding network is constructed from the corresponding modeling units, of which there is at least one. One modeling unit corresponds to one phoneme, one word, or one phrase (keyword). Specifically, the modeling units of the wake-up word are determined, a decoding network is constructed from these modeling units, and tracking information is added to each modeling unit. The modeling unit is not limited to a phoneme; it may also be a word, the wake-up phrase, or the like. The decoding network may be built with each phoneme of the wake-up word as a modeling unit, or with each word of the wake-up word as a modeling unit, with tracking information added to the modeling units.
Taking the wake-up word "xiao du xiao du" as an example and using the phonemes "x, iao, d, u" as modeling units, the decoding network shown in fig. 3 is constructed. The vertical axis of the figure is the structure of the decoding network, which contains the state corresponding to each modeling unit together with a start state (s) and an end state (e); gbg and sil correspond to the non-keyword phoneme and the silence phoneme, respectively, and x, iao, d, u, x, iao, d, u correspond to the phonemes of the keyword "xiao du xiao du". The unfolding of the keyword phonemes along the time dimension is indicated by the arrows in fig. 3, which point in the direction in which the detected speech-signal decoding path develops over time.
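The sketch below builds such a linear decoding network in Python for the phoneme sequence of "xiao du xiao du". The state dictionaries, the gbg/sil filler states, and the tracking field are illustrative assumptions about one possible data structure, not the patent's exact graph.

```python
# A minimal sketch of a linear decoding network for the wake-up word
# "xiao du xiao du" (phonemes x, iao, d, u repeated). Each keyword state
# records which modeling unit it tracks (the "tracking information").
WAKE_PHONES = ["x", "iao", "d", "u", "x", "iao", "d", "u"]

def build_decoding_network(phones):
    """Return a list of states: start, filler/silence, the keyword phones, end."""
    states = [{"name": "s", "unit": None}]             # start state
    states += [{"name": "gbg", "unit": None},          # non-keyword filler state
               {"name": "sil", "unit": None}]          # silence state
    for i, p in enumerate(phones):
        states.append({"name": f"{p}_{i}", "unit": p})  # keyword phone state
    states.append({"name": "e", "unit": None})          # end state
    return states

network = build_decoding_network(WAKE_PHONES)
print([st["name"] for st in network])
```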
Specifically, forward inference is performed in the decoding network to carry out the path search. The forward inference of the decoding network takes the per-frame posterior probabilities of the first acoustic model as input data; these posteriors may be output by a conventional traditional acoustic-model algorithm or by a neural-network-based acoustic model.
During forward inference, the decoding network performs keyword detection on the input speech. When matching keyword information is detected, the current decoding path length is computed and the duration of the corresponding phoneme is obtained; this continues until all keywords have been detected, so that the length of the decoding path and the duration of each phoneme are obtained.
During the decoding path search, modeling units whose durations are not within the regular duration interval are penalized. For example, if the regular duration interval of a certain phoneme is 2 to 5 frames and the duration of that phoneme determined by the decoding path for the current wake-up word speech is 6 frames, which is outside the regular duration interval, the phoneme is penalized. Penalizing modeling units in the decoding path whose durations fall outside the regular duration interval lowers the decoding path score and filters out wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.
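A minimal sketch of this duration check is shown below; the interval values and the flat penalty size are invented for the example (the actual penalty in this application is the penalty-function term described later), but the idea of lowering the score of paths with out-of-interval durations is the same.

```python
# Toy duration check in the spirit of the example above: durations outside each
# phoneme's regular interval lower the path score. All numbers are illustrative.
REGULAR_INTERVALS = {"x": (2, 5), "iao": (3, 8), "d": (2, 5), "u": (3, 8)}
PENALTY = 4.0   # log-domain score subtracted per out-of-interval unit (assumed value)

def penalize_path(path_score, phone_durations):
    """phone_durations: list of (phoneme, duration_in_frames) along the path."""
    for phone, dur in phone_durations:
        lo, hi = REGULAR_INTERVALS[phone]
        if not (lo <= dur <= hi):
            path_score -= PENALTY        # push abnormal paths below the wake threshold
    return path_score

print(penalize_path(10.0, [("x", 3), ("iao", 6), ("d", 6), ("u", 4)]))  # "d" is penalized
```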
Step 208: waking up the electronic device if the decoding path score is greater than the set threshold.
For the wake-up word speech, the decoding path is penalized according to the penalty function of each modeling unit, that is, modeling units in the decoding path whose durations are not within the regular duration interval are penalized, which lowers the decoding path score. If the penalized decoding path score is still greater than the set threshold, the electronic device is woken up.
Specifically, the score of each decoding path is computed with the decoding network and compared with a set threshold T; if the score is greater than T, voice wake-up is executed, otherwise wake-up is not executed.
In the voice wake-up method above, the posterior probability of each frame of the wake-up word obtained by the first acoustic model is used as the input of the decoding network, a decoding path is searched in the decoding network, and modeling units whose durations are not within the regular duration interval are penalized. This lowers the decoding path score and filters out wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.
In another embodiment, the voice wake-up method further comprises: performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result, the penalty function penalizing modeling units whose durations are not within the regular duration interval.
Specifically, the regular duration of a modeling unit refers to the number of frames that the modeling unit, for example a phoneme, lasts under normal conditions. The statistics of the regular duration of each modeling unit are obtained by counting over a wake-up word data set. For example, if for a phoneme X one utterance yields 3 consecutive frames and another yields 5 consecutive frames, statistics may show that the regular duration is 3 to 5 frames; a penalty function is then constructed, and during recognition of the wake-up word, if the number of consecutive frames of this phoneme is less than 3 or greater than 5, the penalty function of the phoneme is applied. Analyzing the frequency statistics of the regular duration of each modeling unit gives the number of consecutive frames that each modeling unit actually lasts under normal conditions.
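One possible way to collect these duration statistics from frame-level alignments is sketched below; the run-length counting over per-frame labels is an assumption about how the alignment result is represented.

```python
from collections import defaultdict
from itertools import groupby

def duration_counts(aligned_paths):
    """aligned_paths: iterable of frame-level label sequences, e.g.
    ["sil","x","x","iao","iao","iao","d","u","u", ...] per utterance.
    Returns {phoneme: {duration_in_frames: count}} for the histogram step."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in aligned_paths:
        for phone, run in groupby(path):           # consecutive identical labels = one unit
            counts[phone][sum(1 for _ in run)] += 1
    return counts

# Two toy aligned utterances (illustrative labels, not real alignment output).
stats = duration_counts([
    ["sil", "x", "x", "iao", "iao", "iao", "d", "d", "u", "u", "u"],
    ["sil", "x", "x", "x", "iao", "iao", "d", "d", "u", "u"],
])
print(dict(stats["x"]))   # e.g. {2: 1, 3: 1}
```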
Specifically, frequency statistics are performed on the regular duration of each modeling unit of the wake-up word, and the penalty function of each modeling unit is constructed according to the statistical result. As shown in fig. 4, this includes the following steps:
s402, the awakening word data sets are aligned through the second acoustic model, and frequency statistics is carried out on the duration of each modeling unit based on the alignment result.
The second acoustic model is a large full-phoneme acoustic model. Alignment puts the wake-up word data set under the same evaluation standard: specifically, the second acoustic model is used for alignment, and the duration of each modeling unit in every utterance of the data set is analyzed by the same standard, ensuring the accuracy of the data.
Specifically, the wake-up word data set is aligned through the second acoustic model to obtain the decoding path corresponding to each wake-up word utterance in the data set; frequency statistics of the duration of each modeling unit are then performed according to the decoding path of each wake-up word utterance.
In this embodiment, the alignment uses a full-phoneme network, which decodes the wake-up word speech of the first acoustic model's training set to obtain its decoding paths; these decoding paths are the alignment result.
Histogram statistics can be performed on the alignment result to obtain the frequency statistics of the duration of each modeling unit. Specifically, the histogram statistics count, in the alignment result, the duration frequencies of each modeling unit of the keyword decoding network, and the statistical result is normalized.
S404: performing curve fitting on the statistical result to obtain the distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve.
Specifically, gamma curve fitting is adopted here, but the fitting is not limited to a gamma curve; Gaussian fitting, polynomial fitting, and the like may also be used. The gamma curve function is:
f(x; α, β) = x^(α-1) · e^(-x/β) / (β^α · Γ(α)), for x > 0
where x is the sample data (duration) input to the gamma function, and the parameters α and β are the shape parameter and the scale parameter, respectively. The graph in fig. 5 is the result of fitting a gamma curve to the histogram of one modeling unit (one phoneme; each phoneme has its own thresholds): the horizontal axis is the number of frames in the decoding path, and the vertical axis is the corresponding gamma value. The dashed lines on both sides of the gamma curve are the two thresholds (the left dashed line is t1, the right dashed line is t2), and the range between the two thresholds is the regular duration interval of the modeling unit.
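A sketch of the fitting step using SciPy is given below; mapping SciPy's (shape, scale) parameters to the α and β above, and fitting directly to the duration samples rather than to the normalized histogram, are simplifying assumptions.

```python
import numpy as np
from scipy.stats import gamma

# Duration samples (in frames) for one modeling unit, e.g. taken from the
# histogram above; the numbers here are illustrative.
durations = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 5, 4, 6, 5])

# Fit a gamma distribution; fixing loc=0 keeps the two-parameter shape/scale
# form used in the text (alpha = shape, beta = scale).
alpha, loc, scale = gamma.fit(durations, floc=0)
print(f"shape alpha = {alpha:.2f}, scale beta = {scale:.2f}")

# The fitted density can then be compared against the normalized histogram.
frames = np.arange(1, 13)
print(np.round(gamma.pdf(frames, alpha, loc=loc, scale=scale), 3))
```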
S406: constructing the penalty function according to the distribution curve, the penalty function penalizing modeling units whose durations are not within the regular duration interval.
Specifically, the two thresholds of the penalty term are determined from the distribution curve by an area ratio: the five percent of the area outside the thresholds is penalized, while the remaining ninety-five percent is left unchanged. The thresholds correspond to the ends of the regular duration interval, and modeling units whose durations are not within the regular duration interval are penalized.
For example, the penalty function is:
f(t), a piecewise function of the duration t defined over the two thresholds t1 and t2 (the exact formula is reproduced only as an image in the original publication),
where t1 and t2 are the two thresholds of the penalty term. Note that the two thresholds are adaptively adjusted by an area ratio, for example penalizing the five percent of the area of the gamma curve that lies outside the thresholds while the remaining ninety-five percent stays unchanged.
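One way to derive t1 and t2 from the fitted curve by an area ratio is sketched below; splitting the penalized five percent evenly between the two tails is an assumption, since the text only specifies the overall 5%/95% split.

```python
from scipy.stats import gamma

def regular_interval(alpha, scale, penalized_area=0.05):
    """Choose t1 and t2 so that the two tails outside them together contain
    `penalized_area` of the fitted gamma's probability mass (split evenly here,
    which is an assumption; the text only fixes the overall 5%/95% ratio)."""
    t1 = gamma.ppf(penalized_area / 2.0, alpha, scale=scale)
    t2 = gamma.ppf(1.0 - penalized_area / 2.0, alpha, scale=scale)
    return t1, t2

t1, t2 = regular_interval(alpha=6.0, scale=0.9)   # illustrative fitted parameters
print(round(t1, 1), round(t2, 1))                 # lower and upper duration thresholds (in frames)
```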
Further, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to the penalty function of each modeling unit to obtain the decoding path score includes: searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score for each decoding path; and obtaining the decoding path score from the penalty function of each modeling unit and the first score, the penalty function penalizing modeling units whose durations are not within the regular duration interval.
Specifically, the start position and boundaries of a decoding path are obtained as follows. The per-frame posterior probabilities of the acoustic model are used as the input data of the decoding network, the score of each decoding path is computed, and during the forward inference of the network it is judged whether the current frame has reached the end state; when the end state is reached, the decoding path search continues with the next phoneme. Tracking is added for each modeling unit, so that the decoding path of each modeling unit at the current frame of the network can be obtained. For the decoding paths of the modeling units at the current frame, the start position and boundary of each modeling unit (each phoneme) are computed to obtain the path length and hence the duration of each modeling unit; from the duration of each phoneme, used as the input of the corresponding penalty function, the length of each decoding path and the start position within the corresponding path are obtained, giving the boundaries of the modeling units. In the forward inference of the decoding network, the score of each modeling unit at the current frame is the score of the corresponding decoding path, and whether to wake up is determined from the decoding path score output by the modeling unit at the end of the keyword; the decoding path scores also include the penalty score designed for each modeling unit.
Taking the decoding path (the dashed-line portion) in fig. 3 as an example, the start position and boundaries of the decoding path are obtained. The decoding path is (sil, x, iao, iao, d, u, u, x, x, iao, iao, iao, d, d, u, u); from the decoded result, the start value and boundary value corresponding to each phoneme can be computed, giving the length of each modeling unit of the decoding path, that is, the duration of each phoneme.
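A small helper that recovers these per-unit start positions, boundaries, and durations from such a frame-level path might look as follows (the tuple representation of the path is taken from the example above).

```python
from itertools import groupby

def unit_boundaries(path):
    """Turn a frame-level decoding path into (unit, start_frame, end_frame, duration)
    tuples, which is the information needed to apply each unit's penalty function."""
    out, frame = [], 0
    for unit, run in groupby(path):
        length = sum(1 for _ in run)
        out.append((unit, frame, frame + length - 1, length))
        frame += length
    return out

# The dashed-line path from the figure, written out as per-frame labels.
path = ("sil", "x", "iao", "iao", "d", "u", "u", "x", "x",
        "iao", "iao", "iao", "d", "d", "u", "u")
for unit, start, end, dur in unit_boundaries(path):
    print(unit, start, end, dur)
```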
The decoding network score is composed of two parts: one part is the score of the decoding path itself, and the other part is the score computed from the decoding path and the penalty function; the final score is the sum of the two parts.
Let the score of the decoding path be st, where st has already been converted to the log domain. The total score of the final decoding path is then:
g(t)=st+λ*log(f(t))
where λ is a penalty factor controlling the size of the penalty term, and λ*log(f(t)) is the penalty term.
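The sketch below combines the path score with the penalty term in this way. Because the exact piecewise form of f(t) is only given as an image in the original publication, the stand-in used here returns 1 inside the regular interval [t1, t2] and decays with the fitted gamma density outside it; that choice, and the value of λ, are assumptions.

```python
import numpy as np
from scipy.stats import gamma

def penalty_f(t, a, scale, t1, t2):
    """Stand-in for the piecewise penalty function f(t): f = 1 (no penalty) inside
    the regular duration interval [t1, t2]; outside it, f falls off with the fitted
    gamma density, so log f(t) becomes negative. The exact form is an assumption."""
    if t1 <= t <= t2:
        return 1.0
    edge = t1 if t < t1 else t2
    return max(gamma.pdf(t, a, scale=scale) / gamma.pdf(edge, a, scale=scale), 1e-10)

def total_path_score(st, durations, fitted, intervals, lam=0.5):
    """g(t) = st + lambda * sum(log f(t)) over the units on the path;
    st is the decoding-path score already converted to the log domain."""
    g = st
    for unit, dur in durations.items():
        a, scale = fitted[unit]
        t1, t2 = intervals[unit]
        g += lam * np.log(penalty_f(dur, a, scale, t1, t2))
    return g

# Illustrative per-unit gamma parameters and regular intervals (in frames).
fitted = {"x": (5.0, 0.8), "iao": (7.0, 0.9), "d": (4.0, 0.7), "u": (6.0, 0.9)}
intervals = {"x": (2, 7), "iao": (3, 11), "d": (1, 6), "u": (2, 9)}
print(total_path_score(12.0, {"x": 3, "iao": 6, "d": 8, "u": 5}, fitted, intervals))
```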
according to the same mode, the phoneme (x) or (iao) is replaced by the constraint punishment of the word (xiao), the word (xiao du) and the keyword (xiao du xiao du), so that the multi-layer score evaluation is realized, the pseudo awakening can be eliminated to a great extent, and the voice awakening accuracy is improved.
In this method, the acoustic model is trained as a large-scale network; during training, the data set containing the keywords is decoded and aligned, histogram statistics are then computed from the alignment result, and a gamma curve is fitted to the histogram statistics to obtain the decoding penalty function. The decoding penalty function is used to penalize the score of the decoding network, reducing the probability of false wake-ups caused by reduplicated keywords or speech-rate variation and improving the accuracy of voice wake-up.
It should be understood that although the steps in the flowcharts of fig. 2 and 4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, there is no strict restriction on the order in which these steps are executed, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a voice wake-up apparatus, including:
A voice acquisition module 602, configured to acquire wake-up word speech.
A recognition module 604, configured to obtain the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model.
A decoding module 606, configured to search for a decoding path in a decoding network of the wake-up word by using the posterior probabilities and penalize the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
A wake-up module 608, configured to wake up the electronic device if the decoding path score is greater than a set threshold.
With the voice wake-up apparatus above, the posterior probability of each frame of the wake-up word obtained by the first acoustic model is used as the input of the decoding network, a decoding path is searched in the decoding network, and modeling units whose durations are not within the regular duration interval are penalized. This lowers the decoding path score and filters out wake-up word speech from abnormal conditions, for example speech that is too slow or too fast and does not match a normal wake-up scene, thereby improving wake-up accuracy.
In another embodiment, the voice wake-up apparatus further comprises:
a construction module, used for performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
The construction module includes:
an alignment module, used for aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result;
a curve fitting module, used for performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit and determining the regular duration interval of the modeling unit from the distribution curve;
and a penalty function construction module, used for constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, the alignment module is configured to align the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set, and to perform frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In another embodiment, the voice wake-up apparatus further comprises:
a modeling unit determining module, used for determining the modeling units of the wake-up word;
and a decoding network construction module, used for constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In another embodiment, the decoding module is configured to search for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path, and to obtain the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
For specific limitations of the voice wake-up apparatus, reference may be made to the limitations of the voice wake-up method above, which are not repeated here. Each module in the voice wake-up apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, a communication interface, a display screen, and a microphone connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, an operator network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a voice wake-up method. The display screen of the computer device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, a key, trackball, or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result includes:
aligning the wake-up word data set through the second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result includes:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In another embodiment, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score includes:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result includes:
aligning the wake-up word data set through the second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
In another embodiment, aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result includes:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
In another embodiment, searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score includes:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as there is no contradiction between the combined technical features, the combinations should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that several variations and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A voice wake-up method, the method comprising:
acquiring wake-up word speech;
obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities, and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and waking up the electronic device if the decoding path score is greater than a set threshold.
2. The method of claim 1, further comprising: performing frequency statistics on the regular duration of each modeling unit of the wake-up word, and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
3. The method according to claim 1, wherein performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result comprises:
aligning the wake-up word data set through the second acoustic model, and performing frequency statistics on the duration of each modeling unit based on the alignment result;
performing curve fitting on the statistical result to obtain a distribution curve of the modeling unit, and determining the regular duration interval of the modeling unit from the distribution curve;
and constructing the penalty function according to the distribution curve, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
4. The method of claim 3, wherein aligning the wake-up word data set through the second acoustic model and performing frequency statistics on the duration of each modeling unit based on the alignment result comprises:
aligning the wake-up word data set through the second acoustic model to obtain the decoding path corresponding to each wake-up word in the data set;
and performing frequency statistics on the duration of each modeling unit according to the decoding path of each wake-up word in the data set.
5. The method of claim 1, further comprising:
determining the modeling units of the wake-up word;
and constructing a decoding network according to the modeling units, wherein tracking information is added to the modeling units.
6. The method of claim 1, wherein searching for a decoding path in the decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score comprises:
searching for decoding paths in the decoding network of the wake-up word by using the posterior probabilities to obtain a first score of each decoding path;
and obtaining the decoding path score according to the penalty function of each modeling unit and the first score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
7. A voice wake-up apparatus, the apparatus comprising:
the voice acquisition module is used for acquiring wake-up word speech;
the recognition module is used for obtaining the posterior probability of each frame of the wake-up word through a pre-trained first acoustic model;
the decoding module is used for searching for a decoding path in a decoding network of the wake-up word by using the posterior probabilities and penalizing the decoding path according to a penalty function of each modeling unit to obtain a decoding path score, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval;
and the wake-up module is used for waking up the electronic device if the decoding path score is greater than a set threshold.
8. The apparatus of claim 7, further comprising:
a construction module, used for performing frequency statistics on the regular duration of each modeling unit of the wake-up word and constructing a penalty function for each modeling unit according to the statistical result, wherein the penalty function penalizes modeling units whose durations are not within the regular duration interval.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011599330.9A 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium Active CN112652306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011599330.9A CN112652306B (en) 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011599330.9A CN112652306B (en) 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112652306A true CN112652306A (en) 2021-04-13
CN112652306B CN112652306B (en) 2023-10-03

Family

ID=75364011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011599330.9A Active CN112652306B (en) 2020-12-29 2020-12-29 Voice wakeup method, voice wakeup device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112652306B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327597A (en) * 2021-06-23 2021-08-31 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN114333799A (en) * 2022-03-09 2022-04-12 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
US20150279351A1 (en) * 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
CN105374352A (en) * 2014-08-22 2016-03-02 中国科学院声学研究所 Voice activation method and system
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN110910885A (en) * 2019-12-12 2020-03-24 苏州思必驰信息科技有限公司 Voice awakening method and device based on decoding network
CN111402895A (en) * 2020-06-08 2020-07-10 腾讯科技(深圳)有限公司 Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium
CN111462751A (en) * 2020-03-27 2020-07-28 京东数字科技控股有限公司 Method, apparatus, computer device and storage medium for decoding voice data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
US20150279351A1 (en) * 2012-12-19 2015-10-01 Google Inc. Keyword detection based on acoustic alignment
CN105374352A (en) * 2014-08-22 2016-03-02 中国科学院声学研究所 Voice activation method and system
CN107871499A (en) * 2017-10-27 2018-04-03 珠海市杰理科技股份有限公司 Audio recognition method, system, computer equipment and computer-readable recording medium
CN110910885A (en) * 2019-12-12 2020-03-24 苏州思必驰信息科技有限公司 Voice awakening method and device based on decoding network
CN111462751A (en) * 2020-03-27 2020-07-28 京东数字科技控股有限公司 Method, apparatus, computer device and storage medium for decoding voice data
CN111402895A (en) * 2020-06-08 2020-07-10 腾讯科技(深圳)有限公司 Voice processing method, voice evaluating method, voice processing device, voice evaluating device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
C.-H. LEE 等: "A frame-synchronous network search algorithm for connected word recognition", 《IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING 》 *
L. TEN BOSCH 等: "Acoustic Scores and Symbolic Mismatch Penalties in Phone Lattices", 《 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING PROCEEDINGS》 *
LOUIS TEN BOSCH,等: "ASR Decoding in a Computational Model of Human Word Recognition", 《PROCEEDINGS OF INTERSPEECH 2005》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327597A (en) * 2021-06-23 2021-08-31 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113327597B (en) * 2021-06-23 2023-08-22 网易(杭州)网络有限公司 Speech recognition method, medium, device and computing equipment
CN113707132A (en) * 2021-09-08 2021-11-26 北京声智科技有限公司 Awakening method and electronic equipment
CN113707132B (en) * 2021-09-08 2024-03-01 北京声智科技有限公司 Awakening method and electronic equipment
CN114333799A (en) * 2022-03-09 2022-04-12 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment
CN114333799B (en) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 Detection method and device for phase-to-phase sound misidentification and computer equipment

Also Published As

Publication number Publication date
CN112652306B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN112652306B (en) Voice wakeup method, voice wakeup device, computer equipment and storage medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
Myer et al. Efficient keyword spotting using time delay neural networks
CN106940998A (en) A kind of execution method and device of setting operation
CN111833866A (en) Method and system for high accuracy key phrase detection for low resource devices
CN110070859B (en) Voice recognition method and device
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN111312222A (en) Awakening and voice recognition model training method and device
CN113450771B (en) Awakening method, model training method and device
CN112967739B (en) Voice endpoint detection method and system based on long-term and short-term memory network
CN111462756A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN112825250A (en) Voice wake-up method, apparatus, storage medium and program product
CN114360510A (en) Voice recognition method and related device
CN111785302A (en) Speaker separation method and device and electronic equipment
CN116978370A (en) Speech processing method, device, computer equipment and storage medium
CN112509556B (en) Voice awakening method and device
CN113851113A (en) Model training method and device and voice awakening method and device
CN110537223B (en) Voice detection method and device
CN112216286B (en) Voice wakeup recognition method and device, electronic equipment and storage medium
CN113658593B (en) Wake-up realization method and device based on voice recognition
CN112289311B (en) Voice wakeup method and device, electronic equipment and storage medium
CN113129874B (en) Voice awakening method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province

Applicant after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province

Applicant before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant