CN113593560B - Customizable low-delay command word recognition method and device - Google Patents


Info

Publication number
CN113593560B
CN113593560B (application CN202110865579.8A)
Authority
CN
China
Prior art keywords
command word
modeling unit
neural network
voice data
posterior probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110865579.8A
Other languages
Chinese (zh)
Other versions
CN113593560A (en)
Inventor
司玉景
李全忠
何国涛
蒲瑶
Current Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Original Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority to CN202110865579.8A
Publication of CN113593560A
Application granted
Publication of CN113593560B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The invention relates to a customizable low-latency command word recognition method and device, comprising: acquiring the speech to be recognized and determining the acoustic features to be processed from it; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin; calculating, according to the posterior probabilities, the confidence of each command word and the time points at which the modeling units included in the command word appear; and deciding whether to output the command word according to the confidence and the time points. The invention models all toned pinyin in Chinese and adopts a simple, efficient scoring mechanism to accomplish low-latency recognition of a command word list, reducing the development and time cost of command word recognition. The confidence calculation method adopted by the invention has extremely low computational and space complexity, high accuracy, and a low false wake-up rate, and can detect in real time whether a command word appears.

Description

Customizable low-delay command word recognition method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a customizable low-delay command word recognition method and device.
Background
In recent years, with the continuous development of information technology and the Internet of Things, speech, as the most direct and convenient means of human-machine interaction, has attracted growing attention. Command word recognition is an important branch of speech recognition and is widely used in voice command-and-control systems. The task of a low-latency command word recognition system is to automatically find and locate a number of pre-specified command words in a continuous stream of speech, in real time: the system must give the corresponding result immediately once a command word appears. However, unlike a conventional text document, voice data is an encoding of an acoustic signal, a form from which a computer cannot directly extract the effective information. In addition, owing to a variety of extrinsic factors (e.g., background noise, speaking rate, accent), developing an effective command word recognition system is complicated and difficult.
In the related art, command word recognition systems can be divided into customizable and non-customizable systems according to whether the command words can be customized. Customizability means that the command word detection model does not depend on the command words specified by the user, so that when the user modifies the command word list, the model does not need to be retrained. In a non-customizable system, by contrast, the command word list is tied to the model: when the user wants to modify the list, recordings and labels for the new command words must be collected and the model retrained, which significantly increases time and development cost. Existing command word recognition techniques include dynamic time warping (DTW), hidden-Markov-model (HMM) based methods, and deep-learning-based methods. The keyword/filler framework based on HMM+DNN can achieve customizable command words, but its accuracy is inferior to purely deep-learning-based methods, its decoding is computationally complex, and its memory footprint is large.
Disclosure of Invention
In view of the above, the present invention aims to overcome the shortcomings of the prior art by providing a customizable low-latency command word recognition method and device, so as to solve the problems of poor recognition accuracy, high decoding complexity, and large memory footprint in existing customizable systems.
In order to achieve the above purpose, the invention adopts the following technical scheme: a customizable low-latency command word recognition method comprising:
acquiring voice to be recognized, and determining acoustic characteristics to be processed according to the voice to be recognized;
inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling units are toned pinyin, comprising initials, finals and tones;
calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
and judging whether to output the command word according to the confidence degree and the time point.
Further, the method further comprises the following steps: building a neural network classification model, the building the neural network classification model comprising:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, so as to generate a neural network classification model.
Further, the calculation formula adopted for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

t_i = argmax_{h_max ≤ k ≤ t} p_ik (1)

f(t) = ( ∏_{i=1}^{n} p_{i,t_i} )^{1/n} (2)

wherein p_ik denotes the posterior probability of the i-th modeling unit at time point k; h_max = t − window_size denotes the start point of command word detection; window_size denotes the time window for command word detection, taken as the average duration of the command words; t_i denotes the time point with the maximum posterior probability for the i-th modeling unit within the detection window; f(t) denotes the confidence; and n denotes the number of modeling units included in the command word.
Further, the determining whether to output the command word according to the confidence level and the time point includes:
comparing the confidence coefficient with a preset threshold value;
if the confidence of the command word is greater than or equal to the preset threshold, and the time points at which the modeling units included in the command word appear satisfy the preset time condition, outputting the command word.
Further, if the confidences of a plurality of command words are greater than or equal to the preset threshold and the time points at which their modeling units appear satisfy the preset time condition, outputting the command word with the highest confidence.
Further, before labeling the corresponding modeling unit for the voice data, the method further includes:
and modeling the tone pinyin corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.
Further, before determining the acoustic feature to be processed according to the voice to be recognized, the method further includes:
and carrying out noise reduction treatment on the voice to be recognized.
Further, the neural network classification model is a deep feed-forward sequence memory neural network.
The embodiment of the application provides a customizable low-delay command word recognition device, which comprises:
the acquisition module is used for acquiring the voice to be recognized and determining the acoustic characteristics to be processed according to the voice to be recognized;
the identification module is used for inputting the acoustic features into a pre-constructed neural network classification model for identification and obtaining posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling unit is tone pinyin and comprises initials, finals and tones;
the calculation module is used for calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
and the output module is used for judging whether to output the command word according to the confidence degree and the time point.
Further, the method further comprises the following steps:
the building module is used for building a neural network classification model;
the constructing the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling unit corresponding to the voice data, so as to generate a neural network classification model.
By adopting the technical scheme, the invention has the following beneficial effects:
the invention provides a customizable low-delay command word recognition method and device, which adopts a neural network model to model all the toned pinyin in Chinese, combines the advantage of large posterior probability differentiation of the output of a connection time sequence classification criterion, adopts a simple and efficient scoring mechanism, and provides the customizable low-delay command word recognition method. The method models pinyin information of voice signals by using a deep feed-forward sequence memory neural network (DFSMN) and a connection time sequence classification criterion (CTC), trains models by utilizing massive voice data, and recognizes command word lists by adopting the trained models. In addition, the invention adopts a simple and efficient scoring mechanism to complete the task of identifying the low-delay command word list. Aiming at the demand of changing command words, the invention does not need to retrain a model, only needs to provide the tone spelling information corresponding to the command words, and greatly reduces the development cost and the time cost of the command word recognition system.
The confidence calculation method provided by the invention is designed for the CTC model: it has extremely low computational and space complexity, high accuracy, a low false wake-up rate, and low latency, and can detect in real time whether a command word appears.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram illustrating steps of a customizable low-latency command word recognition method of the present invention;
FIG. 2 is a flow chart of a customizable low latency command word recognition method of the present invention;
FIG. 3 is a schematic diagram of a customizable low latency command word recognition device of the present invention;
FIG. 4 is a schematic diagram of the computer device, in a hardware operating environment, for the customizable low-latency command word recognition method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below. It will be apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the invention as defined by the claims.
A specific customizable low-latency command word recognition method and apparatus provided in embodiments of the present application are described below with reference to the accompanying drawings.
As shown in fig. 1, the customizable low-latency command word recognition method provided in the embodiment of the present application includes:
s101, acquiring voice to be recognized, and determining acoustic characteristics to be processed according to the voice to be recognized;
it will be appreciated that speech is uttered by the user; for example, a user may say that the intelligent household refrigerator is "opened", and then "opened" is the voice to be recognized, in this application, the voice to be recognized needs to be processed to obtain the acoustic feature of the voice to be recognized, where the processing manner may be implemented by using the existing technology, for example, the voice to be recognized is preprocessed, windowed, FFT transformed, mel filter, and other steps are performed to extract the acoustic feature of the voice to be recognized. Wherein the preprocessing may be acoustic denoising.
S102, inputting the acoustic features into a pre-constructed neural network classification model for recognition, and acquiring posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling unit is tone pinyin and comprises initials, finals and tones;
the method comprises the steps of training a neural network classification model in advance, and then inputting the obtained acoustic features into the neural network classification model for calculation to obtain posterior probability of a modeling unit (toned pinyin). The posterior probability is one of the basic concepts of the information theory. In a communication system, after a certain message is received, the probability that the message is transmitted is known to the receiving end as a posterior probability.
S103, calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
the confidence coefficient calculating method is provided for the neural network classification model, so that the calculating complexity and the space complexity are reduced, and the accuracy is high.
And S104, judging whether to output the command word according to the confidence degree and the time point.
Finally, a decision is made according to the confidence and the time points to output the command word, for example outputting "open the refrigerator".
The working principle of the customizable low-latency command word recognition method is as follows: referring to fig. 2, a neural network classification model is first constructed; the speech to be recognized is then acquired, and the acoustic features to be processed are determined from it; the acoustic features are input into the pre-constructed neural network classification model for recognition, obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin comprising initials, finals and tones; the confidence corresponding to each command word and the time points at which its modeling units appear are calculated from the posterior probabilities; and whether to output the command word is decided according to the confidence and the time points. The neural network classification model is a deep feed-forward sequence memory neural network (DFSMN); it is understood that models such as LSTM or GRU may also be adopted, which is not limited herein. Because the modeling units are constructed from initials, finals and tones, better Chinese recognition is obtained, improving recognition accuracy and reducing recognition errors.
In some embodiments, further comprising: building a neural network classification model, the building the neural network classification model comprising:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, so as to generate a neural network classification model.
Specifically, the neural network classification model is built in advance as follows: collect and label speech data; preprocess the collected speech data and apply windowing, FFT, mel filtering and similar steps to extract the acoustic features for model training; input the acoustic features corresponding to the speech data into the neural network for training, obtaining the posterior probability of each modeling unit to which the acoustic features belong, with the labeled toned pinyin as the output targets; and iteratively train on the acoustic features using the connectionist temporal classification (CTC) loss function. Under massive data, training of the model parameters is completed by deep learning, yielding a usable deep feed-forward sequence memory neural network classification model.
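The CTC criterion aligns frame-level network outputs to the pinyin label sequence via an optional blank symbol; a frame-level path is mapped back to labels by merging consecutive repeats and then dropping blanks. A minimal sketch of that collapsing rule follows; the blank id 0 and the integer labels are illustrative assumptions.

```python
def ctc_collapse(frame_labels, blank=0):
    """Collapse a frame-level CTC path into a label sequence:
    first merge consecutive repeats, then drop blank symbols."""
    merged = []
    for lab in frame_labels:
        if not merged or lab != merged[-1]:
            merged.append(lab)
    return [lab for lab in merged if lab != blank]

# e.g. frame path [da3 da3 _ kai1 kai1 _] collapses to [da3 kai1]
path = [1, 1, 0, 2, 2, 0]
print(ctc_collapse(path))  # -> [1, 2]
```

The blank symbol is what lets CTC represent repeated labels: `[3, _, 3]` collapses to two 3s, while `[3, 3]` collapses to one.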
In some embodiments, the application calculates the confidence corresponding to each command word, i.e. the probability that the command word appears, from the posterior probabilities of the toned pinyin computed by the neural network classification model, assuming the command word includes n toned pinyin. (Because the CTC model is robust, the invention uses the raw posterior probabilities and omits the posterior smoothing step.)
The calculation formula adopted for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

t_i = argmax_{h_max ≤ k ≤ t} p_ik (1)

f(t) = ( ∏_{i=1}^{n} p_{i,t_i} )^{1/n} (2)

wherein p_ik denotes the posterior probability of the i-th modeling unit at time point k; h_max = t − window_size denotes the start point of command word detection; window_size denotes the time window for command word detection, taken as the average duration of the command words; t_i denotes the time point with the maximum posterior probability for the i-th modeling unit within the detection window; f(t) denotes the confidence; and n denotes the number of modeling units included in the command word.
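As a sketch, the confidence described here, the geometric mean of each modeling unit's maximum posterior inside the detection window, together with the peak time points t_i, can be computed as follows. The posterior-matrix layout and argument names are assumptions for illustration, not taken from the patent.

```python
def command_confidence(posteriors, unit_ids, t, window_size):
    """Confidence f(t) for one command word at frame t.

    posteriors: posteriors[k][i] is the posterior of modeling unit i
                at frame k
    unit_ids:   modeling-unit indices of the command word, in order
    Returns (f_t, time_points), where time_points[i] is the frame
    with the maximum posterior for the i-th unit in the window."""
    h_max = max(0, t - window_size)  # start of the detection window
    peaks, time_points = [], []
    for i in unit_ids:
        k_best = max(range(h_max, t + 1), key=lambda k: posteriors[k][i])
        peaks.append(posteriors[k_best][i])
        time_points.append(k_best)
    f_t = 1.0
    for p in peaks:
        f_t *= p
    return f_t ** (1.0 / len(unit_ids)), time_points

# tiny synthetic example: two units, three frames
posteriors = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
f, times = command_confidence(posteriors, [0, 1], t=2, window_size=2)
```

Per frame this costs only one pass over the window per unit, which is why the space and computational complexity stay low.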
In some embodiments, the determining whether to output the command word according to the confidence level and the time point includes:
comparing the confidence coefficient with a preset threshold value;
if the confidence of the command word is greater than or equal to the preset threshold, and the time points at which the modeling units included in the command word appear satisfy the preset time condition, the command word is output.
Preferably, if the confidence coefficient of the plurality of command words is greater than or equal to a preset threshold value and a time preset condition is met, the command word with the highest confidence coefficient is output.
Specifically, when the confidence is greater than or equal to the preset threshold and the time-order condition is satisfied, the command word is output; if several command words satisfy these conditions simultaneously, the command word with the highest confidence is output. That is, for a command word to be detected at time point t, the following conditions must hold simultaneously:

f(t) ≥ threshold (3)

t_1 ≤ t_2 ≤ … ≤ t_n (4)

wherein threshold is the preset threshold, t_1 is the time point corresponding to the first modeling unit in the command word, t_2 the time point corresponding to the second modeling unit, and t_n the time point corresponding to the n-th modeling unit. For example, if the command word with the highest confidence is "open the refrigerator", it is first converted into the toned pinyin "da3 kai1 bing1 xiang1", where the time point corresponding to modeling unit "da3" is t_1, that corresponding to "kai1" is t_2, that corresponding to "bing1" is t_3, and that corresponding to "xiang1" is t_4; t_1 ≤ t_2 ≤ t_3 ≤ t_4 must then hold, and when the modeling units contained in "open the refrigerator" satisfy the preset time condition, the command "open the refrigerator" is output. If the confidences of several command words exceed the preset threshold and the time points of their modeling units satisfy the preset time condition, the command word with the maximum confidence among them is output.
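The decision rule of conditions (3) and (4), a threshold on the confidence, non-decreasing unit time points, and the highest-confidence winner among multiple hits, can be sketched as follows. The candidate-tuple layout is an assumption for illustration.

```python
def select_command(candidates, threshold):
    """Apply the decision rule: keep candidates whose confidence
    reaches the threshold and whose modeling-unit time points are
    non-decreasing (t_1 <= t_2 <= ... <= t_n); among the survivors,
    return the word with the highest confidence, or None.

    candidates: list of (word, confidence, time_points) tuples."""
    ok = [(word, conf) for word, conf, times in candidates
          if conf >= threshold
          and all(a <= b for a, b in zip(times, times[1:]))]
    if not ok:
        return None
    return max(ok, key=lambda wc: wc[1])[0]

cands = [
    ("da3 kai1 bing1 xiang1", 0.91, [10, 14, 19, 23]),  # monotonic
    ("guan1 bi4 bing1 xiang1", 0.95, [12, 9, 19, 23]),  # out of order
]
print(select_command(cands, 0.8))  # -> "da3 kai1 bing1 xiang1"
```

The second candidate is rejected despite its higher confidence because its unit time points violate condition (4).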
In some embodiments, before labeling the corresponding modeling unit for the voice data, the method further includes:
and modeling the tone pinyin corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.
The modeling units in the present application are toned pinyin, comprising initials, finals and tones, which improves recognition accuracy.
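A toned-pinyin syllable can be decomposed into initial, final and tone by longest-prefix matching against an initial inventory. This is a rough sketch; the exact inventory (including whether "y"/"w" count as initials) and the trailing-digit tone convention are assumptions, not mandated by the patent.

```python
# Two-letter initials must precede their one-letter prefixes so that
# longest-prefix matching picks "zh" before "z", etc.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_toned_pinyin(syllable):
    """Split a toned-pinyin syllable such as 'kai1' into
    (initial, final, tone). Zero-initial syllables like 'an4'
    yield an empty initial."""
    tone = syllable[-1]      # tone written as a trailing digit
    body = syllable[:-1]
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone

for s in "da3 kai1 bing1 xiang1".split():
    print(split_toned_pinyin(s))
```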
Preferably, before determining the acoustic feature to be processed according to the voice to be recognized, the method further includes:
and carrying out noise reduction treatment on the voice to be recognized.
Before acoustic features are extracted from the voice to be recognized, denoising processing is performed on the voice to be recognized, and noise interference is removed.
As shown in fig. 3, an embodiment of the present application provides a customizable low-latency command word recognition device, including:
the acquisition module 301 is configured to acquire a voice to be recognized, and determine an acoustic feature to be processed according to the voice to be recognized;
the identifying module 302 is configured to input the acoustic feature into a pre-constructed neural network classification model for identification, and obtain a posterior probability of each modeling unit to which the acoustic feature belongs; wherein the modeling unit is tone pinyin and comprises initials, finals and tones;
a calculating module 303, configured to calculate, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
and the output module 304 is configured to determine whether to output the command word according to the confidence level and the time point.
The working principle of the customizable low-latency command word recognition device provided in the application is as follows: the acquisition module 301 acquires the speech to be recognized and determines the acoustic features to be processed from it; the recognition module 302 inputs the acoustic features into a pre-constructed neural network classification model for recognition and obtains the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin comprising initials, finals and tones; the calculating module 303 calculates, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which its modeling units appear; and the output module 304 determines whether to output the command word according to the confidence and the time points.
Preferably, the customizable low-latency command word recognition device provided in the present application further includes:
the building module is used for building a neural network classification model;
the constructing the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling unit corresponding to the voice data, so as to generate a neural network classification model.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the customizable low-delay command word recognition method provided by any embodiment;
the processor is used to call and execute the computer program in the memory. The memory, which is an example of a computer-readable medium, may include volatile memory, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and stores the operating system. The computer program, when executed by the processor, causes the processor to perform the customizable low-latency command word recognition method. The architecture shown in fig. 4 is merely a block diagram of the portions relevant to the present application and does not limit the computer device to which the application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the customizable low-latency command word recognition method provided herein may be implemented in the form of a computer program that is executable on a computer device such as that shown in FIG. 4.
In some embodiments, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring the speech to be recognized, and determining the acoustic features to be processed from it; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin comprising initials, finals and tones; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear; and determining whether to output the command word according to the confidence and the time points.
The present application also provides a computer storage medium. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
In some embodiments, the present invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: acquiring speech to be recognized, and determining acoustic features to be processed according to the speech to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin units comprising initials, finals and tones; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points of the modeling units included in the command word; and judging whether to output the command word according to the confidence and the time points.
In summary, the invention provides a customizable low-delay command word recognition method and device: speech to be recognized is acquired, and the acoustic features to be processed are determined from it; the acoustic features are input into a pre-constructed neural network classification model for recognition, and the posterior probability of each modeling unit to which the acoustic features belong is obtained, wherein the modeling units are toned pinyin units comprising initials, finals and tones; the confidence corresponding to each command word and the time points of the modeling units included in the command word are calculated from the posterior probabilities; and whether to output the command word is judged according to the confidence and the time points. The invention models all toned pinyin in Chinese and adopts a simple, efficient scoring mechanism to recognize a low-delay, customizable command word list, greatly reducing the development and time cost of command word recognition. The confidence calculation method adopted by the invention has extremely low computational and space complexity, high accuracy, a low false wake-up rate and, in addition, low latency, so that it can detect in real time whether a command word appears.
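The scoring mechanism can be sketched as follows. This is a minimal illustration assuming the confidence is the geometric mean, over the command word's n modeling units, of each unit's maximum posterior probability inside the detection window, consistent with the variable definitions in the claims; the posterior values are hypothetical:

```python
import math

def command_confidence(posteriors, t, window_size):
    """Confidence f(t) for one command word at frame t.

    posteriors: one per-frame posterior sequence per modeling unit
    of the command word (posteriors[i][k] = p_ik). The confidence is
    the geometric mean, over the n modeling units, of each unit's
    maximum posterior inside the window [t - window_size, t].
    Also returns the per-unit peak times t_i used by the time check.
    """
    h_max = max(0, t - window_size)   # start of the detection window
    n = len(posteriors)               # number of modeling units
    peaks, peak_times = [], []
    for unit in posteriors:
        window = unit[h_max:t + 1]
        p = max(window)
        peaks.append(p)
        peak_times.append(h_max + window.index(p))  # t_i
    conf = math.prod(peaks) ** (1.0 / n)            # f(t)
    return conf, peak_times
```

For two modeling units with posteriors `[0.1, 0.9, 0.2]` and `[0.1, 0.2, 0.8]` at `t=2` with `window_size=2`, the confidence is the geometric mean of 0.9 and 0.8 with peak times at frames 1 and 2; the peak times then feed the time-condition check before a command word is output.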
It can be understood that the above-provided method embodiments correspond to the above-described apparatus embodiments, and corresponding specific details may be referred to each other and will not be described herein.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely illustrative of the present invention and does not limit it; any variations or substitutions that would readily occur to a person skilled in the art fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A customizable low-latency command word recognition method, comprising:
acquiring speech to be recognized, and determining acoustic features to be processed according to the speech to be recognized;
inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling units are toned pinyin units comprising initials, finals and tones;
calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points of the modeling units included in the command word;
judging whether to output the command word according to the confidence and the time points;
further comprises: building a neural network classification model, the building the neural network classification model comprising:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units labeled for the voice data, performing iterative training on the acoustic features corresponding to the voice data with a connectionist temporal classification (CTC) loss function, and generating the neural network classification model;
the calculation formula adopted for calculating the confidence coefficient corresponding to each command word according to the posterior probability is as follows:
wherein p is ik When the time point k is represented, the posterior probability corresponding to the ith modeling unit; h is a max =t-window_size represents the start point of command word detection; window_size represents a time window for command word detection, and the average duration of the command words is taken; t is t i Representing a time point with maximum posterior probability corresponding to an ith modeling unit in a command word detection time window; f (t) represents confidence; n represents the number of modeling units included in the command word;
the neural network classification model is a deep feedforward sequential memory network (DFSMN).
2. The method of claim 1, wherein said judging whether to output the command word according to the confidence and the time points comprises:
comparing the confidence with a preset threshold;
if the confidence of a command word is greater than or equal to the preset threshold, and the time points of the modeling units included in the command word meet a preset time condition, outputting the command word.
3. The method of claim 2, wherein,
if the confidences of a plurality of command words are greater than or equal to the preset threshold and the preset time condition is met, outputting the command word with the maximum confidence.
4. The method of claim 1, further comprising, prior to labeling the voice data with the corresponding modeling units:
modeling the toned pinyin corresponding to the voice data with initials, finals and tones to generate a plurality of modeling units.
5. The method of claim 1, further comprising, before determining the acoustic features to be processed according to the speech to be recognized:
performing noise reduction processing on the speech to be recognized.
6. A customizable low-latency command word recognition device, comprising:
the acquisition module is used for acquiring the speech to be recognized and determining the acoustic features to be processed according to the speech to be recognized;
the recognition module is used for inputting the acoustic features into a pre-constructed neural network classification model for recognition and obtaining the posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling units are toned pinyin units comprising initials, finals and tones;
the calculation module is used for calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points of the modeling units included in the command word;
the output module is used for judging whether to output the command word according to the confidence and the time points;
the building module is used for building a neural network classification model;
the constructing the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units labeled for the voice data, performing iterative training on the acoustic features corresponding to the voice data with a connectionist temporal classification (CTC) loss function, and generating the neural network classification model;
the calculation formula adopted for calculating the confidence corresponding to each command word according to the posterior probabilities is:

f(t) = ( ∏_{i=1}^{n} max_{h_max ≤ k ≤ t} p_ik )^(1/n)

wherein p_ik represents the posterior probability corresponding to the ith modeling unit at time point k; h_max = t − window_size represents the starting point of command word detection; window_size represents the time window for command word detection, taken as the average duration of the command words; t_i represents the time point at which the posterior probability corresponding to the ith modeling unit is maximal within the command word detection time window; f(t) represents the confidence; and n represents the number of modeling units included in the command word;
the neural network classification model is a deep feedforward sequential memory network (DFSMN).
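The modeling-unit construction recited above (decomposing each toned-pinyin syllable into an optional initial plus a tone-bearing final) can be sketched as follows. The initial inventory and the splitting rule are simplified assumptions for illustration, and the example command word is hypothetical:

```python
# Sketch of generating modeling units from toned pinyin: each
# syllable is split into an optional initial and a final carrying
# the tone digit. The initial list and the splitting rule are
# simplified assumptions, not the patent's exact inventory.

INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)  # match two-letter initials (zh/ch/sh) first

def to_units(toned_syllable):
    """'kai1' -> ['k', 'ai1']; 'an4' -> ['an4'] (zero-initial)."""
    body, tone = toned_syllable[:-1], toned_syllable[-1]
    for ini in INITIALS:
        if body.startswith(ini) and len(body) > len(ini):
            return [ini, body[len(ini):] + tone]
    return [body + tone]  # syllable with no initial

def command_to_units(command):
    """Decompose a space-separated toned-pinyin command word
    into its sequence of modeling units."""
    units = []
    for syllable in command.split():
        units.extend(to_units(syllable))
    return units
```

Under these assumptions, a hypothetical command "da3 kai1" yields the unit sequence d, a3, k, ai1, and each such unit becomes one output class of the neural network classification model.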
CN202110865579.8A 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device Active CN113593560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110865579.8A CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device


Publications (2)

Publication Number Publication Date
CN113593560A CN113593560A (en) 2021-11-02
CN113593560B true CN113593560B (en) 2024-04-16

Family

ID=78251985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110865579.8A Active CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device

Country Status (1)

Country Link
CN (1) CN113593560B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
KR20170119152A (en) * 2016-04-18 2017-10-26 한양대학교 산학협력단 Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
WO2020222935A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Speaker attributed transcript generation
CN112951211A (en) * 2021-04-22 2021-06-11 中国科学院声学研究所 Voice awakening method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183198B2 (en) * 2013-03-19 2015-11-10 International Business Machines Corporation Customizable and low-latency interactive computer-aided translation
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Segment-level Training of ANNs Based on Acoustic Confidence Measures for Hybrid HMM/ANN Speech Recognition;S. Pavankumar Dubagunta等;ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK;第6435-6439页 *
Research on a speaker-independent command word recognition algorithm based on Chinese phonemes; Zhang Qiuyu, Zhao Yanmin, Li Jianhai; Science Technology and Engineering (No. 08); pp. 64, 66, 74 *
Duration- and boundary-information-based confidence measures in a spoken keyword detection system; Li Wenxin, Qu Dan, Li Bicheng, Wang Bingxi; Journal of Applied Sciences (No. 06); pp. 34-40 *



Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant