CN113593560A - Customizable low-delay command word recognition method and device - Google Patents

Customizable low-delay command word recognition method and device

Info

Publication number
CN113593560A
Authority
CN
China
Prior art keywords
command word
modeling unit
acoustic features
voice data
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110865579.8A
Other languages
Chinese (zh)
Other versions
CN113593560B (en)
Inventor
司玉景
李全忠
何国涛
蒲瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Original Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Puqiang Times Zhuhai Hengqin Information Technology Co ltd filed Critical Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority to CN202110865579.8A priority Critical patent/CN113593560B/en
Publication of CN113593560A publication Critical patent/CN113593560A/en
Application granted granted Critical
Publication of CN113593560B publication Critical patent/CN113593560B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a customizable low-latency command word recognition method and device. The method comprises: obtaining speech to be recognized and determining the acoustic features to be processed from it; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, where each modeling unit is a toned pinyin syllable; calculating, from the posterior probabilities, the confidence of each command word and the time points at which the modeling units contained in the command word appear; and deciding whether to output the command word according to the confidence and the time points. The method models all toned pinyin syllables in Chinese and uses a simple, efficient scoring mechanism to perform low-latency recognition of a command word list, reducing the development cost and time cost of command word recognition. The confidence computation has very low computational and space complexity, high accuracy, and a low false wake-up rate, and can detect in real time whether a command word has appeared.

Description

Customizable low-delay command word recognition method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a customizable low-delay command word recognition method and device.
Background
In recent years, with the continuous development of information technology and the Internet of Things, voice has gained more and more attention as the most direct and convenient means of human-computer interaction. Command word recognition is an important branch of speech recognition and is widely used in voice command control systems. The task of a low-latency command word recognition system is to automatically find and locate pre-specified command words in a continuous speech stream, and the whole process is real-time: once a command word appears, the system must immediately return the corresponding result. However, unlike a conventional text document, speech data is an encoding of the acoustic signal, from which a computer cannot directly extract effective information. In addition, various extrinsic factors (e.g., background noise, speaking rate, accent) make developing an effective command word recognition system complicated and difficult.
In the related art, command word recognition systems can be divided into customizable and non-customizable systems according to whether the command words can be customized. In a customizable system, the command word detection model does not depend on the user-specified command words, so the model need not be retrained when the user modifies the command word list. In a non-customizable system, the command word list is tied to the model: when the user wants to modify the list, recordings of the new command words must be collected and labeled and the model retrained, which undoubtedly increases time cost and development cost. Existing command word recognition techniques include dynamic time warping (DTW), hidden Markov model (HMM) based methods, and deep learning based methods. The keyword/filler framework based on HMM + DNN can support customizable command words, but its accuracy is worse than that of deep learning based methods, its decoding computation is complex, and its memory usage is large.
Disclosure of Invention
In view of the above, the present invention provides a customizable low-latency command word recognition method and apparatus to overcome the shortcomings of the prior art, namely the poor accuracy of customizable systems, high decoding computational complexity, and large memory usage.
In order to achieve the purpose, the invention adopts the following technical scheme: a customizable low-latency command word recognition method, comprising:
acquiring a voice to be recognized, and determining an acoustic feature to be processed according to the voice to be recognized;
inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;
calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability;
and judging whether to output the command word according to the confidence coefficient and the time point.
Further, the method also comprises the following steps: constructing a neural network classification model, wherein the constructing of the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling a corresponding modeling unit for the voice data;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, and generating a neural network classification model.
Further, the confidence corresponding to each command word is calculated from the posterior probabilities as follows:

t_i = argmax_{h_max ≤ k ≤ t} p_ik   (1)

f(t) = ( Π_{i=1..n} p_{i,t_i} )^(1/n)   (2)

wherein p_ik represents the posterior probability corresponding to the i-th modeling unit at time point k; h_max = t - window_size represents the starting point of command word detection; window_size represents the time window for command word detection, taken as the average duration of the command words; t_i represents the time point with the maximum posterior probability corresponding to the i-th modeling unit within the command word detection time window; f(t) represents the confidence; and n represents the number of modeling units contained in the command word.
Further, the determining whether to output the command word according to the confidence and the time point includes:
comparing the confidence with a preset threshold;
and if the confidence coefficient of the command word is greater than or equal to a preset threshold value and the time point of the modeling unit contained in the command word meets the preset time condition, outputting the command word.
Further, if the confidences of a plurality of command words are greater than or equal to the preset threshold and the time points of the modeling units contained in those command words meet the preset time condition, the command word with the maximum confidence is output.
Further, before labeling the corresponding modeling unit to the voice data, the method further includes:
and modeling the pinyin with tones corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.
Further, before determining the acoustic feature to be processed according to the speech to be recognized, the method further includes:
and carrying out noise reduction processing on the voice to be recognized.
Further, the neural network classification model is a deep feedforward sequence memory neural network.
The embodiment of the present application provides a customizable low-latency command word recognition device, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a voice to be recognized and determining an acoustic feature to be processed according to the voice to be recognized;
the recognition module is used for inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;
the calculation module is used for calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability;
and the output module is used for judging whether to output the command word according to the confidence coefficient and the time point.
Further, the method also comprises the following steps:
the building module is used for building a neural network classification model;
the building of the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling a corresponding modeling unit for the voice data;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units corresponding to the voice data to generate a neural network classification model.
By adopting the technical scheme, the invention can achieve the following beneficial effects:
the invention provides a customizable low-delay command word recognition method and device, which adopts a neural network model to model all toned pinyin in Chinese, combines the advantage of high differentiation of posterior probability output by a connection time sequence classification criterion, and adopts a simple and efficient scoring mechanism to provide the customizable low-delay command word recognition method. The method comprises the steps of modeling pinyin information of voice signals by using a deep feedforward sequence memory neural network (DFSMN) and a connection timing sequence classification criterion (CTC), training a model by using mass voice data, and recognizing a command word list by using the trained model. In addition, the invention adopts a simple and efficient scoring mechanism to complete the task of identifying the low-delay command word list. Aiming at the requirement of changing the command words, the invention only needs to provide the toned pinyin information corresponding to the command words without retraining the model, thereby greatly reducing the development cost and the time cost of the command word recognition system.
The confidence calculation method the invention provides for the CTC model has extremely low computational and space complexity, high accuracy, and a low false wake-up rate; in addition, its latency is low, so it can detect in real time whether a command word has appeared.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in their description are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating steps of a customizable low-latency command word recognition method according to the present invention;
FIG. 2 is a schematic flow chart illustrating a customizable low-latency command word recognition method according to the present invention;
FIG. 3 is a schematic structural diagram of a customizable low-latency command word recognition apparatus according to the present invention;
fig. 4 is a schematic structural diagram of a computer device in a hardware operating environment related to the customizable low-latency command word recognition method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
A specific customizable low-latency command word recognition method and apparatus provided in the embodiments of the present application are described below with reference to the accompanying drawings.
As shown in fig. 1, the customizable low-latency command word recognition method provided in the embodiment of the present application includes:
s101, acquiring a voice to be recognized, and determining an acoustic feature to be processed according to the voice to be recognized;
it is understood that speech is uttered by the user; for example, a user can say that "open the refrigerator" to the intelligent household refrigerator, then "open the refrigerator" is the voice to be recognized, the voice to be recognized needs to be processed in the application to obtain the acoustic features of the voice to be recognized, wherein the processing mode can be realized by adopting the prior art, for example, the voice to be recognized is subjected to preprocessing, windowing, FFT conversion, mel filter and other steps to extract the acoustic features of the voice to be recognized. Wherein the preprocessing may be a sound denoising processing.
S102, inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;
the method trains a neural network classification model in advance, and then inputs the obtained acoustic features into the neural network classification model for calculation to obtain the posterior probability of a modeling unit (Pinyin with tone). Among them, the posterior probability is one of the basic concepts of information theory. In a communication system, the probability that a message is transmitted after being received is known by the receiver as the a posteriori probability.
S103, calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability;
the confidence coefficient calculation method is provided for the neural network classification model, so that the calculation complexity and the space complexity are reduced, and the accuracy is high.
And S104, judging whether to output the command word according to the confidence coefficient and the time point.
Finally, a decision is made according to the confidence and, if warranted, the command word is output; for example, "open the refrigerator" is output.
The working principle of the customizable low-latency command word recognition method is as follows. Referring to fig. 2, a neural network classification model is first constructed; the speech to be recognized is then obtained and the acoustic features to be processed are determined from it; the acoustic features are input into the pre-constructed model for recognition, yielding the posterior probability of each modeling unit to which the features belong, each modeling unit being a toned pinyin syllable comprising an initial, a final and a tone; the confidence of each command word and the time points at which its modeling units appear are calculated from the posterior probabilities; and whether to output the command word is decided according to the confidence and the time points. The neural network classification model is a deep feedforward sequence memory neural network (DFSMN); it is understood that LSTM, GRU, and similar architectures may also be adopted, and the present application is not limited thereto. Because the modeling units are built from initials, finals and tones, better Chinese character recognition is obtained, improving accuracy and reducing recognition errors.
In some embodiments, further comprising: constructing a neural network classification model, wherein the constructing of the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling a corresponding modeling unit for the voice data;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, and generating a neural network classification model.
Specifically, the neural network classification model is constructed in advance as follows. Speech data is collected and labeled; pre-processing, windowing, FFT, Mel filtering and similar steps are applied to extract the acoustic features used for model training; the acoustic features are input into the neural network, and the posterior probability of each modeling unit to which they belong is obtained. Taking the labeled toned pinyin as the output, the acoustic features are iteratively trained with the connectionist temporal classification (CTC) loss function. The model parameters are trained on massive data by deep learning, yielding a usable deep feedforward sequence memory neural network classification model.
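The CTC criterion used above maps frame-level network outputs to label sequences without frame alignments, by first merging consecutive repeated labels and then removing blanks. A minimal sketch of that many-to-one collapse (the blank symbol name is illustrative, and this is only the decoding-side rule, not the patent's training code):

```python
def ctc_collapse(path, blank="<b>"):
    """Apply the CTC many-to-one mapping to a frame-level label path:
    merge consecutive repeats, then drop blanks.

    This is why the network can emit a toned-pinyin unit at any frame
    within its span and still yield the same label sequence.
    """
    out = []
    prev = None
    for sym in path:
        # Keep a symbol only when it differs from the previous frame
        # and is not the blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out
```

For example, the frame path `<b> da3 da3 <b> kai1` collapses to the label sequence `da3 kai1`, while a repeat separated by a blank (e.g. `a <b> a`) is kept as two labels.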
In some embodiments, the posterior probabilities of the toned pinyin obtained from the neural network classification model are used to calculate the confidence of each command word, i.e., the probability that the command word has appeared, assuming the command word contains n toned pinyin syllables. (Because the CTC model is robust, the invention uses the raw posterior probabilities and omits the posterior-smoothing step.)
The calculation formula for calculating the confidence corresponding to each command word according to the posterior probability is as follows:
t_i = argmax_{h_max ≤ k ≤ t} p_ik   (1)

f(t) = ( Π_{i=1..n} p_{i,t_i} )^(1/n)   (2)

wherein p_ik represents the posterior probability corresponding to the i-th modeling unit at time point k; h_max = t - window_size represents the starting point of command word detection; window_size represents the time window for command word detection, taken as the average duration of the command words; t_i represents the time point with the maximum posterior probability corresponding to the i-th modeling unit within the command word detection time window; f(t) represents the confidence; and n represents the number of modeling units contained in the command word.
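Given the symbol definitions in the text (p_ik, h_max = t - window_size, t_i, n), a common confidence score in CTC-based keyword spotting is the geometric mean of each unit's peak posterior within the detection window. The sketch below implements that reading; since the original equation images are unavailable, the exact form (geometric mean) is an assumption, and all names are illustrative.

```python
def command_confidence(posteriors, unit_ids, t, window_size):
    """Confidence f(t) for one command word: the geometric mean, over the
    word's n modeling units, of each unit's maximum posterior inside the
    detection window [t - window_size, t].

    posteriors[k][i] plays the role of p_ik.  Returns (confidence,
    peak_times), where peak_times[i] corresponds to t_i.
    """
    h_max = max(0, t - window_size)  # starting point of detection
    peaks, times = [], []
    for i in unit_ids:
        # t_i: the frame in the window where unit i's posterior peaks.
        k_best = max(range(h_max, t + 1), key=lambda k: posteriors[k][i])
        peaks.append(posteriors[k_best][i])
        times.append(k_best)
    conf = 1.0
    for p in peaks:
        conf *= p
    n = len(unit_ids)
    return conf ** (1.0 / n), times
```

The n-th root keeps scores comparable across command words of different lengths, so a single threshold can be shared by the whole list.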
In some embodiments, the determining whether to output the command word according to the confidence and the time point includes:
comparing the confidence with a preset threshold;
and if the confidence coefficient of the command word is greater than or equal to a preset threshold value and the time point of the modeling unit contained in the command word meets the preset time condition, outputting the command word.
Preferably, if the confidences of a plurality of command words are greater than or equal to the preset threshold and the preset time condition is met, the command word with the highest confidence is output.
Specifically, when the confidence is greater than or equal to a preset threshold and the time sequence is satisfied, the command word is output, and if a plurality of command words simultaneously satisfy the above condition, the command word with the highest confidence is output. That is, when a command word is detected at time t, the command word needs to satisfy the following conditions:
f(t) ≥ threshold   (3)
t_1 ≤ t_2 ≤ … ≤ t_n   (4)
where threshold is a preset threshold, t_1 is the time point corresponding to the first modeling unit in the command word, t_2 is the time point corresponding to the second modeling unit, and t_n is the time point corresponding to the n-th modeling unit. For example, if the command word with the highest confidence is "open the refrigerator", it is first converted into the toned pinyin "da3 kai1 bing1 xiang1", where the time point corresponding to the modeling unit "da3" is t_1, that of "kai1" is t_2, that of "bing1" is t_3, and that of "xiang1" is t_4. The condition t_1 ≤ t_2 ≤ t_3 ≤ t_4 must hold, i.e., the modeling units contained in "open the refrigerator" must satisfy the preset time condition, before the command "open the refrigerator" is output. If the confidences of several command words are greater than the preset threshold and the time points of their modeling units all satisfy the preset time condition, the command word with the maximum confidence among them is output.
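Conditions (3) and (4), together with the tie-breaking rule, can be sketched as a small decision function. The `candidates` mapping and its shape are hypothetical names for illustration.

```python
def decide(candidates, threshold):
    """Output rule for the detector at time t.

    candidates maps command word -> (confidence, [t_1, ..., t_n]).
    A word qualifies when its confidence meets the threshold (condition 3)
    and its unit peak times are non-decreasing (condition 4); among all
    qualifying words, the one with the highest confidence is returned,
    or None when nothing qualifies.
    """
    best_word, best_conf = None, -1.0
    for word, (conf, times) in candidates.items():
        ordered = all(a <= b for a, b in zip(times, times[1:]))
        if conf >= threshold and ordered and conf > best_conf:
            best_word, best_conf = word, conf
    return best_word
```

A word whose units peak out of order ("xiang1" before "da3", say) is rejected even with a high confidence, which is what keeps scattered false matches from firing.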
In some embodiments, before labeling the corresponding modeling unit for the voice data, the method further includes:
and modeling the pinyin with tones corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.
The modeling unit in the present application is the toned pinyin syllable, comprising an initial, a final and a tone, which improves character recognition accuracy.
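A minimal sketch of decomposing a toned-pinyin syllable into initial, final and tone. The initial inventory and the trailing-digit tone convention are assumptions for illustration, since the patent does not list its exact unit set.

```python
# Hypothetical pinyin initial inventory; digraphs listed first so that
# "zh"/"ch"/"sh" match before "z"/"c"/"s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_toned_pinyin(syllable):
    """Split a toned-pinyin syllable like 'bing1' into (initial, final, tone).

    Zero-initial syllables (e.g. 'an4') return '' as the initial.
    Assumes the tone is written as a trailing digit.
    """
    tone = syllable[-1]
    assert tone in "12345", "expected a trailing tone digit (5 = neutral)"
    body = syllable[:-1]
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone
```

For example, the command word "da3 kai1 bing1 xiang1" would decompose into units such as ("b", "ing", "1"), which is the initial/final/tone structure the modeling units are built from.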
Preferably, before determining the acoustic feature to be processed according to the speech to be recognized, the method further includes:
and carrying out noise reduction processing on the voice to be recognized.
Before the acoustic features of the voice to be recognized are extracted, denoising processing is carried out on the voice to be recognized, and noise interference is removed.
As shown in fig. 3, an embodiment of the present application provides a customizable low-latency command word recognition apparatus, including:
an obtaining module 301, configured to obtain a speech to be recognized, and determine an acoustic feature to be processed according to the speech to be recognized;
the identification module 302 is configured to input the acoustic features into a pre-constructed neural network classification model for identification, and obtain a posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;
the calculating module 303 is configured to calculate, according to the posterior probability, a confidence corresponding to each command word and a time point at which a modeling unit included in the command word appears;
and the output module 304 is configured to determine whether to output the command word according to the confidence and the time point.
The working principle of the customizable low-latency command word recognition device provided by the present application is as follows: the obtaining module 301 acquires the speech to be recognized and determines the acoustic features to be processed from it; the identification module 302 inputs the acoustic features into the pre-constructed neural network classification model for recognition and obtains the posterior probability of each modeling unit to which the features belong, each modeling unit being a toned pinyin syllable comprising an initial, a final and a tone; the calculating module 303 calculates, from the posterior probabilities, the confidence of each command word and the time points at which its modeling units appear; and the output module 304 decides whether to output the command word according to the confidence and the time points.
Preferably, the customizable low-latency command word recognition device provided by the present application further includes:
the building module is used for building a neural network classification model;
the building of the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling a corresponding modeling unit for the voice data;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units corresponding to the voice data to generate a neural network classification model.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the customizable low-latency command word recognition method provided by any one of the above embodiments;
the processor is used to call and execute the computer program in the memory. The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The computer device stores an operating system, and the memory is an example of a computer-readable medium. The computer program, when executed by the processor, causes the processor to perform a customizable low-latency command word recognition method, such as the one shown in fig. 4, which is a block diagram of only a portion of the structure associated with the present solution and does not constitute a limitation on the computing device to which the present solution applies, and a particular computing device may include more or less components than shown in the figures, or combine certain components, or have a different arrangement of components.
In one embodiment, the customizable low-latency command word recognition method provided herein may be implemented in the form of a computer program that is executable on a computer device such as the one shown in fig. 4.
In some embodiments, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring a voice to be recognized, and determining acoustic features to be processed according to the voice to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and acquiring the posterior probability that the acoustic features belong to each modeling unit, wherein each modeling unit is a toned pinyin unit comprising an initial, a final, and a tone; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units contained in the command word appear; and judging whether to output the command word according to the confidence and the time points.
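The first of these steps turns the speech to be recognized into per-frame acoustic features. As a shape-level sketch only (the patent does not specify the feature type; real systems would use filterbank or MFCC features, and the frame and hop sizes below are illustrative assumptions), framing a waveform and computing a trivial log-energy feature could look like:

```python
import numpy as np

def frame_log_energy_features(wave, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames and compute a
    toy per-frame feature (log energy). Frame/hop sizes correspond
    to 25 ms / 10 ms at 16 kHz, a common but assumed configuration."""
    n = 1 + max(0, len(wave) - frame_len) // hop
    feats = np.empty((n, 1))
    for i in range(n):
        f = wave[i * hop : i * hop + frame_len]
        feats[i, 0] = np.log(np.sum(f * f) + 1e-10)  # floor avoids log(0)
    return feats
```

The resulting (frames x features) matrix is what a neural network classification model would consume frame by frame.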
The present application also provides a computer storage medium, examples of which include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassette storage or other magnetic storage devices, or any other non-transmission medium, that can be used to store information that can be accessed by a computing device.
In some embodiments, the present invention further provides a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of: acquiring a voice to be recognized, and determining acoustic features to be processed according to the voice to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and acquiring the posterior probability that the acoustic features belong to each modeling unit, wherein each modeling unit is a toned pinyin unit comprising an initial, a final, and a tone; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units contained in the command word appear; and judging whether to output the command word according to the confidence and the time points.
In summary, the present invention provides a customizable low-latency command word recognition method and apparatus, comprising: acquiring a voice to be recognized, and determining acoustic features to be processed according to the voice to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and acquiring the posterior probability that the acoustic features belong to each modeling unit, wherein each modeling unit is a toned pinyin unit comprising an initial, a final, and a tone; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units contained in the command word appear; and judging whether to output the command word according to the confidence and the time points. The invention models all toned pinyin in Chinese and adopts a simple, efficient scoring mechanism to accomplish low-latency recognition of a command word list, greatly reducing the development and time costs of command word recognition. The confidence calculation method adopted by the invention has extremely low computational and space complexity, higher accuracy, and a lower false wake-up rate; in addition, it has low latency and can detect in real time whether a command word appears.
It is to be understood that the method embodiments provided above correspond to the apparatus embodiments described above; for the corresponding specific contents, reference may be made to each other, which is not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A customizable low-latency command word recognition method, comprising:
acquiring a voice to be recognized, and determining acoustic features to be processed according to the voice to be recognized;
inputting the acoustic features into a pre-constructed neural network classification model for recognition, and acquiring the posterior probability that the acoustic features belong to each modeling unit, wherein each modeling unit is a toned pinyin unit comprising an initial, a final, and a tone;
calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units contained in the command word appear;
and judging whether to output the command word according to the confidence and the time points.
2. The method of claim 1, further comprising constructing the neural network classification model, wherein the constructing comprises:
acquiring voice data from a training voice library, and labeling the voice data with its corresponding modeling units;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability that the acoustic features belong to each modeling unit;
and iteratively training on the acoustic features corresponding to the voice data with a connectionist temporal classification (CTC) loss function, based on the posterior probabilities and the labeled modeling units, to generate the neural network classification model.
3. The method according to claim 1 or 2, wherein the confidence corresponding to each command word is calculated from the posterior probabilities as:

t_i = argmax_{h_max ≤ k ≤ t} p_ik

f(t) = ( ∏_{i=1}^{n} p_{i,t_i} )^{1/n}

wherein p_ik denotes the posterior probability of the i-th modeling unit at time point k; h_max = t − window_size denotes the starting point of command word detection; window_size denotes the command word detection time window, taken as the average duration of the command words; t_i denotes the time point within the detection window at which the posterior probability of the i-th modeling unit is maximal; f(t) denotes the confidence; and n denotes the number of modeling units contained in the command word.
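One reading of this scoring scheme — take each modeling unit's peak posterior within a sliding detection window, record when the peak occurs, and combine the peaks by geometric mean — can be sketched in numpy as follows. This is an illustrative sketch, not the patent's code; the parameter names `unit_ids`, `t`, and `window_size` are assumptions:

```python
import numpy as np

def command_word_confidence(posteriors, unit_ids, t, window_size):
    """Confidence of one command word at current frame t.
    `posteriors` has shape (T, num_units); `unit_ids` lists the
    modeling units the command word contains, in order."""
    h_max = max(0, t - window_size)
    window = posteriors[h_max : t + 1]       # frames in the detection window
    peaks, times = [], []
    for i in unit_ids:
        k = int(np.argmax(window[:, i]))     # t_i: frame of this unit's peak
        peaks.append(window[k, i])
        times.append(h_max + k)
    # Geometric mean of the per-unit peak posteriors
    conf = float(np.prod(peaks) ** (1.0 / len(unit_ids)))
    return conf, times
```

The geometric mean keeps the score in [0, 1] and only needs one argmax per unit, which is consistent with the very low computational and space complexity claimed for the confidence calculation.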
4. The method of claim 1, wherein judging whether to output the command word according to the confidence and the time points comprises:
comparing the confidence with a preset threshold;
and, if the confidence of the command word is greater than or equal to the preset threshold and the time points of the modeling units contained in the command word satisfy a preset time condition, outputting the command word.
5. The method of claim 4, wherein, if the confidences of a plurality of command words are greater than or equal to the preset threshold and the preset time condition is satisfied, the command word with the maximum confidence is output.
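The decision rule in claims 4 and 5 can be sketched as follows. The claims do not define the "preset time condition"; as an assumption for illustration, it is read here as the command word's modeling units peaking in temporal order:

```python
def select_command_word(candidates, threshold):
    """Threshold each candidate's confidence, check that its modeling
    units peak in order (an assumed reading of the 'preset time
    condition'), and output the highest-confidence survivor.
    `candidates` maps command word -> (confidence, [unit peak times])."""
    best_word, best_conf = None, -1.0
    for word, (conf, times) in candidates.items():
        in_order = all(a <= b for a, b in zip(times, times[1:]))
        if conf >= threshold and in_order and conf > best_conf:
            best_word, best_conf = word, conf
    return best_word  # None when no candidate passes
```

Checking the peak times in order is one cheap way to reject spurious matches where the units fire with high posteriors but in the wrong sequence.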
6. The method of claim 2, wherein, before labeling the voice data with corresponding modeling units, the method further comprises:
modeling the toned pinyin corresponding to the voice data with initials, finals, and tones to generate a plurality of modeling units.
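To illustrate the decomposition in claim 6, the sketch below splits a toned pinyin syllable into initial, final, and tone. The initial inventory and the input format (tone digit appended, e.g. "zhong1") are assumptions for illustration, not taken from the patent:

```python
# Standard Mandarin initials, with two-letter initials listed first so
# that "zh" matches before "z" (a greedy longest-prefix assumption).
_INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_toned_pinyin(syllable):
    """'zhong1' -> ('zh', 'ong', '1'); syllables with no initial
    (e.g. 'an4') yield an empty initial."""
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for ini in _INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone
```

Each of the resulting components (or initial plus tonal final, depending on the chosen inventory) can then serve as a modeling unit label for training.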
7. The method of claim 1, wherein, prior to determining the acoustic features to be processed according to the voice to be recognized, the method further comprises:
performing noise reduction processing on the voice to be recognized.
8. The method of claim 2, wherein the neural network classification model is a deep feedforward sequential memory network.
9. A customizable, low-latency command word recognition device, comprising:
an acquisition module, used for acquiring a voice to be recognized and determining acoustic features to be processed according to the voice to be recognized;
a recognition module, used for inputting the acoustic features into a pre-constructed neural network classification model for recognition and acquiring the posterior probability that the acoustic features belong to each modeling unit, wherein each modeling unit is a toned pinyin unit comprising an initial, a final, and a tone;
a calculation module, used for calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units contained in the command word appear;
and an output module, used for judging whether to output the command word according to the confidence and the time points.
10. The apparatus of claim 9, further comprising:
a building module, used for building the neural network classification model, wherein the building comprises:
acquiring voice data from a training voice library, and labeling the voice data with its corresponding modeling units;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability that the acoustic features belong to each modeling unit;
and iteratively training on the acoustic features corresponding to the voice data with a connectionist temporal classification (CTC) loss function, based on the posterior probabilities and the labeled modeling units, to generate the neural network classification model.
CN202110865579.8A 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device Active CN113593560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110865579.8A CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110865579.8A CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device

Publications (2)

Publication Number Publication Date
CN113593560A true CN113593560A (en) 2021-11-02
CN113593560B CN113593560B (en) 2024-04-16

Family

ID=78251985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110865579.8A Active CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device

Country Status (1)

Country Link
CN (1) CN113593560B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
US20140288913A1 (en) * 2013-03-19 2014-09-25 International Business Machines Corporation Customizable and low-latency interactive computer-aided translation
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
KR20170119152A (en) * 2016-04-18 2017-10-26 한양대학교 산학협력단 Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
WO2020222935A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Speaker attributed transcript generation
US20210020162A1 (en) * 2019-07-19 2021-01-21 Cisco Technology, Inc. Generating and training new wake words
CN112951211A (en) * 2021-04-22 2021-06-11 中国科学院声学研究所 Voice awakening method and device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
US20140288913A1 (en) * 2013-03-19 2014-09-25 International Business Machines Corporation Customizable and low-latency interactive computer-aided translation
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
KR20170119152A (en) * 2016-04-18 2017-10-26 한양대학교 산학협력단 Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
WO2020222935A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Speaker attributed transcript generation
US20200349950A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Speaker Attributed Transcript Generation
US20210020162A1 (en) * 2019-07-19 2021-01-21 Cisco Technology, Inc. Generating and training new wake words
CN112951211A (en) * 2021-04-22 2021-06-11 中国科学院声学研究所 Voice awakening method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
S. PAVANKUMAR DUBAGUNTA et al.: "Segment-level Training of ANNs Based on Acoustic Confidence Measures for Hybrid HMM/ANN Speech Recognition", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, pages 6435-6439 *
ZHANG Qiuyu; ZHAO Yanmin; LI Jianhai: "Research on speaker-independent command word recognition algorithms based on Chinese phonemes", Science Technology and Engineering, no. 08, page 64 *
LI Wenxin; QU Dan; LI Bicheng; WANG Bingxi: "Confidence measures based on duration and boundary information in spoken keyword detection systems", Journal of Applied Sciences, no. 06, pages 34-40 *

Also Published As

Publication number Publication date
CN113593560B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
US11475881B2 (en) Deep multi-channel acoustic modeling
Zhang et al. Boosting contextual information for deep neural network based voice activity detection
CN106940998B (en) Execution method and device for setting operation
US8930196B2 (en) System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands
US11069352B1 (en) Media presence detection
CN106710599A (en) Particular sound source detection method and particular sound source detection system based on deep neural network
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN112368769A (en) End-to-end stream keyword detection
US11200885B1 (en) Goal-oriented dialog system
US11205428B1 (en) Deleting user data using keys
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
US20240013784A1 (en) Speaker recognition adaptation
US20230377574A1 (en) Word selection for natural language interface
CN112825250A (en) Voice wake-up method, apparatus, storage medium and program product
CN110853669A (en) Audio identification method, device and equipment
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
US11769491B1 (en) Performing utterance detection using convolution
Li et al. Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement.
CN113593560B (en) Customizable low-delay command word recognition method and device
CN113744734A (en) Voice wake-up method and device, electronic equipment and storage medium
CN113658593B (en) Wake-up realization method and device based on voice recognition
Nasiri et al. Audiomask: Robust sound event detection using mask r-cnn and frame-level classifier
KR20220129034A (en) Small footprint multi-channel keyword spotting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant