CN113593560B - Customizable low-delay command word recognition method and device - Google Patents


Info

Publication number
CN113593560B
CN113593560B (application CN202110865579.8A)
Authority
CN
China
Prior art keywords
command word
modeling unit
neural network
voice data
posterior probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110865579.8A
Other languages
Chinese (zh)
Other versions
CN113593560A (en)
Inventor
司玉景
李全忠
何国涛
蒲瑶
Current Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Original Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority to CN202110865579.8A
Publication of CN113593560A
Application granted
Publication of CN113593560B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The invention relates to a customizable low-latency command word recognition method and device, comprising: acquiring the speech to be recognized and determining the acoustic features to be processed from it; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin; calculating, according to the posterior probabilities, the confidence of each command word and the time points at which the modeling units included in the command word appear; and deciding whether to output the command word according to the confidence and the time points. The invention models all toned pinyin in Chinese and adopts a simple, efficient scoring mechanism to accomplish low-latency recognition of a command word list, reducing the development and time cost of command word recognition. The confidence calculation method adopted by the invention has extremely low computational and space complexity, high accuracy, and a low false wake-up rate, and can detect in real time whether a command word appears.

Description

Customizable low-delay command word recognition method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a customizable low-delay command word recognition method and device.
Background
In recent years, with the continuous development of information technology and the Internet of Things, speech, as the most direct and convenient means of human-machine interaction, has attracted growing attention. Command word recognition is an important branch of speech recognition and is widely used in voice command-and-control systems. The task of a low-latency command word recognition system is to automatically find and locate a number of pre-specified command words in a continuous stream of speech, in real time: the system must give the corresponding result immediately once a command word appears. However, unlike a conventional text document, voice data is an encoding of an acoustic signal, a form from which a computer cannot directly extract the effective information. In addition, owing to a variety of extrinsic factors (e.g., background noise, speaking rate, accent), developing an effective command word recognition system is complicated and difficult.
In the related art, command word recognition systems can be divided into customizable and non-customizable systems according to whether the command words can be customized. Customizability means that the command word detection model does not depend on the command words specified by the user, so that when the user modifies the command word list, the model does not need to be retrained. In a non-customizable system, by contrast, the command word list is tied to the model: when the user wants to modify the list, recordings and labels for the new command words must be collected and the model retrained, which significantly increases time and development cost. Existing command word recognition techniques include dynamic time warping (DTW), hidden-Markov-model (HMM) based methods, and deep-learning-based methods. The keyword/filler framework based on HMM+DNN can achieve customizable command words, but its accuracy is inferior to purely deep-learning-based methods, its decoding is computationally complex, and its memory footprint is large.
Disclosure of Invention
In view of the above, the present invention aims to overcome the shortcomings of the prior art by providing a customizable low-latency command word recognition method and device, so as to solve the problems of poor recognition accuracy, high decoding complexity, and large memory footprint in existing customizable systems.
In order to achieve the above purpose, the invention adopts the following technical scheme: a customizable low-latency command word recognition method comprising:
acquiring voice to be recognized, and determining acoustic characteristics to be processed according to the voice to be recognized;
inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling units are toned pinyin, comprising initials, finals and tones;
calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
and judging whether to output the command word according to the confidence degree and the time point.
Further, the method further comprises the following steps: building a neural network classification model, the building the neural network classification model comprising:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, so as to generate a neural network classification model.
Further, the calculation formula adopted for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

t_i = argmax_{h_max ≤ k ≤ t} p_ik (1)

f(t) = ( ∏_{i=1}^{n} p_{i,t_i} )^{1/n} (2)

wherein p_ik denotes the posterior probability of the i-th modeling unit at time point k; h_max = t − window_size denotes the start point of command word detection; window_size denotes the time window for command word detection, taken as the average duration of the command words; t_i denotes the time point with the maximum posterior probability for the i-th modeling unit within the detection window; f(t) denotes the confidence; and n denotes the number of modeling units included in the command word.
Further, the determining whether to output the command word according to the confidence level and the time point includes:
comparing the confidence coefficient with a preset threshold value;
if the confidence of the command word is greater than or equal to the preset threshold, and the time points at which the modeling units included in the command word appear satisfy the preset time condition, outputting the command word.
Further, if the confidences of a plurality of command words are greater than or equal to the preset threshold and the time points at which their modeling units appear satisfy the preset time condition, outputting the command word with the highest confidence.
Further, before labeling the corresponding modeling unit for the voice data, the method further includes:
and modeling the tone pinyin corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.
Further, before determining the acoustic feature to be processed according to the voice to be recognized, the method further includes:
and carrying out noise reduction treatment on the voice to be recognized.
Further, the neural network classification model is a deep feed-forward sequence memory neural network.
The embodiment of the application provides a customizable low-delay command word recognition device, which comprises:
the acquisition module is used for acquiring the voice to be recognized and determining the acoustic characteristics to be processed according to the voice to be recognized;
the identification module is used for inputting the acoustic features into a pre-constructed neural network classification model for identification and obtaining posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling unit is tone pinyin and comprises initials, finals and tones;
the calculation module is used for calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
and the output module is used for judging whether to output the command word according to the confidence degree and the time point.
Further, the method further comprises the following steps:
the building module is used for building a neural network classification model;
the constructing the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling unit corresponding to the voice data, so as to generate a neural network classification model.
By adopting the technical scheme, the invention has the following beneficial effects:
the invention provides a customizable low-delay command word recognition method and device, which adopts a neural network model to model all the toned pinyin in Chinese, combines the advantage of large posterior probability differentiation of the output of a connection time sequence classification criterion, adopts a simple and efficient scoring mechanism, and provides the customizable low-delay command word recognition method. The method models pinyin information of voice signals by using a deep feed-forward sequence memory neural network (DFSMN) and a connection time sequence classification criterion (CTC), trains models by utilizing massive voice data, and recognizes command word lists by adopting the trained models. In addition, the invention adopts a simple and efficient scoring mechanism to complete the task of identifying the low-delay command word list. Aiming at the demand of changing command words, the invention does not need to retrain a model, only needs to provide the tone spelling information corresponding to the command words, and greatly reduces the development cost and the time cost of the command word recognition system.
The confidence calculation method provided by the invention is designed for the CTC model: it has extremely low computational and space complexity, high accuracy, a low false wake-up rate, and low latency, and can detect in real time whether a command word appears.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram illustrating steps of a customizable low-latency command word recognition method of the present invention;
FIG. 2 is a flow chart of a customizable low latency command word recognition method of the present invention;
FIG. 3 is a schematic diagram of a customizable low latency command word recognition device of the present invention;
FIG. 4 is a schematic diagram of the computer device, in a hardware operating environment, for the customizable low-latency command word recognition method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention are described in detail below. It will be apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the invention as defined by the claims.
A specific customizable low-latency command word recognition method and apparatus provided in embodiments of the present application are described below with reference to the accompanying drawings.
As shown in fig. 1, the customizable low-latency command word recognition method provided in the embodiment of the present application includes:
s101, acquiring voice to be recognized, and determining acoustic characteristics to be processed according to the voice to be recognized;
it will be appreciated that speech is uttered by the user; for example, a user may say that the intelligent household refrigerator is "opened", and then "opened" is the voice to be recognized, in this application, the voice to be recognized needs to be processed to obtain the acoustic feature of the voice to be recognized, where the processing manner may be implemented by using the existing technology, for example, the voice to be recognized is preprocessed, windowed, FFT transformed, mel filter, and other steps are performed to extract the acoustic feature of the voice to be recognized. Wherein the preprocessing may be acoustic denoising.
S102, inputting the acoustic features into a pre-constructed neural network classification model for recognition, and acquiring posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling unit is tone pinyin and comprises initials, finals and tones;
the method comprises the steps of training a neural network classification model in advance, and then inputting the obtained acoustic features into the neural network classification model for calculation to obtain posterior probability of a modeling unit (toned pinyin). The posterior probability is one of the basic concepts of the information theory. In a communication system, after a certain message is received, the probability that the message is transmitted is known to the receiving end as a posterior probability.
S103, calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
the confidence coefficient calculating method is provided for the neural network classification model, so that the calculating complexity and the space complexity are reduced, and the accuracy is high.
And S104, judging whether to output the command word according to the confidence degree and the time point.
Finally, a decision is made according to the confidence and the time points to output the command word, for example outputting "open the refrigerator".
The working principle of the customizable low-latency command word recognition method is as follows: referring to fig. 2, a neural network classification model is first constructed; the speech to be recognized is then acquired, and the acoustic features to be processed are determined from it; the acoustic features are input into the pre-constructed neural network classification model for recognition, obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin comprising initials, finals and tones; the confidence corresponding to each command word and the time points at which its modeling units appear are calculated from the posterior probabilities; and whether to output the command word is decided according to the confidence and the time points. The neural network classification model is a deep feed-forward sequence memory neural network (DFSMN); it is understood that models such as LSTM or GRU may also be adopted, which is not limited herein. Because the modeling units are constructed from initials, finals and tones, better Chinese recognition is obtained, improving recognition accuracy and reducing recognition errors.
In some embodiments, further comprising: building a neural network classification model, the building the neural network classification model comprising:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, so as to generate a neural network classification model.
Specifically, the neural network classification model is built in advance as follows: collect and label speech data; preprocess the collected speech data and apply windowing, FFT, mel filtering and similar steps to extract the acoustic features for model training; input the acoustic features corresponding to the speech data into the neural network for training, obtaining the posterior probability of each modeling unit to which the acoustic features belong, with the labeled toned pinyin as the output targets; and iteratively train on the acoustic features using the connectionist temporal classification (CTC) loss function. Under massive data, training of the model parameters is completed by deep learning, yielding a usable deep feed-forward sequence memory neural network classification model.
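The CTC criterion aligns frame-level network outputs to the pinyin label sequence via an optional blank symbol; a frame-level path is mapped back to labels by merging consecutive repeats and then dropping blanks. A minimal sketch of that collapsing rule follows; the blank id 0 and the integer labels are illustrative assumptions.

```python
def ctc_collapse(frame_labels, blank=0):
    """Collapse a frame-level CTC path into a label sequence:
    first merge consecutive repeats, then drop blank symbols."""
    merged = []
    for lab in frame_labels:
        if not merged or lab != merged[-1]:
            merged.append(lab)
    return [lab for lab in merged if lab != blank]

# e.g. frame path [da3 da3 _ kai1 kai1 _] collapses to [da3 kai1]
path = [1, 1, 0, 2, 2, 0]
print(ctc_collapse(path))  # -> [1, 2]
```

The blank symbol is what lets CTC represent repeated labels: `[3, _, 3]` collapses to two 3s, while `[3, 3]` collapses to one.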
In some embodiments, the application calculates the confidence corresponding to each command word, i.e. the probability that the command word appears, from the posterior probabilities of the toned pinyin computed by the neural network classification model, assuming the command word includes n toned pinyin. (Because the CTC model is robust, the invention uses the raw posterior probabilities and omits the posterior smoothing step.)
The calculation formula adopted for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

t_i = argmax_{h_max ≤ k ≤ t} p_ik (1)

f(t) = ( ∏_{i=1}^{n} p_{i,t_i} )^{1/n} (2)

wherein p_ik denotes the posterior probability of the i-th modeling unit at time point k; h_max = t − window_size denotes the start point of command word detection; window_size denotes the time window for command word detection, taken as the average duration of the command words; t_i denotes the time point with the maximum posterior probability for the i-th modeling unit within the detection window; f(t) denotes the confidence; and n denotes the number of modeling units included in the command word.
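As a sketch, the confidence described here, the geometric mean of each modeling unit's maximum posterior inside the detection window, together with the peak time points t_i, can be computed as follows. The posterior-matrix layout and argument names are assumptions for illustration, not taken from the patent.

```python
def command_confidence(posteriors, unit_ids, t, window_size):
    """Confidence f(t) for one command word at frame t.

    posteriors: posteriors[k][i] is the posterior of modeling unit i
                at frame k
    unit_ids:   modeling-unit indices of the command word, in order
    Returns (f_t, time_points), where time_points[i] is the frame
    with the maximum posterior for the i-th unit in the window."""
    h_max = max(0, t - window_size)  # start of the detection window
    peaks, time_points = [], []
    for i in unit_ids:
        k_best = max(range(h_max, t + 1), key=lambda k: posteriors[k][i])
        peaks.append(posteriors[k_best][i])
        time_points.append(k_best)
    f_t = 1.0
    for p in peaks:
        f_t *= p
    return f_t ** (1.0 / len(unit_ids)), time_points

# tiny synthetic example: two units, three frames
posteriors = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
f, times = command_confidence(posteriors, [0, 1], t=2, window_size=2)
```

Per frame this costs only one pass over the window per unit, which is why the space and computational complexity stay low.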
In some embodiments, the determining whether to output the command word according to the confidence level and the time point includes:
comparing the confidence coefficient with a preset threshold value;
if the confidence of the command word is greater than or equal to the preset threshold, and the time points at which the modeling units included in the command word appear satisfy the preset time condition, the command word is output.
Preferably, if the confidence coefficient of the plurality of command words is greater than or equal to a preset threshold value and a time preset condition is met, the command word with the highest confidence coefficient is output.
Specifically, when the confidence is greater than or equal to the preset threshold and the time-order condition is satisfied, the command word is output; if several command words satisfy these conditions simultaneously, the command word with the highest confidence is output. That is, for a command word to be detected at time point t, the following conditions must hold simultaneously:

f(t) ≥ threshold (3)

t_1 ≤ t_2 ≤ … ≤ t_n (4)

wherein threshold is the preset threshold, t_1 is the time point corresponding to the first modeling unit in the command word, t_2 the time point corresponding to the second modeling unit, and t_n the time point corresponding to the n-th modeling unit. For example, if the command word with the highest confidence is "open the refrigerator", it is first converted into the toned pinyin "da3 kai1 bing1 xiang1", where the time point corresponding to modeling unit "da3" is t_1, that corresponding to "kai1" is t_2, that corresponding to "bing1" is t_3, and that corresponding to "xiang1" is t_4; t_1 ≤ t_2 ≤ t_3 ≤ t_4 must then hold, and when the modeling units contained in "open the refrigerator" satisfy the preset time condition, the command "open the refrigerator" is output. If the confidences of several command words exceed the preset threshold and the time points of their modeling units satisfy the preset time condition, the command word with the maximum confidence among them is output.
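The decision rule of conditions (3) and (4), a threshold on the confidence, non-decreasing unit time points, and the highest-confidence winner among multiple hits, can be sketched as follows. The candidate-tuple layout is an assumption for illustration.

```python
def select_command(candidates, threshold):
    """Apply the decision rule: keep candidates whose confidence
    reaches the threshold and whose modeling-unit time points are
    non-decreasing (t_1 <= t_2 <= ... <= t_n); among the survivors,
    return the word with the highest confidence, or None.

    candidates: list of (word, confidence, time_points) tuples."""
    ok = [(word, conf) for word, conf, times in candidates
          if conf >= threshold
          and all(a <= b for a, b in zip(times, times[1:]))]
    if not ok:
        return None
    return max(ok, key=lambda wc: wc[1])[0]

cands = [
    ("da3 kai1 bing1 xiang1", 0.91, [10, 14, 19, 23]),  # monotonic
    ("guan1 bi4 bing1 xiang1", 0.95, [12, 9, 19, 23]),  # out of order
]
print(select_command(cands, 0.8))  # -> "da3 kai1 bing1 xiang1"
```

The second candidate is rejected despite its higher confidence because its unit time points violate condition (4).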
In some embodiments, before labeling the corresponding modeling unit for the voice data, the method further includes:
and modeling the tone pinyin corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.
The modeling units in the present application are toned pinyin, comprising initials, finals and tones, which improves recognition accuracy.
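A toned-pinyin syllable can be decomposed into initial, final and tone by longest-prefix matching against an initial inventory. This is a rough sketch; the exact inventory (including whether "y"/"w" count as initials) and the trailing-digit tone convention are assumptions, not mandated by the patent.

```python
# Two-letter initials must precede their one-letter prefixes so that
# longest-prefix matching picks "zh" before "z", etc.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_toned_pinyin(syllable):
    """Split a toned-pinyin syllable such as 'kai1' into
    (initial, final, tone). Zero-initial syllables like 'an4'
    yield an empty initial."""
    tone = syllable[-1]      # tone written as a trailing digit
    body = syllable[:-1]
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone

for s in "da3 kai1 bing1 xiang1".split():
    print(split_toned_pinyin(s))
```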
Preferably, before determining the acoustic feature to be processed according to the voice to be recognized, the method further includes:
and carrying out noise reduction treatment on the voice to be recognized.
Before acoustic features are extracted from the voice to be recognized, denoising processing is performed on the voice to be recognized, and noise interference is removed.
As shown in fig. 3, an embodiment of the present application provides a customizable low-latency command word recognition device, including:
the acquisition module 301 is configured to acquire a voice to be recognized, and determine an acoustic feature to be processed according to the voice to be recognized;
the identifying module 302 is configured to input the acoustic feature into a pre-constructed neural network classification model for identification, and obtain a posterior probability of each modeling unit to which the acoustic feature belongs; wherein the modeling unit is tone pinyin and comprises initials, finals and tones;
a calculating module 303, configured to calculate, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear;
and the output module 304 is configured to determine whether to output the command word according to the confidence level and the time point.
The working principle of the customizable low-latency command word recognition device provided in the application is as follows: the acquisition module 301 acquires the speech to be recognized and determines the acoustic features to be processed from it; the recognition module 302 inputs the acoustic features into a pre-constructed neural network classification model for recognition and obtains the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin comprising initials, finals and tones; the calculating module 303 calculates, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which its modeling units appear; and the output module 304 determines whether to output the command word according to the confidence and the time points.
Preferably, the customizable low-latency command word recognition device provided in the present application further includes:
the building module is used for building a neural network classification model;
the constructing the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling unit corresponding to the voice data, so as to generate a neural network classification model.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the customizable low-delay command word recognition method provided by any embodiment;
the processor is used to call and execute the computer program in the memory. The memory, which is an example of a computer-readable medium, may include volatile memory, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and stores the operating system. The computer program, when executed by the processor, causes the processor to perform the customizable low-latency command word recognition method. The architecture shown in fig. 4 is merely a block diagram of the portions relevant to the present application and does not limit the computer device to which the application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the customizable low-latency command word recognition method provided herein may be implemented in the form of a computer program that is executable on a computer device such as that shown in FIG. 4.
In some embodiments, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring the speech to be recognized, and determining the acoustic features to be processed from it; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin comprising initials, finals and tones; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points at which the modeling units included in the command word appear; and determining whether to output the command word according to the confidence and the time points.
The present application also provides a computer storage medium. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
In some embodiments, the present invention also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: acquiring speech to be recognized, and determining acoustic features to be processed according to the speech to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong, wherein the modeling units are toned pinyin units comprising initials, finals and tones; calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points of the modeling units included in the command word; and judging whether to output the command word according to the confidence and the time points.
In summary, the invention provides a customizable low-delay command word recognition method and device: speech to be recognized is acquired, and the acoustic features to be processed are determined from it; the acoustic features are input into a pre-constructed neural network classification model for recognition, and the posterior probability of each modeling unit to which the acoustic features belong is obtained, wherein the modeling units are toned pinyin units comprising initials, finals and tones; the confidence corresponding to each command word and the time points of the modeling units included in the command word are calculated from the posterior probabilities; and whether to output the command word is judged according to the confidence and the time points. The invention models all toned pinyin in Chinese and adopts a simple, efficient scoring mechanism to recognize a low-delay, customizable command word list, greatly reducing the development and time cost of command word recognition. The confidence calculation method adopted by the invention has extremely low computational and space complexity, high accuracy, a low false wake-up rate and, in addition, low latency, so that it can detect in real time whether a command word appears.
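The scoring mechanism can be sketched as follows. This is a minimal illustration assuming the confidence is the geometric mean, over the command word's n modeling units, of each unit's maximum posterior probability inside the detection window, consistent with the variable definitions in the claims; the posterior values are hypothetical:

```python
import math

def command_confidence(posteriors, t, window_size):
    """Confidence f(t) for one command word at frame t.

    posteriors: one per-frame posterior sequence per modeling unit
    of the command word (posteriors[i][k] = p_ik). The confidence is
    the geometric mean, over the n modeling units, of each unit's
    maximum posterior inside the window [t - window_size, t].
    Also returns the per-unit peak times t_i used by the time check.
    """
    h_max = max(0, t - window_size)   # start of the detection window
    n = len(posteriors)               # number of modeling units
    peaks, peak_times = [], []
    for unit in posteriors:
        window = unit[h_max:t + 1]
        p = max(window)
        peaks.append(p)
        peak_times.append(h_max + window.index(p))  # t_i
    conf = math.prod(peaks) ** (1.0 / n)            # f(t)
    return conf, peak_times
```

For two modeling units with posteriors `[0.1, 0.9, 0.2]` and `[0.1, 0.2, 0.8]` at `t=2` with `window_size=2`, the confidence is the geometric mean of 0.9 and 0.8 with peak times at frames 1 and 2; the peak times then feed the time-condition check before a command word is output.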
It can be understood that the above-provided method embodiments correspond to the above-described apparatus embodiments, and corresponding specific details may be referred to each other and will not be described herein.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely illustrative of the present invention and does not limit it; any variations or substitutions that would readily occur to a person skilled in the art fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A customizable low-latency command word recognition method, comprising:
acquiring speech to be recognized, and determining acoustic features to be processed according to the speech to be recognized;
inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling units are toned pinyin units comprising initials, finals and tones;
calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points of the modeling units included in the command word;
judging whether to output the command word according to the confidence and the time points;
further comprises: building a neural network classification model, the building the neural network classification model comprising:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units labeled for the voice data, performing iterative training on the acoustic features corresponding to the voice data with a connectionist temporal classification (CTC) loss function, and generating the neural network classification model;
the calculation formula adopted for calculating the confidence coefficient corresponding to each command word according to the posterior probability is as follows:
wherein p is ik When the time point k is represented, the posterior probability corresponding to the ith modeling unit; h is a max =t-window_size represents the start point of command word detection; window_size represents a time window for command word detection, and the average duration of the command words is taken; t is t i Representing a time point with maximum posterior probability corresponding to an ith modeling unit in a command word detection time window; f (t) represents confidence; n represents the number of modeling units included in the command word;
the neural network classification model is a deep feedforward sequential memory network (DFSMN).
2. The method of claim 1, wherein said judging whether to output the command word according to the confidence and the time points comprises:
comparing the confidence with a preset threshold;
if the confidence of a command word is greater than or equal to the preset threshold, and the time points of the modeling units included in the command word meet a preset time condition, outputting the command word.
3. The method of claim 2, wherein,
if the confidences of a plurality of command words are greater than or equal to the preset threshold and the preset time condition is met, outputting the command word with the maximum confidence.
4. The method of claim 1, further comprising, prior to labeling the voice data with the corresponding modeling units:
modeling the toned pinyin corresponding to the voice data with initials, finals and tones to generate a plurality of modeling units.
5. The method of claim 1, further comprising, before determining the acoustic features to be processed according to the speech to be recognized:
performing noise reduction processing on the speech to be recognized.
6. A customizable low-latency command word recognition device, comprising:
the acquisition module is used for acquiring the speech to be recognized and determining the acoustic features to be processed according to the speech to be recognized;
the recognition module is used for inputting the acoustic features into a pre-constructed neural network classification model for recognition and obtaining the posterior probability of each modeling unit to which the acoustic features belong; wherein the modeling units are toned pinyin units comprising initials, finals and tones;
the calculation module is used for calculating, according to the posterior probabilities, the confidence corresponding to each command word and the time points of the modeling units included in the command word;
the output module is used for judging whether to output the command word according to the confidence and the time points;
the building module is used for building a neural network classification model;
the constructing the neural network classification model comprises the following steps:
acquiring voice data from a training voice library, and labeling the voice data with a corresponding modeling unit;
acquiring acoustic features corresponding to the voice data;
inputting the acoustic features corresponding to the voice data into a neural network for training, and obtaining posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;
based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units labeled for the voice data, performing iterative training on the acoustic features corresponding to the voice data with a connectionist temporal classification (CTC) loss function, and generating the neural network classification model;
the calculation formula adopted for calculating the confidence corresponding to each command word according to the posterior probabilities is:

f(t) = ( ∏_{i=1}^{n} max_{h_max ≤ k ≤ t} p_ik )^(1/n)

wherein p_ik represents the posterior probability corresponding to the ith modeling unit at time point k; h_max = t − window_size represents the starting point of command word detection; window_size represents the time window for command word detection, taken as the average duration of the command words; t_i represents the time point at which the posterior probability corresponding to the ith modeling unit is maximal within the command word detection time window; f(t) represents the confidence; and n represents the number of modeling units included in the command word;
the neural network classification model is a deep feedforward sequential memory network (DFSMN).
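The modeling-unit construction recited above (decomposing each toned-pinyin syllable into an optional initial plus a tone-bearing final) can be sketched as follows. The initial inventory and the splitting rule are simplified assumptions for illustration, and the example command word is hypothetical:

```python
# Sketch of generating modeling units from toned pinyin: each
# syllable is split into an optional initial and a final carrying
# the tone digit. The initial list and the splitting rule are
# simplified assumptions, not the patent's exact inventory.

INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)  # match two-letter initials (zh/ch/sh) first

def to_units(toned_syllable):
    """'kai1' -> ['k', 'ai1']; 'an4' -> ['an4'] (zero-initial)."""
    body, tone = toned_syllable[:-1], toned_syllable[-1]
    for ini in INITIALS:
        if body.startswith(ini) and len(body) > len(ini):
            return [ini, body[len(ini):] + tone]
    return [body + tone]  # syllable with no initial

def command_to_units(command):
    """Decompose a space-separated toned-pinyin command word
    into its sequence of modeling units."""
    units = []
    for syllable in command.split():
        units.extend(to_units(syllable))
    return units
```

Under these assumptions, a hypothetical command "da3 kai1" yields the unit sequence d, a3, k, ai1, and each such unit becomes one output class of the neural network classification model.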
CN202110865579.8A 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device Active CN113593560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110865579.8A CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device


Publications (2)

Publication Number Publication Date
CN113593560A CN113593560A (en) 2021-11-02
CN113593560B true CN113593560B (en) 2024-04-16

Family

ID=78251985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110865579.8A Active CN113593560B (en) 2021-07-29 2021-07-29 Customizable low-delay command word recognition method and device

Country Status (1)

Country Link
CN (1) CN113593560B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
KR20170119152A (en) * 2016-04-18 2017-10-26 한양대학교 산학협력단 Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
WO2020222935A1 (en) * 2019-04-30 2020-11-05 Microsoft Technology Licensing, Llc Speaker attributed transcript generation
CN112951211A (en) * 2021-04-22 2021-06-11 中国科学院声学研究所 Voice awakening method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9183198B2 (en) * 2013-03-19 2015-11-10 International Business Machines Corporation Customizable and low-latency interactive computer-aided translation
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Segment-level Training of ANNs Based on Acoustic Confidence Measures for Hybrid HMM/ANN Speech Recognition;S. Pavankumar Dubagunta等;ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK;第6435-6439页 *
Research on a speaker-independent command word recognition algorithm based on Chinese phonemes; Zhang Qiuyu, Zhao Yanmin, Li Jianhai; Science Technology and Engineering (No. 08); pp. 64, 66, 74 *
Duration- and boundary-information-based confidence measures in a spoken keyword detection system; Li Wenxin, Qu Dan, Li Bicheng, Wang Bingxi; Journal of Applied Sciences (No. 06); pp. 34-40 *



Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant