CN113593560A

CN113593560A - Customizable low-delay command word recognition method and device

Info

Publication number: CN113593560A
Application number: CN202110865579.8A
Authority: CN
Inventors: 司玉景; 李全忠; 何国涛; 蒲瑶
Original assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Current assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date: 2021-07-29
Filing date: 2021-07-29
Publication date: 2021-11-02
Anticipated expiration: 2041-07-29
Also published as: CN113593560B

Abstract

The invention relates to a customizable low-delay command word recognition method and device, which comprises the steps of obtaining a voice to be recognized, and determining acoustic characteristics to be processed according to the voice to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; wherein, the modeling unit is Pinyin with tone; calculating the confidence coefficient of each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability; and judging whether to output the command word according to the confidence coefficient and the time point. The method can model all the toned pinyin in the Chinese, adopts a simple and efficient scoring mechanism, completes the identification task of a low-delay command word list, and reduces the development cost and time cost of command word identification. The confidence coefficient calculation method adopted by the invention has extremely low calculation complexity and space complexity, higher accuracy and lower false wake-up rate, and can detect whether the command word appears in real time.

Description

Customizable low-delay command word recognition method and device

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a customizable low-delay command word recognition method and device.

Background

In recent years, with the continuous development of information technology and the internet of things, voice is gaining more and more attention as the most direct and convenient human-computer interaction method. Command word recognition is an important field of speech recognition and is widely used in speech command control systems. The task of the low-latency command word recognition system is to automatically find and locate some pre-specified command words in a continuous speech segment, and the whole process is real-time, i.e. once a command word appears, the system needs to give corresponding results immediately. However, unlike the conventional text-formatted document, the voice data is used as a code for the voice signal, and it is difficult for the computer to directly extract the data form of the effective information. In addition, the development of an effective command word recognition system is complicated and difficult due to the influence of various extrinsic phonemes (e.g., background noise, speaker pace, accent, etc.).

In the related art, the command word recognition system can be classified into a customizable system and a non-customizable system according to the distinction whether the command word is customizable or not. The customizable attributes are that the command word detection model does not depend on the command words appointed by the user, so that the model does not need to be retrained when the user modifies the command word list; the command word list is related to the model, when the user wants to modify the command word list, the user needs to re-collect the recording and labeling of the command word and then re-train the model, which undoubtedly increases the time cost and development cost. Existing command word recognition techniques include a dynamic time warping method (DTW), a Hidden Markov Model (HMM) based method, and a deep learning based method. The keyword filtering (keyword/filter) framework based on the HMM + DNN can achieve the purpose of customizing command words, but the effect is not as good as that based on the deep learning method, the decoding calculation complexity is high, and the memory usage is large.

Disclosure of Invention

In view of the above, the present invention provides a customizable low-latency command word recognition method and apparatus to overcome the shortcomings of the prior art, so as to solve the problems of poor customizable system effect, high decoding computation complexity, and large memory usage in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme: a customizable low-latency command word recognition method, comprising:

acquiring a voice to be recognized, and determining an acoustic feature to be processed according to the voice to be recognized;

inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;

calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability;

and judging whether to output the command word according to the confidence coefficient and the time point.

Further, the method also comprises the following steps: constructing a neural network classification model, wherein the constructing of the neural network classification model comprises the following steps:

acquiring voice data from a training voice library, and labeling a corresponding modeling unit for the voice data;

acquiring acoustic features corresponding to the voice data;

inputting the acoustic features corresponding to the voice data into a neural network for training, and acquiring the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong;

and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong, and generating a neural network classification model.

Further, the calculation formula for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

wherein p is_ikWhen the time point k is represented, the posterior probability corresponding to the ith modeling unit; h is_maxT-window _ size represents the starting point of command word detection; window _ size represents the time window for command word detection, taking the average duration of the command words; t is t_iRepresenting a time point with the maximum posterior probability corresponding to the ith modeling unit in the command word detection time window; f (t) represents confidence; n represents the number of modeling units included in the command word.

Further, the determining whether to output the command word according to the confidence and the time point includes:

comparing the confidence with a preset threshold;

and if the confidence coefficient of the command word is greater than or equal to a preset threshold value and the time point of the modeling unit contained in the command word meets the preset time condition, outputting the command word.

Further, if the confidence degrees of a plurality of command words are larger than or equal to a preset threshold value and the time point of the modeling unit included in the command word meets a preset time condition, the command word with the maximum confidence degree is output.

Further, before labeling the corresponding modeling unit to the voice data, the method further includes:

and modeling the pinyin with tones corresponding to the voice data by adopting initials, finals and tones to generate a plurality of modeling units.

Further, before determining the acoustic feature to be processed according to the speech to be recognized, the method further includes:

and carrying out noise reduction processing on the voice to be recognized.

Further, the neural network classification model is

Deep feedforward sequence memory neural network.

The embodiment of the present application provides a customizable low-latency command word recognition device, including:

the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a voice to be recognized and determining an acoustic feature to be processed according to the voice to be recognized;

the recognition module is used for inputting the acoustic features into a pre-constructed neural network classification model for recognition, and obtaining the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;

the calculation module is used for calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability;

and the output module is used for judging whether to output the command word according to the confidence coefficient and the time point.

Further, the method also comprises the following steps:

the building module is used for building a neural network classification model;

the building of the neural network classification model comprises the following steps:

acquiring acoustic features corresponding to the voice data;

and iteratively training the acoustic features corresponding to the voice data by adopting a time sequence classification loss function based on the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong and the modeling units corresponding to the voice data to generate a neural network classification model.

By adopting the technical scheme, the invention can achieve the following beneficial effects:

the invention provides a customizable low-delay command word recognition method and device, which adopts a neural network model to model all toned pinyin in Chinese, combines the advantage of high differentiation of posterior probability output by a connection time sequence classification criterion, and adopts a simple and efficient scoring mechanism to provide the customizable low-delay command word recognition method. The method comprises the steps of modeling pinyin information of voice signals by using a deep feedforward sequence memory neural network (DFSMN) and a connection timing sequence classification criterion (CTC), training a model by using mass voice data, and recognizing a command word list by using the trained model. In addition, the invention adopts a simple and efficient scoring mechanism to complete the task of identifying the low-delay command word list. Aiming at the requirement of changing the command words, the invention only needs to provide the toned pinyin information corresponding to the command words without retraining the model, thereby greatly reducing the development cost and the time cost of the command word recognition system.

The confidence coefficient calculation method provided by the invention for the CTC model has extremely low calculation complexity and space complexity, higher accuracy and lower false wake-up rate, and in addition, has lower time delay and can detect whether command words appear in real time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram illustrating steps of a customizable low-latency command word recognition method according to the present invention;

FIG. 2 is a schematic flow chart illustrating a customizable low-latency command word recognition method according to the present invention;

FIG. 3 is a schematic structural diagram of a customizable low-latency command word recognition apparatus according to the present invention;

fig. 4 is a schematic structural diagram of a computer device in a hardware operating environment related to the customizable low-latency command word recognition method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.

A specific customizable low-latency command word recognition method and apparatus provided in the embodiments of the present application are described below with reference to the accompanying drawings.

As shown in fig. 1, the customizable low-latency command word recognition method provided in the embodiment of the present application includes:

s101, acquiring a voice to be recognized, and determining an acoustic feature to be processed according to the voice to be recognized;

it is understood that speech is uttered by the user; for example, a user can say that "open the refrigerator" to the intelligent household refrigerator, then "open the refrigerator" is the voice to be recognized, the voice to be recognized needs to be processed in the application to obtain the acoustic features of the voice to be recognized, wherein the processing mode can be realized by adopting the prior art, for example, the voice to be recognized is subjected to preprocessing, windowing, FFT conversion, mel filter and other steps to extract the acoustic features of the voice to be recognized. Wherein the preprocessing may be a sound denoising processing.

S102, inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;

the method trains a neural network classification model in advance, and then inputs the obtained acoustic features into the neural network classification model for calculation to obtain the posterior probability of a modeling unit (Pinyin with tone). Among them, the posterior probability is one of the basic concepts of information theory. In a communication system, the probability that a message is transmitted after being received is known by the receiver as the a posteriori probability.

S103, calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability;

the confidence coefficient calculation method is provided for the neural network classification model, so that the calculation complexity and the space complexity are reduced, and the accuracy is high.

And S104, judging whether to output the command word according to the confidence coefficient and the time point.

Finally, decision making is carried out according to the confidence degree, and an output command word is output, for example, "open the refrigerator" is output.

The customizable low-delay command word recognition method has the working principle that: referring to fig. 2, firstly, a neural network classification model is constructed, then, a speech to be recognized is obtained, and acoustic features to be processed are determined according to the speech to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones; calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability; and judging whether to output the command word according to the confidence coefficient and the time point. Wherein the neural network classification model is a deep feed-forward sequence memory neural network (DFSMN). It is understood that the neural network classification model in the present application may also adopt LSTM, GRU, etc., and the present application is not limited thereto. The modeling unit is constructed based on initials, finals and tones, so that a better Chinese character recognition function can be obtained, the recognition accuracy is improved, and the recognition error is reduced.

In some embodiments, further comprising: constructing a neural network classification model, wherein the constructing of the neural network classification model comprises the following steps:

acquiring acoustic features corresponding to the voice data;

Specifically, the neural network classification model is constructed in advance, the specific process is that voice data are collected and marked, the collected and marked voice data are subjected to preprocessing, windowing, FFT (fast Fourier transform), Mel filter and other steps, acoustic features for model training are extracted, the acoustic features corresponding to the voice data are input into the neural network for training, the posterior probability of each modeling unit to which the acoustic features corresponding to the voice data belong is obtained, namely, the marked toned pinyin is used as output, a time sequence classification loss function (CTC) is adopted for iterative training of the acoustic features corresponding to the voice data, training of model parameters is completed through a deep learning method under mass data, and the usable classification model of the deep feedforward sequence memory neural network is obtained.

In some embodiments, the present application calculates the posterior probability of the toned pinyin obtained by using the neural network classification model, calculates the confidence corresponding to each command word, that is, the probability that each command word may appear, and assumes that the command word includes n toned pinyins (because the CTC model has better robustness, the present invention adopts the original posterior probability and omits the step of posterior smoothing).

The calculation formula for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

wherein p is_ikRepresenting the posterior probability corresponding to the ith modeling unit at the time point k; h is_maxT-window _ size represents the starting point of command word detection; window _ size represents the time window for command word detection, taking the average duration of the command words; t is t_iRepresenting a time point with the maximum posterior probability corresponding to the ith modeling unit in the command word detection time window; f (t) represents confidence; n represents the number of modeling units included in the command word.

In some embodiments, the determining whether to output the command word according to the confidence and the time point includes:

comparing the confidence with a preset threshold;

Preferably, if the confidence of a plurality of command words is greater than or equal to a preset threshold and a preset condition of time is met, the command word with the highest confidence is output.

Specifically, when the confidence is greater than or equal to a preset threshold and the time sequence is satisfied, the command word is output, and if a plurality of command words simultaneously satisfy the above condition, the command word with the highest confidence is output. That is, when a command word is detected at time t, the command word needs to satisfy the following conditions:

f(t)≥threshold (3)

t₁≤t₂≤…≤t_n (4)

wherein, threshold is a preset threshold value t₁For the time point, t, corresponding to the first modeling unit in the command word₂For the time point, t, corresponding to the second modeling unit in the command word_nAnd the time point corresponding to the nth modeling unit in the command word is obtained. For example, if the command word with the highest confidence is "open refrigerator", the command word is first converted into the toned pinyin da3 kai1 bin 1 xiang1 ", wherein the time point corresponding to the modeling unit" da3 "is t₁The time point corresponding to the modeling unit "kai 1" is t₂The time point corresponding to the modeling unit "bin 1" is t₃The time point corresponding to the modeling unit "xiang 1" is t₄At this time, t must be satisfied₁≤t₂≤t₃≤t₄If the modeling unit contained in the command word 'open the refrigerator' meets the preset time condition, the command of 'open the refrigerator' is output. And if the confidence degrees of the plurality of command words are larger than a preset threshold value and the time points of the modeling units contained in the plurality of command words meet a preset time condition, outputting the command word corresponding to the maximum confidence degree in the confidence degrees of the plurality of command words.

In some embodiments, before labeling the corresponding modeling unit for the voice data, the method further includes:

The modeling unit in the application is the pinyin with tones, and comprises initials, finals and tones, so that the character recognition accuracy can be improved.

Preferably, before determining the acoustic feature to be processed according to the speech to be recognized, the method further includes:

and carrying out noise reduction processing on the voice to be recognized.

Before the acoustic features of the voice to be recognized are extracted, denoising processing is carried out on the voice to be recognized, and noise interference is removed.

As shown in fig. 3, an embodiment of the present application provides a customizable low-latency command word recognition apparatus, including:

an obtaining module 301, configured to obtain a speech to be recognized, and determine an acoustic feature to be processed according to the speech to be recognized;

the identification module 302 is configured to input the acoustic features into a pre-constructed neural network classification model for identification, and obtain a posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones;

the calculating module 303 is configured to calculate, according to the posterior probability, a confidence corresponding to each command word and a time point at which a modeling unit included in the command word appears;

and the output module 304 is configured to determine whether to output the command word according to the confidence and the time point.

The customizable low-delay command word recognition device provided by the application has the working principle that the acquisition module 301 acquires a voice to be recognized and determines an acoustic feature to be processed according to the voice to be recognized; the recognition module 302 inputs the acoustic features into a pre-constructed neural network classification model for recognition, and obtains the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones; the calculating module 303 calculates the confidence corresponding to each command word and the time point of occurrence of the modeling unit included in the command word according to the posterior probability; the output module 304 determines whether to output the command word according to the confidence and the time point.

Preferably, the customizable low-latency command word recognition device provided by the present application further includes:

the building module is used for building a neural network classification model;

acquiring acoustic features corresponding to the voice data;

The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;

the memory is used for storing a computer program, and the computer program is used for executing the customizable low-latency command word recognition method provided by any one of the above embodiments;

the processor is used to call and execute the computer program in the memory. The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The computer device stores an operating system, and the memory is an example of a computer-readable medium. The computer program, when executed by the processor, causes the processor to perform a customizable low-latency command word recognition method, such as the one shown in fig. 4, which is a block diagram of only a portion of the structure associated with the present solution and does not constitute a limitation on the computing device to which the present solution applies, and a particular computing device may include more or less components than shown in the figures, or combine certain components, or have a different arrangement of components.

In one embodiment, the customizable low-latency command word recognition method provided herein may be implemented in the form of a computer program that is executable on a computer device such as the one shown in fig. 4.

In some embodiments, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring a voice to be recognized, and determining an acoustic feature to be processed according to the voice to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones; calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability; and judging whether to output the command word according to the confidence coefficient and the time point.

The present application also provides a computer storage medium, examples of which include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassette storage or other magnetic storage devices, or any other non-transmission medium, that can be used to store information that can be accessed by a computing device.

In some embodiments, the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor causes the processor to perform the steps of: acquiring a voice to be recognized, and determining an acoustic feature to be processed according to the voice to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones; calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability; and judging whether to output the command word according to the confidence coefficient and the time point.

In summary, the present invention provides a customizable low-latency command word recognition method and apparatus, including obtaining a speech to be recognized, and determining an acoustic feature to be processed according to the speech to be recognized; inputting the acoustic features into a pre-constructed neural network classification model for identification, and acquiring the posterior probability of each modeling unit to which the acoustic features belong; the modeling unit is a toned pinyin and comprises initials, finals and tones; calculating the confidence corresponding to each command word and the time point of the appearance of the modeling unit contained in the command word according to the posterior probability; and judging whether to output the command word according to the confidence coefficient and the time point. The invention can model all the toned pinyin in the Chinese, adopts a simple and efficient scoring mechanism, completes the identification task of a low-delay command word list, and greatly reduces the development cost and the time cost of command word identification. The confidence coefficient calculation method adopted by the invention has extremely low calculation complexity and space complexity, higher accuracy and lower false awakening rate, and in addition, has lower time delay and can detect whether the command word appears in real time.

It is to be understood that the embodiments of the method provided above correspond to the embodiments of the apparatus described above, and the corresponding specific contents may be referred to each other, which is not described herein again.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A customizable low-latency command word recognition method, comprising:

2. The method of claim 1, further comprising: constructing a neural network classification model, wherein the constructing of the neural network classification model comprises the following steps:

acquiring acoustic features corresponding to the voice data;

3. The method according to claim 1 or 2, wherein the calculation formula for calculating the confidence corresponding to each command word according to the posterior probability is as follows:

4. The method of claim 1, wherein determining whether to output the command word according to the confidence level and the time point comprises:

comparing the confidence with a preset threshold;

5. The method of claim 4,

and if the confidence degrees of a plurality of command words are larger than or equal to the preset threshold value and the time preset condition is met, outputting the command word with the maximum confidence degree.

6. The method of claim 2, wherein before labeling the corresponding modeling unit for the speech data, further comprising:

7. The method of claim 1, wherein prior to determining the acoustic features to be processed based on the speech to be recognized, further comprising:

and carrying out noise reduction processing on the voice to be recognized.

8. The method of claim 2, wherein the neural network classification model is

Deep feedforward sequence memory neural network.

9. A customizable, low-latency command word recognition device, comprising:

10. The apparatus of claim 9, further comprising:

the building module is used for building a neural network classification model;

acquiring acoustic features corresponding to the voice data;