CN112530410A - Command word recognition method and device - Google Patents

Command word recognition method and device

Info

Publication number
CN112530410A
CN112530410A
Authority
CN
China
Prior art keywords
acoustic information
acoustic
neural network
voice
command word
Prior art date
Legal status
Pending
Application number
CN202011547352.0A
Other languages
Chinese (zh)
Inventor
单长浩
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202011547352.0A
Publication of CN112530410A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/088 - Word spotting

Abstract

Disclosed are a command word recognition method, a device, a computer-readable storage medium and an electronic device. The method comprises: extracting speech features from a speech signal; determining multi-dimensional acoustic features corresponding to the speech features based on a first neural network; determining first acoustic information corresponding to the multi-dimensional acoustic features based on a second neural network; processing the first acoustic information based on an attention mechanism to obtain second acoustic information based on historical acoustic information; obtaining a phoneme probability distribution of the speech features according to the first acoustic information and the second acoustic information; and determining a command word in the speech signal based on the phoneme probability distribution. Without increasing memory footprint or computation, the method and device effectively improve the command word recognition rate, reduce the command word false alarm rate, and deliver good command word recognition performance.

Description

Command word recognition method and device
Technical Field
The present application relates to the field of speech recognition technology, and more particularly, to a command word recognition method and device.
Background
As a common human-computer interaction technology, speech recognition is widely applied in electronic products; its natural and convenient interaction has made it popular in the market, and it has gradually become one of the mainstream interaction and control modes of the smart-product era. Command word recognition is an important aspect of speech recognition: it recognizes a command word in a speech signal so that the corresponding command can be executed.
Command word recognition must maintain a high recognition rate and a low false alarm rate while using little memory and computation. However, existing command word recognition devices cannot achieve high performance within these memory and computation budgets, so their recognition rate and false alarm rate struggle to meet practical requirements.
Disclosure of Invention
The present application is proposed to solve the above technical problems. Embodiments of the application provide a command word recognition method, a device, a computer-readable storage medium and an electronic device in which speech features are processed through a first neural network and a second neural network and an attention mechanism is introduced, so that the command word recognition rate is effectively improved, the command word false alarm rate is reduced, and processing performance is effectively improved.
According to a first aspect of the present application, there is provided a command word recognition method including:
extracting speech features from a speech signal;
determining multi-dimensional acoustic features corresponding to the speech features based on a first neural network;
determining first acoustic information corresponding to the multi-dimensional acoustic features based on a second neural network;
processing the first acoustic information based on an attention mechanism to obtain second acoustic information based on historical acoustic information;
obtaining a phoneme probability distribution of the speech features according to the first acoustic information and the second acoustic information;
determining a command word in the speech signal based on the phoneme probability distribution.
According to a second aspect of the present application, there is provided a command word recognition apparatus including:
a feature extraction module configured to extract speech features from a speech signal;
a first neural network module configured to determine multi-dimensional acoustic features corresponding to the speech features;
a second neural network module configured to determine first acoustic information corresponding to the multi-dimensional acoustic features;
an attention mechanism module configured to process the first acoustic information to obtain second acoustic information based on historical acoustic information;
a phoneme probability distribution acquisition module configured to obtain a phoneme probability distribution of the speech features according to the first acoustic information and the second acoustic information;
a decoder module configured to determine a command word in the speech signal based on the phoneme probability distribution.
According to a third aspect of the present application, there is provided a computer-readable storage medium storing a computer program for executing the above-described command word recognition method.
According to a fourth aspect of the present application, there is provided an electronic apparatus comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the command word recognition method described above.
Compared with the prior art, the command word recognition method, device, computer-readable storage medium and electronic device provided by the present application have the following beneficial effects: the multi-dimensional acoustic features are processed sequentially by the first neural network and the second neural network to obtain more accurate first acoustic information; an attention mechanism is introduced to process the first acoustic information and obtain more accurate second acoustic information; and both the first and the second acoustic information are considered when obtaining the phoneme probability distribution of the speech features. The command word recognition rate is thus effectively improved and the command word false alarm rate reduced, giving good command word recognition performance without increasing memory footprint or computation.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating a command word recognition method according to an exemplary embodiment of the present application.
Fig. 2 is a schematic flow chart of step 10 in the embodiment shown in fig. 1.
Fig. 3 is a first flowchart of step 30 in the embodiment shown in fig. 1.
Fig. 4 is a schematic structural diagram of the time-delay neural network corresponding to step 30 in the embodiment shown in fig. 3.
FIG. 5 is a second flowchart illustrating step 30 of the embodiment shown in FIG. 1.
Fig. 6 is a schematic structural diagram of the time-delay neural network corresponding to step 30 in the embodiment shown in fig. 5.
Fig. 7 is a schematic flow chart of step 50 in the embodiment shown in fig. 1.
Fig. 8 is a schematic diagram of a command word recognition device according to an exemplary embodiment of the present application.
Fig. 9 is a block diagram of an electronic device provided in an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are merely some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited to the example embodiments described herein.
Summary of the application
When a user interacts with a smart product by voice, speech recognition must be performed on the user's speech. Command word recognition is an important aspect of this: by recognizing command words in the user's speech, the corresponding commands can be executed, realizing voice control of the smart product.
Command word recognition occupies memory for its computation: the more memory it occupies, the higher the hardware requirements, and the larger the computation, the longer the recognition process takes. Reducing the memory and computation used for command word recognition while keeping a high recognition rate and low false recognition is therefore the direction in which recognition performance is improved.
Currently, the neural network used as the acoustic model for command word recognition is generally a CDNN (convolutional neural network + deep neural network), TDNN (time-delay neural network), CTDNN (convolutional neural network + time-delay neural network) or the like, and a decoder then decodes its output into the text corresponding to the speech to determine whether the speech signal contains a command word. However, existing acoustic models for command word recognition cannot achieve high performance within low memory and computation budgets, so the command word recognition rate and false alarm rate struggle to meet practical requirements.
This embodiment provides a new command word recognition method: speech features are extracted from the collected speech signal, a new acoustic model produces a phoneme probability distribution for those features, and the command word in the speech signal is determined from that distribution. The recognition rate is thereby effectively improved and the false alarm rate reduced, giving good command word recognition performance without increasing memory footprint or computation.
Having described the basic concepts of the present application, various non-limiting embodiments of the present solution are described in detail below with reference to the accompanying drawings.
Exemplary method
Fig. 1 is a flowchart illustrating a command word recognition method according to an exemplary embodiment of the present application.
This embodiment can be applied to an electronic device, in particular a server or a general-purpose computer. As shown in fig. 1, a command word recognition method provided in an exemplary embodiment of the present application includes at least the following steps:
step 10: speech features in the speech signal are extracted.
Specifically, sound may be collected by a pre-arranged microphone or microphone array and a corresponding speech signal generated from the collected sound. The speech signal may also have been acquired in advance rather than in real time; this is not limited here. Because the obtained speech signal is usually a time-series signal of indefinite duration and is not suitable as the direct input of a neural network, speech features must be extracted from it to obtain a speech feature sequence, which comprises a number of feature vectors, each corresponding to the speech features of a segment of speech of preset length.
Step 20: and determining multi-dimensional acoustic features corresponding to the voice features based on the first neural network.
The first neural network may be a convolutional neural network (CNN). As a deep network structure with strong nonlinear transformation capability, a CNN works well for extracting speech features: it can extract acoustic information from the speech features along multiple dimensions, yielding richer and more detailed multi-dimensional acoustic features. The richness of the extracted acoustic information also promotes model convergence during training. In this embodiment, time-domain convolution is adopted to reduce the computation of the convolutional neural network. In other embodiments the first neural network may be another type of neural network; this is not limited here.
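By way of illustration, the following is a minimal PyTorch sketch of such a time-domain convolution; the 40-dimensional input features, 64 output channels and kernel width are illustrative assumptions rather than values specified above.

```python
import torch
import torch.nn as nn

# One time-domain (1-D) convolution over the speech feature sequence.
conv = nn.Conv1d(in_channels=40, out_channels=64, kernel_size=3, padding=1)
speech_features = torch.randn(2, 40, 100)    # (batch, feature_dim, frames), e.g. 40-dim FBank
multi_dim_acoustic = conv(speech_features)   # (2, 64, 100): multi-dimensional acoustic features
print(multi_dim_acoustic.shape)
```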
Step 30: and determining first acoustic information corresponding to the multi-dimensional acoustic features based on a second neural network.
The second neural network can exploit the context of the current multi-dimensional acoustic features: its hidden-layer features depend not only on the multi-dimensional acoustic features input at the current moment but also on those input at past and future moments, so more accurate acoustic information can be derived from the input multi-dimensional acoustic feature sequence and output as the first acoustic information.
Step 40: based on an attention mechanism, the first acoustic information is processed to obtain second acoustic information based on historical acoustic information.
When processing the first acoustic information through the attention mechanism, attention weights are computed from the current first acoustic information together with the surrounding historical and future acoustic information, allowing the network to focus on specific regions and thereby obtain more accurate second acoustic information.
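By way of illustration, the following is a minimal PyTorch sketch of such an attention step, assuming a simple dot-product attention in which each frame of the first acoustic information attends over itself and the preceding (historical) frames; the exact attention form, the feature dimension, and whether future frames are included are design choices not fixed by the description above.

```python
import torch
import torch.nn as nn


class HistoryAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, first_acoustic):               # (batch, frames, dim)
        q, k, v = self.query(first_acoustic), self.key(first_acoustic), self.value(first_acoustic)
        scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5)
        # Restrict each frame to itself and its history (one possible design choice).
        frames = first_acoustic.size(1)
        mask = torch.triu(torch.ones(frames, frames), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)      # attention weights
        return weights @ v                           # second acoustic information


if __name__ == "__main__":
    attention = HistoryAttention(64)
    second_acoustic_info = attention(torch.randn(2, 100, 64))
    print(second_acoustic_info.shape)                # torch.Size([2, 100, 64])
```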
Step 50: and acquiring phoneme probability distribution of the voice characteristics according to the first acoustic information and the second acoustic information.
After the first acoustic information and the second acoustic information are acquired, they are combined and processed to obtain a phoneme probability distribution. A phoneme is the smallest speech unit divided according to the natural attributes of speech; for each speech feature, the phoneme probability distribution gives the probability that the feature corresponds to each phoneme.
Step 60: determining a command word in the speech signal based on the phoneme probability distribution.
After the phoneme probability distribution for each speech feature is obtained, phonemes meeting a preset rule are selected from that distribution as the output result for the feature. The phonemes for all speech features form a phoneme sequence, from which the corresponding word sequence is obtained as the text of the speech, and the text is checked for a command word. If the recognition result is a command word, the command word is output; otherwise a non-command word is output.
The command word recognition method provided by this embodiment has at least the following beneficial effects: the multi-dimensional acoustic features are processed sequentially by the first and second neural networks to obtain more accurate first acoustic information; an attention mechanism is introduced to process the first acoustic information and obtain more accurate second acoustic information; and both kinds of acoustic information are considered when obtaining the phoneme probability distribution of the speech features. The command word recognition rate is thus effectively improved and the false alarm rate reduced, giving good command word recognition performance without increasing memory footprint or computation.
Fig. 2 is a flowchart illustrating step 10 of a command word recognition method according to an exemplary embodiment of the present application. As shown in fig. 2, the step of extracting speech features from the speech signal in an exemplary embodiment of the present application includes at least the following steps:
step 101: and performing acoustic processing on the acquired voice signal to acquire processed voice.
In this embodiment, the speech signal may be collected through a microphone placed at a preset position so that the user's speech can be acquired at any time. The original speech signal typically contains various kinds of noise that interfere with it significantly. To improve the accuracy of subsequent feature extraction, acoustic processing such as speech enhancement, including noise reduction, dereverberation and echo cancellation, is performed on the acquired original speech signal to obtain the processed speech. Speech enhancement extracts the useful speech signal from a noisy background when the signal is disturbed or even submerged by noise, suppressing the interference so that cleaner and more reliable acoustic features can be extracted. Depending on the characteristics of the speech and the noise, methods such as spectral subtraction, Wiener filtering and Kalman filtering may be used.
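By way of illustration, the following is a minimal NumPy sketch of spectral subtraction, one of the speech-enhancement methods mentioned above, assuming the first few frames of the recording contain only noise; the frame size, overlap and noise-estimation window are illustrative assumptions.

```python
import numpy as np


def spectral_subtraction(signal, frame_len=400, noise_frames=10):
    hop = frame_len // 2
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = np.fft.rfft(frames, axis=-1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)   # noise estimate from leading frames
    clean_mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)  # subtract the noise magnitude
    clean_spectra = clean_mag * np.exp(1j * np.angle(spectra))
    out = np.zeros(len(signal))                               # overlap-add resynthesis
    for n, spec in enumerate(clean_spectra):
        out[n * hop: n * hop + frame_len] += np.fft.irfft(spec, frame_len) * window
    return out


if __name__ == "__main__":
    enhanced = spectral_subtraction(np.random.randn(16000))   # one second of dummy audio at 16 kHz
    print(enhanced.shape)
```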
Step 102: and extracting voice features of the processed voice to obtain a plurality of multidimensional vectors of a time sequence, wherein each multidimensional vector is an acoustic feature of the voice frequency with a preset time length.
Since the speech signal contains rich information, such as phonemes, prosody, language, speech content, and the like, it is necessary to extract acoustic features from the speech signal, and common acoustic features include Mel-frequency cepstrum coefficients (MFCC), perceptual linear prediction coefficients (PLP), filter banks (FBank), speech spectrogram, and the like. In the present embodiment, the extracted acoustic features may be set as needed.
For example, the acoustic features may be FBank features extracted from the processed speech. FBank features are essentially log spectral features: they contain both low- and high-frequency information, are processed by a Mel filter bank, and are compressed according to the auditory characteristics of the human ear, suppressing redundant information that hearing cannot perceive. FBank extraction comprises the following steps: (1) The speech signal is pre-emphasized, framed and windowed, and a short-time Fourier transform yields its spectrum. Pre-emphasis compensates the amplitude of the high-frequency part of the signal. For framing, speech is assumed to be short-time stationary (a 10 ms-30 ms segment is a quasi-steady-state process), so the signal is divided into frames with a frame length of, e.g., 20 ms or 25 ms and a frame shift of, e.g., 10 ms. Commonly used window functions include the rectangular, Hamming, Hanning and Blackman windows. (2) The spectrum from step (1) is squared and the energy within each filter band is summed to obtain the power output by each filter. (3) The logarithm of each filter's output power gives the log power spectrum of the corresponding band.
For another example, MFCC extraction comprises: (1) The speech signal is pre-emphasized, framed and windowed as described above, and a short-time Fourier transform yields its spectrum. (2) The spectrum from step (1) is squared and the energy within each filter band is summed to obtain the power output by each filter. (3) The logarithm of each filter's output power gives the log power spectrum of the corresponding band, and an inverse discrete cosine transform yields a number of MFCC coefficients. (4) Further calculation gives the MFCC feature values, which can be used as static features; first- and second-order differences of the static features then give the corresponding dynamic features.
After the above feature extraction, a multi-frame time series of multi-dimensional vectors is obtained, where each multi-dimensional vector is an acoustic feature of audio of a preset duration (e.g., 25 ms).
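By way of illustration, the following is a minimal NumPy sketch of the FBank extraction steps described above (pre-emphasis, framing, windowing, short-time Fourier transform, Mel filter bank, logarithm); the 40-filter bank size, FFT length and sampling rate are illustrative assumptions.

```python
import numpy as np


def fbank(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_filters=40, n_fft=512):
    # (1) Pre-emphasis to boost the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # (1) Framing and Hamming windowing.
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)

    # (2) Power spectrum via the short-time Fourier transform.
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # (2) Triangular Mel filter bank applied to each frame's power spectrum.
    high_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_points = 700 * (10 ** (np.linspace(0.0, high_mel, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbanks = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbanks[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbanks[m - 1, k] = (right - k) / max(right - center, 1)

    # (3) Log of the per-filter energies gives the FBank features.
    energies = np.maximum(power @ fbanks.T, np.finfo(float).eps)
    return np.log(energies)                       # shape: (n_frames, n_filters)


if __name__ == "__main__":
    features = fbank(np.random.randn(16000))      # one second of dummy audio
    print(features.shape)
```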
Fig. 3 is a flowchart illustrating step 30 of the command word recognition method according to an exemplary embodiment of the present application. In this embodiment, the second neural network may be a time-delay neural network (TDNN) with a multilayer structure. Each layer of the TDNN comprises a fully connected layer, a ReLU activation function and a layer normalization network, and the input of each layer comes from the layer below. After the data passes through the multilayer structure, the resulting first acoustic information is output.
As shown in fig. 3, in an exemplary embodiment of the present application, the step of determining the first acoustic information corresponding to the multi-dimensional acoustic features based on the second neural network includes at least the following steps:
step 301: and normalizing the multi-dimensional acoustic features by adopting a layer normalization network to acquire first intermediate information.
The neural network relates to the superposition of a multi-layer network structure, the parameter updating of each layer can cause the distribution of input data of an upper layer to change, the input distribution of the upper layer can be changed very severely through the layer-by-layer superposition, if normalization processing is not carried out, the distribution of each training data is different, and the convergence of the neural network is difficult. In order to reduce the influence of distribution change, normalization processing needs to be performed on input data, and also needs to be added in an intermediate layer of the neural network, and the layer normalization network can effectively promote model convergence in the training process of the time delay neural network.
Step 302: and carrying out nonlinear processing on the first intermediate information by adopting an activation function to obtain second intermediate information.
When the first intermediate information corresponding to the multi-dimensional acoustic feature is processed by using the activation function, the first intermediate information can be mapped to the hidden layer feature space.
Step 303: and connecting the second intermediate information through a full connection layer to output first acoustic information.
Each node of the fully connected layers (FC) is connected with all nodes of the previous layer, and is used for integrating the extracted features of the previous layer to play a role of a classifier in the whole neural network. It should be understood that the order of the above steps can be adjusted as required, and is not limited to the above order.
The following example illustrates this. Referring to fig. 4, in one embodiment the time-delay neural network has a six-layer structure with the layers connected in sequence. The input to the first layer is the information node of the current frame (denoted 0); the input to the second layer splices the information nodes of the past frame, the current frame and the future frame (denoted (-1,0,1)); the input to the third layer likewise splices (-1,0,1); the input to the fourth layer splices the past third frame, the past frame and the future frame (denoted (-3,-1,1)); the input to the fifth layer splices the past second frame, the current frame and the future second frame (denoted (-2,0,2)); and the input to the sixth layer splices the past frame, the future frame and the future third frame (denoted (-1,1,3)). Each layer is processed through steps 301 to 303, and the first acoustic information is output after the last layer. In extracting the acoustic information, not only the information at the current moment but also the information before and after it is considered, so more accurate acoustic information is obtained. It should be understood that in other embodiments the number of layers of the time-delay neural network is not limited to the six-layer structure described above and may be any other number of layers; this is not limited here.
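By way of illustration, the following is a minimal PyTorch sketch of one such time-delay layer (frame splicing followed by layer normalization, ReLU and a fully connected layer, in the order of steps 301 to 303) and of a six-layer stack using the splice patterns from the example; the hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TDNNLayer(nn.Module):
    """One time-delay layer: frame splicing, then layer norm, ReLU and a fully connected layer."""

    def __init__(self, in_dim, out_dim, context):
        super().__init__()
        self.context = context                      # frame offsets, e.g. (-1, 0, 1)
        spliced_dim = in_dim * len(context)
        self.norm = nn.LayerNorm(spliced_dim)       # step 301: layer normalization
        self.act = nn.ReLU()                        # step 302: non-linear processing
        self.fc = nn.Linear(spliced_dim, out_dim)   # step 303: fully connected layer

    def forward(self, x):                           # x: (batch, frames, in_dim)
        max_off = max(abs(c) for c in self.context)
        padded = F.pad(x, (0, 0, max_off, max_off)) # pad along the frame axis
        frames = x.size(1)
        spliced = torch.cat(
            [padded[:, max_off + c: max_off + c + frames] for c in self.context],
            dim=-1)
        return self.fc(self.act(self.norm(spliced)))


if __name__ == "__main__":
    # Six layers with the splice patterns from the example above.
    contexts = [(0,), (-1, 0, 1), (-1, 0, 1), (-3, -1, 1), (-2, 0, 2), (-1, 1, 3)]
    tdnn = nn.Sequential(*[TDNNLayer(64, 64, c) for c in contexts])
    first_acoustic_info = tdnn(torch.randn(2, 100, 64))   # (batch, frames, features)
    print(first_acoustic_info.shape)
```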
Fig. 5 is another flowchart illustrating step 30 of the command word recognition method according to an exemplary embodiment of the present application. The second neural network may be a time-delay neural network (TDNN) with a multilayer structure; each layer comprises a fully connected layer, a ReLU activation function and a layer normalization network, and the input of each layer comes from the layer below. To prevent the lower-level acoustic information from being buried by the higher-level acoustic information, the time-delay neural network further comprises a residual network (ResNet). After the data passes through the multilayer structure, the resulting, more accurate first acoustic information is output.
As shown in fig. 5, in another exemplary embodiment of the present application, the step of determining the first acoustic information corresponding to the multi-dimensional acoustic features based on the second neural network includes at least the following steps:
step 301: and normalizing the multi-dimensional acoustic features by adopting a layer normalization network to acquire first intermediate information.
The neural network relates to the superposition of a multi-layer network structure, the parameter updating of each layer can cause the distribution of input data of an upper layer to change, the input distribution of the upper layer can be changed very severely through the layer-by-layer superposition, and if normalization processing is not carried out, the distribution of training data is different every time, so that the neural network is difficult to converge. In order to reduce the influence of distribution change, normalization processing needs to be performed on input data, and also needs to be added in an intermediate layer of the neural network, and the layer normalization network can effectively promote model convergence in the training process of the time delay neural network.
Step 302: and carrying out nonlinear processing on the first intermediate information by adopting an activation function to obtain second intermediate information.
When the first intermediate information corresponding to the multi-dimensional acoustic feature is processed by using the activation function, the first intermediate information can be mapped to the hidden layer feature space.
Step 303: and connecting the second intermediate information through a full connection layer to output first acoustic information.
Each node of the full connection layer is connected with all nodes of the previous layer and is used for integrating the extracted features of the previous layer to play a role of a classifier in the whole neural network.
Step 304: and inputting the acoustic information output by the full connection layer in the shallow time delay neural network into the deep time delay neural network by adopting a residual error network.
For nodes of a full connection layer of a shallow neural network, the nodes can be used as input of a next neural network, and can also be input into a deeper neural network through a residual error network to realize jump-layer input, so that acoustic information with lower dimensionality can be transmitted across layers, and information loss is avoided. It should be understood that the order of the above steps can be adjusted as required, and is not limited to the above order.
The following example illustrates this. Referring to fig. 6, in one embodiment the time-delay neural network has a six-layer structure with the layers connected in sequence. The input to the first layer is the information node of the current frame (denoted 0); the input to the second layer splices the information nodes of the past frame, the current frame and the future frame (denoted (-1,0,1)); the input to the third layer likewise splices (-1,0,1); the input to the fourth layer splices the past third frame, the past frame and the future frame (denoted (-3,-1,1)); the input to the fifth layer splices the past second frame, the current frame and the future second frame (denoted (-2,0,2)); and the input to the sixth layer splices the past frame, the future frame and the future third frame (denoted (-1,1,3)). When the output of the first layer is fed into the second layer, it is also fed into the fourth layer through the residual network; when the output of the second layer is fed into the third layer, it is also fed into the fifth layer; and when the output of the third layer is fed into the fourth layer, it is also fed into the sixth layer. Each layer is processed through steps 301 to 304, and the first acoustic information is output after the last layer. In extracting the acoustic information, not only the information at the current moment but also the information before and after it is considered, and the residual network carries lower-level acoustic information across layers so that information loss is avoided and more accurate acoustic information is obtained.
It should be understood that in other embodiments the number of layers of the time-delay neural network is not limited to the six-layer structure described above and may be any other number of layers. The positions at which the residual network joins the layers can likewise be set according to actual needs and are not limited to the case described above.
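By way of illustration, the following is a minimal PyTorch sketch of the skip-layer residual wiring described above (layer 1 to layer 4, layer 2 to layer 5, layer 3 to layer 6); frame splicing is omitted for brevity and the layer dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


def tdnn_layer(dim):
    # One time-delay layer reduced to its norm -> ReLU -> fully-connected core.
    return nn.Sequential(nn.LayerNorm(dim), nn.ReLU(), nn.Linear(dim, dim))


class ResidualTDNN(nn.Module):
    def __init__(self, dim=64, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList([tdnn_layer(dim) for _ in range(n_layers)])

    def forward(self, x):                          # x: (batch, frames, dim)
        outputs = []
        for i, layer in enumerate(self.layers):
            inp = outputs[-1] if outputs else x
            if i >= 3:                             # layers 4-6 also receive a skip input
                inp = inp + outputs[i - 3]         # residual from the layer three below
            outputs.append(layer(inp))
        return outputs[-1]                         # first acoustic information


if __name__ == "__main__":
    net = ResidualTDNN()
    print(net(torch.randn(2, 100, 64)).shape)      # torch.Size([2, 100, 64])
```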
Fig. 7 is a flowchart illustrating step 50 in a command word recognition method according to an exemplary embodiment of the present application. As shown in fig. 7, in an exemplary embodiment of the present application, the step of obtaining a phoneme probability distribution of the speech features according to the first acoustic information and the second acoustic information includes at least the following steps:
step 501: and combining the first acoustic information and the second acoustic information corresponding to each voice feature to obtain third acoustic information.
In this embodiment, the first acoustic information output by the time delay neural network has acoustic information with a higher dimension, the second acoustic information which is more accurate is obtained after the first acoustic information is processed based on an attention mechanism, and the third acoustic information is obtained by combining and splicing the first acoustic information and the second acoustic information together.
Step 502: and processing the third acoustic information to obtain phoneme probability distribution of the voice characteristics.
In this embodiment, the third acoustic information is used as an input of a phoneme probability distribution neural network, where the phoneme probability distribution neural network is obtained through pre-training, and the phoneme probability distribution can be obtained by processing the third acoustic information through the phoneme probability distribution neural network. For each speech feature in the sequence of speech features mentioned by the speech signal, a probability distribution of the phoneme corresponding to the speech feature can be obtained. The phoneme probability distribution is obtained by fully utilizing the first acoustic information and the second acoustic information, so that the phoneme probability distribution has higher precision.
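By way of illustration, the following is a minimal PyTorch sketch of steps 501 and 502, with a single fully connected layer followed by a softmax standing in for the pre-trained phoneme probability distribution network; the feature dimension and phoneme inventory size are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PhonemeDistribution(nn.Module):
    def __init__(self, acoustic_dim=64, n_phonemes=100):
        super().__init__()
        self.output = nn.Linear(2 * acoustic_dim, n_phonemes)

    def forward(self, first_info, second_info):
        # Step 501: combine (concatenate) the two kinds of acoustic information.
        third_info = torch.cat([first_info, second_info], dim=-1)
        # Step 502: map each frame to a probability distribution over phonemes.
        return torch.softmax(self.output(third_info), dim=-1)


if __name__ == "__main__":
    model = PhonemeDistribution()
    first = torch.randn(2, 100, 64)
    second = torch.randn(2, 100, 64)
    probs = model(first, second)
    print(probs.shape, probs[0, 0].sum())   # (2, 100, 100); each frame sums to ~1
```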
Further, determining a command word in the speech signal based on the phoneme probability distribution in step 60 comprises: according to the phoneme probability distributions of the speech features, taking the decoding path with the maximum probability as the recognition result; if the recognition result is a command word, the command word is output, otherwise a non-command word is output.
After the speech signal is obtained and speech features are extracted from it frame by frame, a number of speech features are obtained, each with its own phoneme probability distribution. During decoding, a Viterbi search is performed in the decoding graph and the path with the maximum probability is found from the phoneme probability distributions; if that path contains a command word, the command word is output, otherwise a non-command word is output.
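By way of illustration, the following is a highly simplified decoding sketch that replaces the Viterbi search over a decoding graph with a per-frame argmax followed by a match against a command-word lexicon; the phoneme labels and lexicon are purely hypothetical placeholders.

```python
import numpy as np

COMMAND_WORDS = {"turn on": ["t", "er", "n", "o", "n"]}   # hypothetical lexicon


def decode(phoneme_probs, phoneme_table):
    # phoneme_probs: (frames, n_phonemes); phoneme_table: index -> phoneme label.
    best = np.argmax(phoneme_probs, axis=-1)
    # Collapse consecutive repeats into a phoneme sequence.
    collapsed = [phoneme_table[i] for n, i in enumerate(best)
                 if n == 0 or best[n] != best[n - 1]]
    for word, phones in COMMAND_WORDS.items():
        for start in range(len(collapsed) - len(phones) + 1):
            if collapsed[start:start + len(phones)] == phones:
                return word                     # command word recognised
    return None                                 # non-command word


if __name__ == "__main__":
    table = dict(enumerate(["sil", "t", "er", "n", "o", "a"]))
    print(decode(np.random.rand(50, len(table)), table))
```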
Exemplary devices
Based on the same concept as the method embodiments of the application, an embodiment of the application also provides a command word recognition device. Fig. 8 is a schematic structural diagram illustrating a command word recognition apparatus according to an exemplary embodiment of the present application.
The command word recognition apparatus includes a feature extraction module 71, a first neural network module 72, a second neural network module 73, an attention mechanism module 74, a phoneme probability distribution acquisition module 75, and a decoder module 76. The feature extraction module 71 is configured to extract speech features from the speech signal. The first neural network module 72, which may be a convolutional neural network module, is configured to determine the multi-dimensional acoustic features corresponding to the speech features. The second neural network module 73, which may be a time-delay neural network module, is configured to determine the first acoustic information corresponding to the multi-dimensional acoustic features. The attention mechanism module 74 is configured to process the first acoustic information to obtain second acoustic information based on historical acoustic information. The phoneme probability distribution acquisition module 75, serving as a multi-level output module, is configured to obtain the phoneme probability distribution of the speech features from the first and second acoustic information. The decoder module 76 is configured to determine a command word in the speech signal based on the phoneme probability distribution.
The embodiment of the application thus provides a new CATDNN network structure comprising a convolutional neural network (CNN), a time-delay neural network (TDNN), an attention mechanism (Attention), a residual network (ResNet) and a multi-head output module. The convolutional and time-delay neural network modules process the multi-dimensional acoustic features in turn to obtain more accurate first acoustic information; the residual network eases network convergence and improves performance; and the attention mechanism and multi-level output further improve the performance of the network structure. The command word recognition rate is thereby effectively improved and the command word false alarm rate reduced without increasing memory footprint or computation.
Further, the feature extraction module 71 includes an acoustic processing unit and a feature extraction unit. The acoustic processing unit performs acoustic processing on the acquired speech signal to obtain processed speech; the feature extraction unit extracts speech features from the processed speech to obtain a time series of multi-dimensional vectors, where each multi-dimensional vector is an acoustic feature of audio of a preset duration.
Further, the second neural network module 73 is a multilayer time-delay neural network in which each layer includes a fully connected layer, a ReLU activation function and a layer normalization network: the layer normalization network normalizes the multi-dimensional acoustic features to obtain first intermediate information; the activation function non-linearly processes the first intermediate information to obtain second intermediate information; and the fully connected layer processes the second intermediate information to output the first acoustic information. The second neural network module 73 further includes a residual network, which feeds the acoustic information output by the fully connected layer of a shallow layer into a deeper layer.
Further, the phoneme probability distribution acquisition module 75 includes an acoustic information acquisition unit and a phoneme probability distribution acquisition unit. The acoustic information acquisition unit is configured to combine the first acoustic information and the second acoustic information corresponding to each speech feature to obtain third acoustic information; the phoneme probability distribution acquisition unit is configured to process the third acoustic information to obtain the phoneme probability distribution of the speech features.
Exemplary electronic device
Fig. 9 illustrates a block diagram of an electronic device of an embodiment of the application.
As shown in fig. 9, the electronic device 80 includes one or more processors 801 and memory 802.
The processor 801 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 80 to perform desired functions.
Memory 802 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 801 to implement the command word recognition methods of the various embodiments of the present application described above and/or other desired functions.
In one example, the electronic device 80 may further include: an input device 803 and an output device 804, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, the input device 803 may be a microphone or a microphone array for capturing speech. The input device 803 may also be a communication network connector. The input device 803 may also include, for example, a keyboard, a mouse, and the like.
The output device 804 may output various information to the outside, and may include, for example, a display, a speaker, a printer, and a communication network and a remote output apparatus connected thereto.
Of course, for simplicity, only some of the components of the electronic device 80 relevant to the present application are shown in fig. 9, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 80 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the command word recognition method according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a command word recognition method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A command word recognition method, comprising:
extracting speech features from a speech signal;
determining multi-dimensional acoustic features corresponding to the speech features based on a first neural network;
determining first acoustic information corresponding to the multi-dimensional acoustic features based on a second neural network;
processing the first acoustic information based on an attention mechanism to obtain second acoustic information based on historical acoustic information;
obtaining a phoneme probability distribution of the speech features according to the first acoustic information and the second acoustic information;
determining a command word in the speech signal based on the phoneme probability distribution.
2. The method of claim 1, wherein the determining first acoustic information corresponding to the multi-dimensional acoustic features based on the second neural network comprises:
normalizing the multi-dimensional acoustic features by adopting a layer normalization network to acquire first intermediate information;
carrying out nonlinear processing on the first intermediate information by adopting an activation function to obtain second intermediate information;
and connecting the second intermediate information through a full connection layer to output first acoustic information.
3. The method of claim 2, wherein, when the second neural network is a time-delay neural network and the multi-dimensional acoustic features are processed using a multilayer time-delay neural network, the second neural network further comprises a residual network;
the determining, based on the second neural network, first acoustic information corresponding to the multi-dimensional acoustic features further includes:
and inputting the acoustic information output by the full connection layer in the shallow time delay neural network into the deep time delay neural network by adopting a residual error network.
4. The method of claim 1, wherein the obtaining a phoneme probability distribution of the speech features from the first acoustic information and the second acoustic information comprises:
combining the first acoustic information and the second acoustic information corresponding to each speech feature to obtain third acoustic information;
and processing the third acoustic information to obtain the phoneme probability distribution of the speech features.
5. The method of claim 4, wherein the determining a command word in the speech signal based on the phoneme probability distribution comprises: according to the phoneme probability distributions corresponding to the speech features, taking the decoding path with the maximum probability as the recognition result; if the recognition result is a command word, outputting the command word, and otherwise outputting a non-command word.
6. The method of claim 1, wherein the extracting speech features from the speech signal comprises:
performing acoustic processing on the acquired speech signal to obtain processed speech;
and extracting speech features from the processed speech to obtain a time series of multi-dimensional vectors, wherein each multi-dimensional vector is an acoustic feature of audio of a preset duration.
7. The method of claim 6, wherein in the step of acoustically processing the acquired speech signal to obtain processed speech, the acoustic processing includes at least one of speech noise reduction, reverberation cancellation, echo cancellation, and speech enhancement.
8. A command word recognition apparatus comprising:
a feature extraction module configured to extract speech features from a speech signal;
a first neural network module configured to determine multi-dimensional acoustic features corresponding to the speech features;
a second neural network module configured to determine first acoustic information corresponding to the multi-dimensional acoustic features;
an attention mechanism module configured to process the first acoustic information to obtain second acoustic information based on historical acoustic information;
a phoneme probability distribution acquisition module configured to obtain a phoneme probability distribution of the speech features according to the first acoustic information and the second acoustic information;
a decoder module to determine a command word in the speech signal based on the phoneme probability distribution.
9. A computer-readable storage medium storing a computer program for executing the command word recognition method of any one of claims 1 to 7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the command word recognition method of any one of claims 1 to 7.
CN202011547352.0A 2020-12-24 2020-12-24 Command word recognition method and device Pending CN112530410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011547352.0A CN112530410A (en) 2020-12-24 2020-12-24 Command word recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011547352.0A CN112530410A (en) 2020-12-24 2020-12-24 Command word recognition method and device

Publications (1)

Publication Number Publication Date
CN112530410A true CN112530410A (en) 2021-03-19

Family

ID=74976090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011547352.0A Pending CN112530410A (en) 2020-12-24 2020-12-24 Command word recognition method and device

Country Status (1)

Country Link
CN (1) CN112530410A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019159058A (en) * 2018-03-12 2019-09-19 国立研究開発法人情報通信研究機構 Speech recognition system, speech recognition method, learned model
JP2020020872A (en) * 2018-07-30 2020-02-06 国立研究開発法人情報通信研究機構 Discriminator, learnt model, and learning method
CN110335592A (en) * 2019-06-28 2019-10-15 腾讯科技(深圳)有限公司 Phoneme of speech sound recognition methods and device, storage medium and electronic device
CN111223488A (en) * 2019-12-30 2020-06-02 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111798840A (en) * 2020-07-16 2020-10-20 中移在线服务有限公司 Voice keyword recognition method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113306291A (en) * 2021-05-28 2021-08-27 曲阜市玉樵夫科技有限公司 Intelligent printing method, printing device, printer and intelligent printing system
CN113744736A (en) * 2021-09-08 2021-12-03 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium
CN113744736B (en) * 2021-09-08 2023-12-08 北京声智科技有限公司 Command word recognition method and device, electronic equipment and storage medium
CN114420101A (en) * 2022-03-31 2022-04-29 成都启英泰伦科技有限公司 Unknown language end-side command word small data learning and identifying method
CN114420101B (en) * 2022-03-31 2022-05-27 成都启英泰伦科技有限公司 Unknown language end-side command word small data learning and identifying method
CN117275484A (en) * 2023-11-17 2023-12-22 深圳市友杰智新科技有限公司 Command word recognition method, device, equipment and medium
CN117275484B (en) * 2023-11-17 2024-02-20 深圳市友杰智新科技有限公司 Command word recognition method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination