CN111816166A - Voice recognition method, apparatus, and computer-readable storage medium storing instructions - Google Patents

Voice recognition method, apparatus, and computer-readable storage medium storing instructions

Info

Publication number
CN111816166A
Authority
CN
China
Prior art keywords
input audio
features
layer
domain
voice recognition
Legal status
Pending
Application number
CN202010694750.9A
Other languages
Chinese (zh)
Inventor
黎吉国
许继征
张莉
王悦
马思伟
Current Assignee
Peking University
ByteDance Inc
Original Assignee
Peking University
ByteDance Inc
Application filed by Peking University and ByteDance Inc


Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06F18/253 Fusion techniques of extracted features
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum


Abstract

A voice recognition method, apparatus, and computer-readable storage medium storing instructions are provided. The voice recognition method includes the following steps: acquiring time-domain features of input audio; acquiring frequency-domain features of the input audio; and fusing the time-domain features of the input audio and the frequency-domain features of the input audio and performing voice recognition based on the fused features.

Description

Voice recognition method, apparatus, and computer-readable storage medium storing instructions
Technical Field
The present disclosure relates to the field of voice recognition technology, and in particular, to a voice recognition method and a voice recognition apparatus.
Background
Voice recognition is a technique that analyzes the sound emitted by an object and compares it with sounds in a sound database to determine which object produced it. Voice recognition has a variety of applications; for example, it may be applied to speaker recognition, biometric recognition, gender/age recognition, and the like. Speaker recognition, also known as voiceprint recognition, is a biometric technology that holds an important position in the field of speech processing because it can be widely applied to biometric authentication, security, and other fields. At present, the recognition performance of traditional voice recognition schemes is relatively limited and needs further improvement.
Disclosure of Invention
Embodiments of the present disclosure provide a voice recognition method to improve voice recognition performance.
According to an aspect of the present disclosure, there is provided a voice recognition method including: acquiring time-domain features of input audio; acquiring frequency-domain features of the input audio; and fusing the time-domain features of the input audio and the frequency-domain features of the input audio and performing voice recognition based on the fused features.
Optionally, fusing the time-domain features of the input audio and the frequency-domain features of the input audio may include: splicing and transforming the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features.
Optionally, splicing and transforming the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features may include: splicing the time-domain features of the input audio and the frequency-domain features of the input audio to obtain spliced features; and performing a two-layer fully connected layer transformation on the spliced features to obtain the fused features.
Optionally, splicing and transforming the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features may include: performing a one-layer fully connected layer transformation on the time-domain features of the input audio to obtain first transformed features; performing a one-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain second transformed features; splicing the first transformed features and the second transformed features to obtain spliced features; and performing a one-layer fully connected layer transformation on the spliced features to obtain the fused features.
Optionally, splicing and transforming the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features may include: performing a two-layer fully connected layer transformation on the time-domain features of the input audio to obtain third transformed features; performing a two-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain fourth transformed features; and splicing the third transformed features and the fourth transformed features to obtain the fused features.
According to another aspect of the present disclosure, there is provided a voice recognition apparatus including: a time-domain feature acquisition module configured to acquire time-domain features of input audio; a frequency-domain feature acquisition module configured to acquire frequency-domain features of the input audio; and a voice recognition module configured to fuse the time-domain features of the input audio and the frequency-domain features of the input audio and perform voice recognition based on the fused features.
Optionally, the voice recognition module may be configured to splice and transform the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features.
Optionally, the voice recognition module may be configured to: splice the time-domain features of the input audio and the frequency-domain features of the input audio to obtain spliced features; and perform a two-layer fully connected layer transformation on the spliced features to obtain the fused features.
Optionally, the voice recognition module may be configured to: perform a one-layer fully connected layer transformation on the time-domain features of the input audio to obtain first transformed features; perform a one-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain second transformed features; splice the first transformed features and the second transformed features to obtain spliced features; and perform a one-layer fully connected layer transformation on the spliced features to obtain the fused features.
Optionally, the voice recognition module may be configured to: perform a two-layer fully connected layer transformation on the time-domain features of the input audio to obtain third transformed features; perform a two-layer fully connected layer transformation on the frequency-domain features of the input audio to obtain fourth transformed features; and splice the third transformed features and the fourth transformed features to obtain the fused features.
According to another aspect of the present disclosure, there is provided a voice recognition apparatus including at least one computing device and at least one storage device storing computer instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the voice recognition method according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having instructions stored thereon, which, when executed on at least one computing device, cause the at least one computing device to perform a voice recognition method according to the present disclosure.
According to the voice recognition method and the voice recognition apparatus of the exemplary embodiments of the present disclosure, voice recognition is performed by jointly using the time-domain information and the frequency-domain information of a sound signal through fusion, so that the temporal and spectral information of the signal is fully exploited and voice recognition performance is improved. For example, when the voice recognition method and the voice recognition apparatus according to the exemplary embodiments of the present disclosure are applied to speaker recognition, voiceprint recognition is performed by jointly using the time-domain information and the frequency-domain information of the speech signal through fusion, which fully exploits the temporal and spectral information of the speech signal and improves speaker recognition performance.
In addition, according to the voice recognition method and the voice recognition apparatus of the exemplary embodiments of the present disclosure, the time-domain features and the frequency-domain features can be transformed together into the classification feature space in an early-fusion manner, so that the transformation jointly considers the features of both domains of the audio signal, and a good voice recognition effect can be achieved.
Drawings
These and/or other aspects and advantages of the present disclosure will become apparent and more readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIGS. 1a and 1b are schematic diagrams illustrating existing speaker recognition models.
Fig. 2 is a schematic diagram illustrating a Time-frequency network (TFN) model according to an exemplary embodiment of the present disclosure.
Fig. 3a, 3b and 3c show schematic diagrams of a fusion approach according to an exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a voice recognition method according to an exemplary embodiment of the present disclosure.
Fig. 5 illustrates a block diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure.
FIGS. 6a, 6b, 6c, and 6d illustrate schematic diagrams of a speaker recognition system.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
In the present disclosure, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
In the field of voice recognition, for example in speaker recognition applications, current speaker recognition models can be divided into time-domain models and frequency-domain models according to the type of input data, as illustrated in figs. 1a and 1b, which are schematic diagrams of existing speaker recognition models. As shown in fig. 1a, the time-domain model takes the raw speech waveform (a time-domain representation of the speech signal) as input; that is, the time-domain model performs speaker recognition using only the time-domain information of the raw speech. As shown in fig. 1b, the frequency-domain model takes a spectral signal (a frequency-domain representation of the speech signal) as input; that is, the frequency-domain model performs speaker recognition using only the frequency-domain information of the speech signal. Consequently, the performance of current speaker recognition models cannot be optimal. The frequency-domain model and the time-domain model are described in detail below.
Frequency domain model
Before deep neural networks (DNNs) came into use, most speaker recognition methods classified speech signals using frequency-domain features, for example with Gaussian mixture models (GMMs) or i-vector representations of speech segments. These methods rely on hand-crafted frequency-domain features such as filter banks (FBANK) or Mel-frequency cepstral coefficients (MFCC). With the widespread use of DNNs, DNNs have also been designed to automatically extract frequency-domain features for speaker recognition. However, all of these methods process only frequency-domain signals and ignore time-domain information.
For example, the gist of a conventional frequency-domain speaker recognition method may include one of the following: (1) combining a GMM supervector with a support vector machine (SVM) and deriving a linear kernel based on an approximation of the KL distance between two GMM models; (2) modeling speaker and channel variability and providing a new low-dimensional global representation of the voice, called the identity vector or i-vector, which is the basis of the frequency-domain approach for most speaker recognition; (3) based on the i-vector, generating frame alignments with a pre-trained DNN, which improves the equal error rate by 30% compared with traditional systems; (4) introducing the x-vector, a fixed-length global vector extracted by training a DNN on frequency features (such as FBANK) with data augmentation. These methods all use the spectral signal of the speech signal as input, for example Mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction (PLP) features, or linear prediction cepstral coefficients (LPCCs). Although these spectra also contain temporal information, their time resolution is significantly reduced compared with that of the raw speech signal because of the time-frequency transform, such as the short-time Fourier transform (STFT). Specifically, a window-based time-frequency transform (such as the short-time Fourier transform) slides a window over the signal with a given step size to produce time-frequency features, so the time resolution drops from N to N // step (where // denotes integer division). Therefore, conventional speaker recognition methods that use a spectrum as input cannot learn temporal characteristics well and thus cannot fully exploit time-domain information.
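As a concrete illustration of this resolution loss, the following minimal PyTorch sketch (the window and hop sizes are illustrative assumptions, not values from the present disclosure) shows how a short-time Fourier transform turns one second of 16 kHz audio into only about a hundred frames:

```python
import torch

N = 16000                      # one second of 16 kHz audio (16000 samples)
n_fft, hop = 400, 160          # 25 ms window, 10 ms hop (illustrative values)

waveform = torch.randn(N)
spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft), return_complex=True)

print(spec.shape)              # (n_fft // 2 + 1, N // hop + 1) = (201, 101) bins x frames
```

The time axis shrinks from 16000 samples to 101 frames, which is the drop from N to roughly N // step described above.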
Time domain model
Since convolutional neural networks (CNNs) have successfully solved large-scale image classification problems and exhibit powerful capabilities in modeling high-dimensional data, CNNs have also been used to solve the speaker recognition problem directly in the time domain. In recent years, end-to-end models designed to extract features directly from the raw speech waveform have demonstrated better performance than traditional methods that use only frequency-domain features. For example, the SincNet model achieves good speaker recognition performance by designing its first-layer filters as learnable band-pass filters.
However, recent deep-learning-based methods that take the raw speech signal as input cannot learn frequency characteristics well, because no frequency optimization or frequency transformation is applied in the framework. Specifically, the deep neural network learns many small convolution filters along the time axis, and the speech signal is classified using such a model. That is, in a conventional neural-network-based speaker recognition model, the filters are learned only from the raw speech signal, which has only a time axis; the frequency-domain signal is therefore ignored. These methods may include: (1) performing CNN-based speaker recognition in an end-to-end manner by extracting only the time-domain information of the raw speech; (2) combining a CNN with a long short-term memory (LSTM) network to extract global vectors from the raw speech signal for speaker recognition; (3) the SincNet model, which replaces the first CNN layer with learnable band-pass filters to obtain better interpretability and improve performance. Although the SincNet model uses learnable band-pass filters to exploit the frequency information of the speech signal, it still takes only the raw speech waveform as input and therefore cannot fully exploit frequency-domain information.
For sound signal analysis, however, both frequency-domain information and time-domain information are important, and lacking information from either domain prevents the best analysis results. To fully exploit time-domain features and frequency-domain information, the present disclosure proposes a completely new time-frequency network (TFN) model that takes both the raw audio waveform and the spectrum as inputs and learns time-domain and frequency-domain feature representations using shared or unshared branches. Specifically, the TFN model designed in the present disclosure may include a time-domain branch model for extracting time-domain information from the raw audio waveform, a frequency-domain branch model for extracting frequency-domain information from the spectral signal, and a fusion model that fuses the time-domain information and the frequency-domain information to perform voice recognition. The output of the TFN model may be a predicted distribution over the persons producing the sound (for example, when applied to speaker recognition, the output of the TFN model may be a predicted distribution over speakers). Hereinafter, a voice recognition method and a voice recognition apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to figs. 2 to 6d.
Fig. 2 is a schematic diagram illustrating a TFN model according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, the TFN model 200 may include three sub-models, i.e., a time-domain branch model 201, a frequency-domain branch model 202, and a fusion model 203.
The time-domain branch model 201 may be designed to extract time-domain features from the raw audio waveform of the input audio. The time-domain branch model 201 may be implemented using any available model that extracts time-domain features of the input audio, and the present disclosure is not limited thereto. For example, the time-domain branch model 201 may be implemented using a multi-layer CNN, a recurrent neural network (RNN), or the like to extract local time-domain features from the raw audio signal.
According to an exemplary embodiment of the present disclosure, the time-domain branch model 201 may be designed as a SincNet model (it may also be another common CNN or RNN model). In this case, the first layer of the time-domain branch model 201 is designed as a band-pass filter to model the frequency characteristics. Here, the band-pass filter may be expressed as formula (1) below.
g[n, f1, f2] = 2f2 sinc(2πf2n) - 2f1 sinc(2πf1n)    (1)
where g[n, f1, f2] denotes the output of the band-pass filter, n denotes the index over the kernel of the band-pass filter (its range is set by the kernel size), f1 denotes the lower cut-off frequency, f2 denotes the upper cut-off frequency, and sinc(x) = sin(x)/x.
By designing the first-layer filters of the time-domain branch model 201 as band-pass filters, the time-domain branch model 201 can be given fewer parameters and better interpretability.
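The following sketch builds the band-pass kernel of formula (1) in PyTorch; the kernel size and the normalized cut-off frequencies are illustrative assumptions rather than values taken from the present disclosure.

```python
import math
import torch

def sinc(x: torch.Tensor) -> torch.Tensor:
    # sinc(x) = sin(x) / x, with the removable singularity at x = 0 handled explicitly
    return torch.where(x == 0, torch.ones_like(x), torch.sin(x) / x)

def band_pass_kernel(kernel_size: int, f1: float, f2: float) -> torch.Tensor:
    # g[n, f1, f2] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), cf. formula (1)
    n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
    return 2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)

# f1 and f2 are normalized cut-off frequencies (cycles per sample); in SincNet they
# are learnable parameters, here they are fixed purely for illustration.
kernel = band_pass_kernel(kernel_size=251, f1=0.02, f2=0.05)
print(kernel.shape)   # torch.Size([251])
```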
In addition, the other layers of the time-domain branch model 201 may be typical one-dimensional convolutional layers (Conv), batch normalization layers (BN), and activation function layers (ReLU); after passing through several such convolution, batch normalization, and activation layers, the time-domain branch model 201 outputs the time-domain features.
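A minimal sketch of such a time-domain branch is shown below (PyTorch; the layer widths, kernel sizes, and output dimension are illustrative assumptions, and an ordinary Conv1d stands in for the band-pass first layer):

```python
import torch.nn as nn

class TimeBranch(nn.Module):
    """Raw waveform -> time-domain feature vector (illustrative layer sizes)."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=251, stride=10),   # stands in for the band-pass layer
            nn.BatchNorm1d(80), nn.ReLU(),
            nn.Conv1d(80, 128, kernel_size=5), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, out_dim, kernel_size=5), nn.BatchNorm1d(out_dim), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                         # pool over time
        )

    def forward(self, waveform):                  # waveform: (batch, 1, num_samples)
        return self.net(waveform).squeeze(-1)     # (batch, out_dim) time-domain features
```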
The frequency-domain branch model 202 may be designed to extract frequency-domain features from the spectrum of the input audio. According to an exemplary embodiment of the present disclosure, MFCC, PLP, LPCC, or similar features may be used as the spectrum from which frequency-domain features are extracted. The frequency-domain branch model 202 may also be implemented using any available model that extracts frequency-domain features from a spectral signal, and the present disclosure is not limited thereto. For example, the frequency-domain branch model 202 may be implemented using a one-dimensional or two-dimensional multi-layer CNN to extract frequency-domain features from the spectral signal, or it may be implemented using any GMM, DNN, or RNN.
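A corresponding minimal sketch of the frequency-domain branch, operating on a precomputed MFCC spectrum, might look as follows (the MFCC dimension and layer sizes are illustrative assumptions):

```python
import torch.nn as nn

class FreqBranch(nn.Module):
    """MFCC spectrum -> frequency-domain feature vector (illustrative layer sizes)."""
    def __init__(self, n_mfcc: int = 40, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mfcc, 128, kernel_size=3, padding=1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, out_dim, kernel_size=3, padding=1), nn.BatchNorm1d(out_dim), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                          # pool over frames
        )

    def forward(self, mfcc):                      # mfcc: (batch, n_mfcc, num_frames)
        return self.net(mfcc).squeeze(-1)         # (batch, out_dim) frequency-domain features
```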
The fusion model 203 may be designed to fuse the time-domain features and the frequency-domain features and to perform voice recognition based on the fused features, i.e., to output a predicted distribution. Here, fusion means transforming the time-domain features and the frequency-domain features together into a classification feature space to obtain classification features (i.e., the fused features); the fusion process therefore includes a feature splicing (concatenation) step and a transformation step. Specifically, the fusion model 203 may splice and transform the time-domain features of the input audio and the frequency-domain features of the input audio to obtain the fused features. For example, the fusion model 203 may include one feature splicing layer and multiple fully connected (FC) layers. The feature splicing layer concatenates two feature vectors into one feature vector, and the FC layers transform feature vectors. The present disclosure does not limit the number and arrangement of the feature splicing layer and the FC layers in the fusion model 203.
According to an exemplary embodiment of the present disclosure, the fusion model 203 may include one feature splicing layer and two FC layers. Depending on where the splicing is performed relative to the transformations, the fusion model 203 can have three different implementations, namely early fusion, mid-term fusion, and late fusion.
Fig. 3a, 3b and 3c show schematic diagrams of a fusion approach according to an exemplary embodiment of the present disclosure.
As shown in fig. 3a, the fusion model 203 may adopt the early-fusion approach. In early fusion, the feature splicing layer is the first layer, and the two FC layers are the second and third layers, respectively. Specifically, the fusion model 203 may first splice the two local features (i.e., the time-domain features output by the time-domain branch model 201 and the frequency-domain features output by the frequency-domain branch model 202) together at the first layer to obtain spliced (global) features, then pass the spliced features through the two FC layers at the second and third layers to project (transform) them into the classification feature space to obtain the classification features (i.e., the fused features) of the input audio, and perform voice recognition according to the classification features of the input audio. For example, the classification features of the input audio are subjected to softmax processing to obtain a predicted classification result (i.e., a probability distribution). Because early fusion splices the time-domain and frequency-domain features first and then transforms the spliced features, the transformation is performed on the combined time-domain and frequency-domain features; that is, the features of both domains of the audio signal are considered more comprehensively, so a good voice recognition effect can be achieved.
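A minimal sketch of the early-fusion variant is given below (the feature dimensions and number of classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Splice first, then two FC layers (early fusion)."""
    def __init__(self, time_dim=256, freq_dim=256, hidden=512, num_classes=462):
        super().__init__()
        self.fc1 = nn.Linear(time_dim + freq_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, t_feat, f_feat):
        fused = torch.cat([t_feat, f_feat], dim=-1)   # layer 1: feature splicing
        fused = torch.relu(self.fc1(fused))           # layer 2: FC transformation
        logits = self.fc2(fused)                      # layer 3: FC into classification space
        return torch.softmax(logits, dim=-1)          # predicted distribution
```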
As shown in fig. 3b, the fusion model 203 may adopt the mid-term-fusion approach. In mid-term fusion, one FC layer is the first layer, the feature splicing layer is the second layer, and the other FC layer is the third layer. Specifically, the fusion model 203 may first embed each of the two local features (i.e., the time-domain features output by the time-domain branch model 201 and the frequency-domain features output by the frequency-domain branch model 202) through one FC layer at the first layer, then splice the two FC-layer outputs together at the second layer, and finally pass the spliced global features through one FC layer at the third layer to project them into the classification feature space to obtain the classification features (i.e., the fused features) of the input audio, and perform voice recognition according to the classification features of the input audio. For example, the classification features of the input audio are subjected to softmax processing to obtain a predicted classification result (i.e., a probability distribution).
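A minimal sketch of the mid-term-fusion variant, with the same illustrative dimensions as above:

```python
import torch
import torch.nn as nn

class MidFusion(nn.Module):
    """One FC layer per branch, splice, then one more FC layer (mid-term fusion)."""
    def __init__(self, time_dim=256, freq_dim=256, hidden=256, num_classes=462):
        super().__init__()
        self.fc_time = nn.Linear(time_dim, hidden)        # layer 1: embed time features
        self.fc_freq = nn.Linear(freq_dim, hidden)        # layer 1: embed frequency features
        self.fc_out = nn.Linear(2 * hidden, num_classes)  # layer 3: FC after splicing

    def forward(self, t_feat, f_feat):
        t = torch.relu(self.fc_time(t_feat))
        f = torch.relu(self.fc_freq(f_feat))
        fused = torch.cat([t, f], dim=-1)                 # layer 2: feature splicing
        return torch.softmax(self.fc_out(fused), dim=-1)  # predicted distribution
```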
As shown in fig. 3c, the fusion model 203 may adopt the late-fusion approach. In late fusion, the two FC layers are the first and second layers, respectively, and the feature splicing layer is the third layer. Specifically, the fusion model 203 may first pass each of the two local features (i.e., the time-domain features output by the time-domain branch model 201 and the frequency-domain features output by the frequency-domain branch model 202) through two FC layers at the first and second layers, projecting them into low-dimensional classification feature spaces to obtain the classification features of the time-domain features and of the frequency-domain features, respectively; then splice these two sets of classification features together at the third layer, combining the classification features in the two low-dimensional classification feature spaces into classification features in a higher-dimensional classification feature space to obtain a global classification feature; and perform voice recognition according to the global classification feature. For example, the global classification feature is subjected to softmax processing to obtain a predicted classification result (i.e., a probability distribution).
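A minimal sketch of the late-fusion variant; here each branch is projected into half of the classification space before splicing, which is one possible reading of the description above and an assumption of this sketch:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Two FC layers per branch, then splice the per-branch classification features."""
    def __init__(self, time_dim=256, freq_dim=256, hidden=256, half_classes=231):
        super().__init__()
        self.time_fc = nn.Sequential(nn.Linear(time_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, half_classes))   # layers 1-2
        self.freq_fc = nn.Sequential(nn.Linear(freq_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, half_classes))   # layers 1-2

    def forward(self, t_feat, f_feat):
        t_cls = self.time_fc(t_feat)                  # low-dimensional classification features
        f_cls = self.freq_fc(f_feat)
        fused = torch.cat([t_cls, f_cls], dim=-1)     # layer 3: splice into a global feature
        return torch.softmax(fused, dim=-1)           # predicted distribution
```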
The fusion model 203 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio using any of the fusion approaches described above, or using any other feasible fusion approach. For example, the number of FC layers need not be two; it may be a single layer, or three or more layers; the feature splicing layer may be placed at any position among the layers; or the fused features may be obtained directly by only splicing the time-domain features of the input audio and the frequency-domain features of the input audio, after which voice recognition is performed.
Fig. 4 is a flowchart illustrating a voice recognition method according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, in step 401, time-domain features of the input audio are acquired. Specifically, the time-domain features of the input audio may be obtained by extracting them from the raw audio waveform of the input audio. According to an exemplary embodiment of the present disclosure, the step of extracting time-domain features from the raw audio waveform of the input audio may be performed by the time-domain branch model 201. According to another exemplary embodiment of the present disclosure, the time-domain features of the input audio may be obtained from local storage, a server, or the like. The time-domain features of the input audio may also be obtained in any other feasible manner, and the present disclosure does not limit the manner or source of acquisition.
According to an exemplary embodiment of the present disclosure, the time-domain branch model 201 may be implemented using a multi-layer CNN, an RNN, or the like. According to an exemplary embodiment of the present disclosure, the time-domain branch model 201 may be implemented using a SincNet model. Of course, the manner of extracting time-domain features is not limited thereto, and any manner may be used; for example, any multi-layer CNN or RNN may be used to extract the time-domain features.
In step 402, frequency-domain features of the input audio are acquired. Specifically, the frequency-domain features of the input audio may be obtained by performing a time-frequency transform (such as a fast Fourier transform or a short-time Fourier transform) on the raw audio signal of the input audio and extracting frequency-domain features from the spectral signal obtained by the time-frequency transform. According to another exemplary embodiment of the present disclosure, the frequency-domain features of the input audio may be obtained from local storage, a server, or the like. The frequency-domain features of the input audio may also be obtained in any other feasible manner, and the present disclosure does not limit the manner or source of acquisition.
According to an exemplary embodiment of the present disclosure, MFCC, PLP, LPCC, or the like may be used as a spectrum for extracting frequency domain features.
According to an exemplary embodiment of the present disclosure, the step of extracting frequency domain features from the spectral signal may be performed by a frequency domain branch model 202. According to an exemplary embodiment of the present disclosure, the frequency domain branching model 202 may be implemented using one-dimensional or two-dimensional multi-layer CNNs. Of course, the manner of extracting the frequency domain features is not limited thereto, and any manner may be used to extract the frequency domain features, for example, any GMM or DNN or the like may also be used to extract the frequency domain features from the spectrum signal.
Further, steps 401 and 402 may be performed sequentially, in reverse order, or in parallel; the present disclosure does not limit the order in which steps 401 and 402 are executed.
In step 403, the time-domain features of the input audio and the frequency-domain features of the input audio are fused, and voice recognition is performed based on the fused features. According to an exemplary embodiment of the present disclosure, the step of fusing the time-domain features of the input audio and the frequency-domain features of the input audio and performing voice recognition based on the fused features may be performed by the fusion model 203.
According to an exemplary embodiment of the present disclosure, the time domain feature of the input audio and the frequency domain feature of the input audio may be spliced and transformed to obtain a fused feature. Here, after the time-domain features of the input audio and the frequency-domain features of the input audio are subjected to the splicing and transformation processes, they may be projected to the classification feature space, i.e., transformed into classification features (i.e., fused features). Softmax processing is performed on the classification features to obtain a predicted classification result (i.e., probability distribution values), thereby performing voice recognition.
According to an exemplary embodiment of the present disclosure, the time-domain features of the input audio and the frequency-domain features of the input audio may be fused in an early-fusion manner. In early fusion, the feature splicing layer is the first layer, and the two FC layers are the second and third layers, respectively. Specifically, the time-domain features of the input audio and the frequency-domain features of the input audio may be spliced (e.g., at the first layer) to obtain spliced features, and a two-layer FC transformation may be performed on the spliced features (e.g., at the second and third layers) to obtain the fused features. Because early fusion splices the time-domain and frequency-domain features first and then transforms the spliced features, the transformation is performed on the combined time-domain and frequency-domain features; that is, the features of both domains of the audio signal are considered more comprehensively, so a good voice recognition effect can be achieved.
According to an exemplary embodiment of the present disclosure, the time-domain features of the input audio and the frequency-domain features of the input audio may be fused in a mid-term-fusion manner. In mid-term fusion, one FC layer is the first layer, the feature splicing layer is the second layer, and the other FC layer is the third layer. Specifically, a one-layer FC transformation may be performed on the time-domain features of the input audio (e.g., at the first layer) to obtain first transformed features, and a one-layer FC transformation may be performed on the frequency-domain features of the input audio (e.g., at the first layer) to obtain second transformed features (the order of the two transformations is not limited); the first transformed features and the second transformed features may then be spliced (e.g., at the second layer) to obtain spliced features, and a one-layer FC transformation may be performed on the spliced features (e.g., at the third layer) to obtain the fused features.
According to an exemplary embodiment of the present disclosure, the time-domain features of the input audio and the frequency-domain features of the input audio may be fused in a late-fusion manner. In late fusion, the two FC layers are the first and second layers, respectively, and the feature splicing layer is the third layer. Specifically, a two-layer FC transformation may be performed on the time-domain features of the input audio (e.g., at the first and second layers) to obtain third transformed features, and a two-layer FC transformation may be performed on the frequency-domain features of the input audio (e.g., at the first and second layers) to obtain fourth transformed features (the order of the two transformations is not limited); the third transformed features and the fourth transformed features may then be spliced (e.g., at the third layer) to obtain the fused features.
Of course, the fusion manner is not limited to the above; any other feasible fusion manner may be adopted to transform the time-domain features of the input audio and the frequency-domain features of the input audio together into the classification feature space. For example, the number of FC layers need not be two; it may be a single layer, or three or more layers; the feature splicing layer may be placed at any position among the layers; or the fused features may be obtained directly by only splicing the time-domain features of the input audio and the frequency-domain features of the input audio, after which voice recognition is performed.
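Putting steps 401 to 403 together, the following sketch runs a forward pass using the TimeBranch, FreqBranch, and EarlyFusion sketches given earlier; the batch size, audio length, and MFCC shape are illustrative assumptions:

```python
import torch

time_branch, freq_branch, fusion = TimeBranch(), FreqBranch(), EarlyFusion()

waveform = torch.randn(8, 1, 16000)   # raw audio waveforms for step 401
mfcc = torch.randn(8, 40, 101)        # precomputed MFCC spectra for step 402

t_feat = time_branch(waveform)        # step 401: time-domain features
f_feat = freq_branch(mfcc)            # step 402: frequency-domain features
probs = fusion(t_feat, f_feat)        # step 403: fuse and classify

print(probs.shape)                    # (8, num_classes) predicted distributions
```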
Fig. 5 illustrates a block diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, a voice recognition apparatus 500 according to an exemplary embodiment of the present disclosure may include a time domain feature acquisition module 501, a frequency domain feature acquisition module 502, and a voice recognition module 503.
The time-domain feature acquisition module 501 may acquire time-domain features of the input audio. Specifically, the time-domain feature acquisition module 501 may obtain the time-domain features of the input audio by extracting them from the raw audio waveform of the input audio. Alternatively, the time-domain feature acquisition module 501 may obtain the time-domain features of the input audio from local storage, a server, or the like. The time-domain feature acquisition module 501 may also obtain the time-domain features of the input audio in any other feasible manner, and the present disclosure does not limit the manner or source of acquisition.
According to an exemplary embodiment of the present disclosure, the time-domain feature acquisition module 501 may extract time-domain features from the raw audio waveform of the input audio through the time-domain branch model 201. According to an exemplary embodiment of the present disclosure, the time-domain branch model 201 may be implemented using a multi-layer CNN, an RNN, or the like, or using a SincNet model. Of course, the manner of extracting time-domain features is not limited thereto, and any manner may be used; for example, any multi-layer CNN or RNN may be used to extract the time-domain features.
The frequency-domain feature acquisition module 502 may acquire frequency-domain features of the input audio. Specifically, the frequency-domain feature acquisition module 502 may obtain the frequency-domain features of the input audio by performing a time-frequency transform (such as a fast Fourier transform or a short-time Fourier transform) on the raw audio signal of the input audio and extracting frequency-domain features from the resulting spectral signal. Alternatively, the frequency-domain feature acquisition module 502 may obtain the frequency-domain features of the input audio from local storage, a server, or the like. The frequency-domain feature acquisition module 502 may also obtain the frequency-domain features of the input audio in any other feasible manner, and the present disclosure does not limit the manner or source of acquisition.
According to an exemplary embodiment of the present disclosure, MFCC, PLP, LPCC, or the like may be used as a spectrum for extracting frequency domain features.
According to an exemplary embodiment of the present disclosure, the frequency domain feature obtaining module 502 may extract frequency domain features from the spectral signal through the frequency domain branching model 202. According to an exemplary embodiment of the present disclosure, the frequency domain branching model 202 may be implemented using one-dimensional or two-dimensional multi-layer CNNs. Of course, the manner of extracting the frequency domain features is not limited thereto, and any manner may be used to extract the frequency domain features.
The voice recognition module 503 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio and perform voice recognition based on the fused features. According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may perform voice recognition through the fusion model 203.
According to an exemplary embodiment of the disclosure, the voice recognition module 503 may perform splicing and transformation on the time domain feature of the input audio and the frequency domain feature of the input audio to obtain a fused feature. Here, after the time-domain features of the input audio and the frequency-domain features of the input audio are subjected to the splicing and transformation processes, they may be projected to the classification feature space, i.e., transformed into classification features (i.e., fused features). The voice recognition module 503 may perform softmax processing on the classification features to obtain a predicted classification result (i.e., probability distribution values), thereby performing voice recognition.
According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio in an early-fusion manner. In early fusion, the feature splicing layer is the first layer, and the two FC layers are the second and third layers, respectively. Specifically, the voice recognition module 503 may splice the time-domain features of the input audio and the frequency-domain features of the input audio (e.g., at the first layer) to obtain spliced features, and perform a two-layer FC transformation on the spliced features (e.g., at the second and third layers) to obtain the fused features. Because early fusion splices the time-domain and frequency-domain features first and then transforms the spliced features, the transformation is performed on the combined time-domain and frequency-domain features; that is, the features of both domains of the audio signal are considered more comprehensively, so a good voice recognition effect can be achieved.
According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio in a mid-term-fusion manner. In mid-term fusion, one FC layer is the first layer, the feature splicing layer is the second layer, and the other FC layer is the third layer. Specifically, the voice recognition module 503 may perform a one-layer FC transformation on the time-domain features of the input audio (e.g., at the first layer) to obtain first transformed features and a one-layer FC transformation on the frequency-domain features of the input audio (e.g., at the first layer) to obtain second transformed features (the order of the two transformations is not limited), splice the first transformed features and the second transformed features (e.g., at the second layer) to obtain spliced features, and perform a one-layer FC transformation on the spliced features (e.g., at the third layer) to obtain the fused features.
According to an exemplary embodiment of the present disclosure, the voice recognition module 503 may fuse the time-domain features of the input audio and the frequency-domain features of the input audio in a late-fusion manner. In late fusion, the two FC layers are the first and second layers, respectively, and the feature splicing layer is the third layer. Specifically, the voice recognition module 503 may perform a two-layer FC transformation on the time-domain features of the input audio (e.g., at the first and second layers) to obtain third transformed features and a two-layer FC transformation on the frequency-domain features of the input audio (e.g., at the first and second layers) to obtain fourth transformed features (the order of the two transformations is not limited), and splice the third transformed features and the fourth transformed features (e.g., at the third layer) to obtain the fused features.
Of course, the fusion manner is not limited to the above; any other feasible fusion manner may be adopted to transform the time-domain features of the input audio and the frequency-domain features of the input audio together into the classification feature space. For example, the number of FC layers need not be two; it may be a single layer, or three or more layers; the feature splicing layer may be placed at any position among the layers; or the fused features may be obtained directly by only splicing the time-domain features of the input audio and the frequency-domain features of the input audio, after which voice recognition is performed.
According to exemplary embodiments of the present disclosure, the TFN model proposed in the present disclosure, as well as the voice recognition method and the voice recognition apparatus according to the present disclosure, may be applied to speaker recognition. When they are applied to speaker recognition, the input audio may be the speaker's voice.
Specifically, depending on the type of output, speaker recognition may include speaker identification and speaker verification. Speaker identification determines to which of the enrolled persons the input voice belongs and outputs the index of the predicted person. Speaker verification confirms whether the input voice was uttered by the claimed person and outputs true or false. Speaker identification is a multi-class classification problem, and speaker verification is a binary classification problem; a multi-class problem can be transformed into multiple binary classification problems. The voice recognition method and the voice recognition apparatus according to the present disclosure may be applied to both speaker identification and speaker verification.
Depending on whether the user needs to cooperate with the system, speaker identification may include text-dependent speaker identification and text-independent speaker identification. A text-dependent speaker recognition system requires the user to speak specific content based on interaction with the system, so the system can resist replay attacks using recorded speech and provides better robustness. However, it requires cooperation from the user, which may be limited in some applications, such as scenarios without interaction. A text-independent speaker recognition system does not specify the content of the input speech; this is more difficult, because the speaker must be recognized from speech of unknown content. Meanwhile, text-independent speaker recognition systems are more widely used because they require less interaction. The voice recognition method and the voice recognition apparatus according to the present disclosure can be applied to text-dependent speaker recognition as well as to text-independent speaker recognition.
FIGS. 6a, 6b, 6c, and 6d illustrate schematic diagrams of speaker recognition systems. Fig. 6a shows a schematic diagram of a text-dependent speaker recognition system, fig. 6b shows a schematic diagram of a text-dependent speaker verification system, fig. 6c shows a schematic diagram of a text-independent speaker recognition system, and fig. 6d shows a schematic diagram of a text-independent speaker verification system.
Table 1 below compares experimental results of the TFN model proposed in the present disclosure and the conventional SincNet model on the TIMIT dataset and the LibriSpeech dataset.
[Table 1]
Dataset       SincNet (CER)   TFN, proposed (CER)
TIMIT         0.85%           0.65%
LibriSpeech   0.96%           0.32%
The TIMIT dataset includes 462 speakers, and the LibriSpeech dataset includes 2484 speakers. Experiments were performed with models of comparable size by controlling the dimension of the classification feature space. For the TIMIT dataset, the classification feature space dimension of the model is 1024; for the LibriSpeech dataset, it is 2048. In both the TFN model proposed in the present disclosure and the SincNet model, the band-pass filter parameter is set to 512 for the small model and 1024 for the large model. Furthermore, in the TFN model proposed in the present disclosure, MFCC is used as the spectrum for extracting frequency-domain features. The classification error rate (CER) is used to evaluate model performance; a lower CER indicates better performance. As shown in Table 1, for the TIMIT dataset, the CER of the traditional SincNet model is 0.85% and the CER of the proposed TFN model is 0.65%; for the LibriSpeech dataset, the CER of the traditional SincNet model is 0.96% and the CER of the proposed TFN model is 0.32%. The proposed TFN model therefore exhibits better performance.
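For reference, the classification error rate quoted in Table 1 is simply the fraction of utterances assigned to the wrong speaker; a minimal sketch (with made-up predictions and labels) is:

```python
def classification_error_rate(predictions, labels):
    # CER = number of misclassified utterances / total number of utterances
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

print(classification_error_rate([0, 1, 2, 2], [0, 1, 1, 2]))   # 0.25
```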
According to the voice recognition method and the voice recognition apparatus of the exemplary embodiments of the present disclosure, voice recognition is performed by jointly using the time-domain information and the frequency-domain information of a sound signal through fusion, so that the temporal and spectral information of the signal is fully exploited and voice recognition performance is improved. For example, when the voice recognition method and the voice recognition apparatus according to the exemplary embodiments of the present disclosure are applied to speaker recognition, voiceprint recognition is performed by jointly using the time-domain information and the frequency-domain information of the speech signal through fusion, which fully exploits the temporal and spectral information of the speech signal and improves speaker recognition performance.
In addition, according to the voice recognition method and the voice recognition apparatus of the exemplary embodiments of the present disclosure, the time-domain features and the frequency-domain features can be transformed together into the classification feature space in an early-fusion manner, so that the transformation jointly considers the features of both domains of the audio signal, and a good voice recognition effect can be achieved.
The voice recognition method and the voice recognition apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 2 to 6 d.
The various modules shown in fig. 5 may be configured as software, hardware, firmware, or any combination thereof that performs a particular function. For example, each module may correspond to a dedicated integrated circuit, to pure software code, or to a combination of software and hardware. Furthermore, one or more of the functions implemented by these modules may also be performed collectively by components in a physical entity device (e.g., a processor, a client, or a server).
Further, the method described with reference to fig. 4 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium may be provided having instructions stored thereon, wherein the instructions, when executed on at least one computing device, cause the at least one computing device to perform the voice recognition method according to the present disclosure.
The computer program in the computer-readable storage medium may be executed in an environment deployed on a computer device such as a client, a host, a proxy device, or a server. It should be noted that the computer program may also be used to perform additional steps beyond the steps described above, or more specific processing when performing those steps; since these additional steps and further processing have already been mentioned in the description of the related method with reference to fig. 4, they are not repeated here.
It should be noted that each module according to the exemplary embodiments of the present disclosure may rely entirely on the execution of a computer program to realize its corresponding function, that is, each module corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a dedicated software package (e.g., a lib library) to realize the corresponding functions.
Alternatively, the various modules shown in fig. 5 may be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present disclosure may also be implemented as a computing device including a storage component having stored therein a set of computer-executable instructions that, when executed by a processor, perform a voice recognition method according to exemplary embodiments of the present disclosure.
In particular, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. Further, the computing device may be a PC, a tablet device, a personal digital assistant, a smartphone, a web appliance, or any other device capable of executing the above set of instructions.
The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the above instructions (or instruction sets), individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote devices (e.g., via wireless transmission).
In a computing device, a processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the voice recognition method according to the exemplary embodiments of the present disclosure may be implemented in software, some in hardware, and others in a combination of hardware and software.
The processor may execute instructions or code stored in the storage component, which may also store data. Instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.

The storage component may be integrated with the processor, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example through an I/O port, a network connection, or the like, so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
The voice recognition method according to the exemplary embodiments of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated according to imprecise boundaries.
Thus, the voice recognition method described with reference to fig. 4 may be implemented by a voice recognition apparatus comprising at least one computing device and at least one storage device storing computer instructions.
According to an exemplary embodiment of the present disclosure, the at least one computing device is a computing device for performing the voice recognition method according to an exemplary embodiment of the present disclosure, and the storage device has stored therein a set of computer-executable instructions that, when executed by the at least one computing device, perform the voice recognition method described with reference to fig. 4.
While various exemplary embodiments of the present disclosure have been described above, it should be understood that the above description is exemplary only, and not exhaustive, and that the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. Therefore, the protection scope of the present disclosure should be subject to the scope of the claims.

Claims (12)

1. A method of voice recognition, comprising:
acquiring time domain features of input audio;
acquiring frequency domain features of the input audio;
and fusing the time domain features of the input audio and the frequency domain features of the input audio, and performing voice recognition based on the fused features.
2. The voice recognition method of claim 1, wherein the fusing of the time domain features of the input audio and the frequency domain features of the input audio comprises:
splicing and transforming the time domain features of the input audio and the frequency domain features of the input audio to obtain the fused features.
3. The voice recognition method of claim 2, wherein the splicing and transforming of the time domain features of the input audio and the frequency domain features of the input audio to obtain the fused features comprises:
splicing the time domain features of the input audio and the frequency domain features of the input audio to obtain spliced features;
and performing a two-layer fully-connected transformation on the spliced features to obtain the fused features.
4. The voice recognition method of claim 2, wherein the splicing and transforming of the time domain features of the input audio and the frequency domain features of the input audio to obtain the fused features comprises:
performing a one-layer fully-connected transformation on the time domain features of the input audio to obtain first transformed features;
performing a one-layer fully-connected transformation on the frequency domain features of the input audio to obtain second transformed features;
splicing the first transformed features and the second transformed features to obtain spliced features;
and performing a one-layer fully-connected transformation on the spliced features to obtain the fused features.
5. The voice recognition method of claim 2, wherein the splicing and transforming of the time domain features of the input audio and the frequency domain features of the input audio to obtain the fused features comprises:
performing a two-layer fully-connected transformation on the time domain features of the input audio to obtain third transformed features;
performing a two-layer fully-connected transformation on the frequency domain features of the input audio to obtain fourth transformed features;
and splicing the third transformed features and the fourth transformed features to obtain the fused features.
6. A voice recognition apparatus, comprising:
a time domain feature acquisition module configured to acquire time domain features of input audio;
a frequency domain feature acquisition module configured to acquire frequency domain features of the input audio;
a voice recognition module configured to fuse the time domain features of the input audio and the frequency domain features of the input audio and to perform voice recognition based on the fused features.
7. The voice recognition apparatus of claim 6, wherein the voice recognition module is configured to:
splice and transform the time domain features of the input audio and the frequency domain features of the input audio to obtain the fused features.
8. The voice recognition apparatus of claim 7, wherein the voice recognition module is configured to:
splice the time domain features of the input audio and the frequency domain features of the input audio to obtain spliced features;
and perform a two-layer fully-connected transformation on the spliced features to obtain the fused features.
9. The voice recognition apparatus of claim 7, wherein the voice recognition module is configured to:
perform a one-layer fully-connected transformation on the time domain features of the input audio to obtain first transformed features;
perform a one-layer fully-connected transformation on the frequency domain features of the input audio to obtain second transformed features;
splice the first transformed features and the second transformed features to obtain spliced features; and perform a one-layer fully-connected transformation on the spliced features to obtain the fused features.
10. The voice recognition apparatus of claim 7, wherein the voice recognition module is configured to:
perform a two-layer fully-connected transformation on the time domain features of the input audio to obtain third transformed features;
perform a two-layer fully-connected transformation on the frequency domain features of the input audio to obtain fourth transformed features;
and splice the third transformed features and the fourth transformed features to obtain the fused features.
11. A voice recognition apparatus comprising at least one computing device and at least one storage device having computer instructions stored thereon, wherein the computer instructions, when executed by the at least one computing device, cause the at least one computing device to perform the voice recognition method as claimed in any one of claims 1 to 5.
12. A computer-readable storage medium having instructions stored thereon, which when executed on at least one computing device, cause the at least one computing device to perform a voice recognition method as claimed in any one of claims 1 to 5.
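For illustration only, the following sketch shows one possible reading of the fusion variants recited in claims 4 and 5 (the variant of claim 3 corresponds to the early fusion sketch given in the description above). PyTorch, fixed-length feature vectors, the hidden sizes, and the ReLU activations are assumptions of the sketch; the claims themselves specify only the number of fully-connected layers and the order of splicing.

import torch
import torch.nn as nn

class TransformSpliceTransform(nn.Module):
    """Claim 4 reading: one fully-connected layer per domain, splice, then one more layer."""
    def __init__(self, time_dim: int, freq_dim: int, out_dim: int):
        super().__init__()
        self.time_fc = nn.Linear(time_dim, out_dim)    # first transformed features
        self.freq_fc = nn.Linear(freq_dim, out_dim)    # second transformed features
        self.joint_fc = nn.Linear(2 * out_dim, out_dim)

    def forward(self, t: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        spliced = torch.cat([self.time_fc(t), self.freq_fc(f)], dim=-1)
        return self.joint_fc(spliced)                  # fused features

class TransformThenSplice(nn.Module):
    """Claim 5 reading: two fully-connected layers per domain, then splice."""
    def __init__(self, time_dim: int, freq_dim: int, out_dim: int):
        super().__init__()
        half = out_dim // 2
        self.time_fc = nn.Sequential(nn.Linear(time_dim, half), nn.ReLU(), nn.Linear(half, half))
        self.freq_fc = nn.Sequential(nn.Linear(freq_dim, half), nn.ReLU(), nn.Linear(half, half))

    def forward(self, t: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # Splicing the third and fourth transformed features directly yields the fused features.
        return torch.cat([self.time_fc(t), self.freq_fc(f)], dim=-1)

Either module can be dropped in wherever the voice recognition module fuses the two domains; the choice between them trades joint transformation of both domains against per-domain transformation before fusion.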
CN202010694750.9A 2020-07-17 2020-07-17 Voice recognition method, apparatus, and computer-readable storage medium storing instructions Pending CN111816166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010694750.9A CN111816166A (en) 2020-07-17 2020-07-17 Voice recognition method, apparatus, and computer-readable storage medium storing instructions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010694750.9A CN111816166A (en) 2020-07-17 2020-07-17 Voice recognition method, apparatus, and computer-readable storage medium storing instructions

Publications (1)

Publication Number Publication Date
CN111816166A true CN111816166A (en) 2020-10-23

Family

ID=72865537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010694750.9A Pending CN111816166A (en) 2020-07-17 2020-07-17 Voice recognition method, apparatus, and computer-readable storage medium storing instructions

Country Status (1)

Country Link
CN (1) CN111816166A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108009481A (en) * 2017-11-22 2018-05-08 浙江大华技术股份有限公司 A kind of training method and device of CNN models, face identification method and device
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN108899037A (en) * 2018-07-05 2018-11-27 平安科技(深圳)有限公司 Animal vocal print feature extracting method, device and electronic equipment
US10482334B1 (en) * 2018-09-17 2019-11-19 Honda Motor Co., Ltd. Driver behavior recognition
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109584887A (en) * 2018-12-24 2019-04-05 科大讯飞股份有限公司 A kind of method and apparatus that voiceprint extracts model generation, voiceprint extraction
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110047468A (en) * 2019-05-20 2019-07-23 北京达佳互联信息技术有限公司 Audio recognition method, device and storage medium
CN110502981A (en) * 2019-07-11 2019-11-26 武汉科技大学 A kind of gesture identification method merged based on colour information and depth information
CN110459241A (en) * 2019-08-30 2019-11-15 厦门亿联网络技术股份有限公司 A kind of extracting method and system for phonetic feature
CN110675860A (en) * 2019-09-24 2020-01-10 山东大学 Voice information identification method and system based on improved attention mechanism and combined with semantics
CN111104987A (en) * 2019-12-25 2020-05-05 三一重工股份有限公司 Face recognition method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIAO YUQING; ZOU WEI; LIU TONGLAI; ZHOU MING; CAI GUOYONG: "Speech Emotion Recognition Based on Parameter Transfer and Convolutional Recurrent Neural Networks", no. 10 *
TAN TIENIU: "Artificial Intelligence: Building an Intelligent Future with AI Technology", Popular Science Press, pages: 104 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560673A (en) * 2020-12-15 2021-03-26 北京天泽智云科技有限公司 Thunder detection method and system based on image recognition
CN112767952A (en) * 2020-12-31 2021-05-07 苏州思必驰信息科技有限公司 Voice wake-up method and device
CN112951242A (en) * 2021-02-02 2021-06-11 华南理工大学 Phrase voice speaker matching method based on twin neural network
CN113257283A (en) * 2021-03-29 2021-08-13 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN113257283B (en) * 2021-03-29 2023-09-26 北京字节跳动网络技术有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN113793614A (en) * 2021-08-24 2021-12-14 南昌大学 Speaker recognition method based on independent vector analysis and voice feature fusion
CN113793614B (en) * 2021-08-24 2024-02-09 南昌大学 Speech feature fusion speaker recognition method based on independent vector analysis

Similar Documents

Publication Publication Date Title
Hanifa et al. A review on speaker recognition: Technology and challenges
US11776530B2 (en) Speech model personalization via ambient context harvesting
Kabir et al. A survey of speaker recognition: Fundamental theories, recognition methods and opportunities
CN111816166A (en) Voice recognition method, apparatus, and computer-readable storage medium storing instructions
Liu et al. An MFCC‐based text‐independent speaker identification system for access control
CN110097870B (en) Voice processing method, device, equipment and storage medium
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
Wu et al. Partially fake audio detection by self-attention-based fake span discovery
US11562735B1 (en) Multi-modal spoken language understanding systems
CN112397051A (en) Voice recognition method and device and terminal equipment
Dawood et al. A robust voice spoofing detection system using novel CLS-LBP features and LSTM
CN111667839A (en) Registration method and apparatus, speaker recognition method and apparatus
Jakubec et al. Deep speaker embeddings for Speaker Verification: Review and experimental comparison
CN113450806A (en) Training method of voice detection model, and related method, device and equipment
Ray et al. Feature genuinization based residual squeeze-and-excitation for audio anti-spoofing in sound AI
Reimao Synthetic speech detection using deep neural networks
Zeng et al. Spatio-temporal representation learning enhanced source cell-phone recognition from speech recordings
Isyanto et al. Voice biometrics for Indonesian language users using algorithm of deep learning CNN residual and hybrid of DWT-MFCC extraction features
Shome et al. Speaker Recognition through Deep Learning Techniques: A Comprehensive Review and Research Challenges
Saleema et al. Speaker identification approach for the post-pandemic era of Internet of Things
Arora et al. An efficient text-independent speaker verification for short utterance data from Mobile devices
Zi et al. Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition
Agrawal et al. Comparison of Unsupervised Modulation Filter Learning Methods for ASR.
Chauhan et al. Text-Independent Speaker Recognition System Using Feature-Level Fusion for Audio Databases of Various Sizes
CN114664325A (en) Abnormal sound identification method, system, terminal equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination