CN114333811A - Voice recognition method, system and equipment


Info

Publication number
CN114333811A
Authority
CN
China
Prior art keywords: neural network, complex, deep neural, part vector, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011060928.0A
Other languages
Chinese (zh)
Inventor
潘昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN202011060928.0A
Publication of CN114333811A
Legal status: Pending

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention provides a speech recognition method, system and device. The method comprises: acquiring a multi-channel speech signal to be recognized; processing the multi-channel speech signal to obtain fixed-length audio frames; performing a Fourier transform on the audio frames to obtain the real part vector and the imaginary part vector of the multi-channel speech signal; inputting the real part vector and the imaginary part vector into a trained complex deep neural network; and processing through the complex deep neural network to output the recognition result of the multi-channel speech signal. The scheme of the invention supports multi-microphone array data more comprehensively, models and fits the data characteristics in more detail, and resists noise better.

Description

Voice recognition method, system and equipment
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a speech recognition method, system and device.
Background
Currently, most of the components, techniques and architectures of deep learning are based on real-valued operations and representations. However, recent research on recurrent neural networks, together with earlier fundamental analyses, has shown that complex numbers offer richer representational capability and can also support noise-robust memory retrieval mechanisms, so they have the potential to bring new neural architectures into the spotlight. Here, complex-valued techniques are applied to the CNN and the Transformer: complex batch normalization and weight initialization strategies for complex-valued neural networks are implemented on top of complex convolution, and they are used in end-to-end training experiments. Experiments show that a complex-valued model can outperform a real-valued model of the same structure.
The Transformer is easiest to understand through the usual machine translation example.
First, treat the model as a black box: in machine translation, one language goes in and another language comes out.
Opening the black box, it consists of an encoding component, a decoding component and the connections between them.
The encoding component is a stack of encoders, and the decoding component is a stack of the same number of decoders.
All encoders are structurally identical, but they do not share parameters. Each encoder can be decomposed into two sub-layers. A sentence entering the encoder first passes through a self-attention layer, which helps the encoder look at the other words of the input sentence while encoding each word. The output of the self-attention layer is passed into a feed-forward neural network; the feed-forward network applied to the word at each position is identical.
The decoder also has the encoder's self-attention and feed-forward layers, and in addition an attention layer between the two that attends to the relevant parts of the input sentence.
For example, suppose the following sentence is to be translated:
The animal didn’t cross the street because it was too tired
What does "it" refer to in this sentence, the street or the animal? This is a simple question for a human but not for an algorithm. When the model processes the word "it", the self-attention mechanism allows "it" to be linked to "animal".
As the model processes each word of the input sequence, self-attention can attend to all words of the entire input sequence, helping the model encode the current word better.
Existing schemes are based on neural network models obtained through training, and their feature modeling space is the real number space, so the phase information of speech cannot be well described and modeled; meanwhile, several public studies show that performance improves when the phase information of speech is incorporated into the model. In addition, existing schemes have difficulty fusing the content of different channels of multi-microphone array data, which hinders the model's fit to the data and ultimately affects the recognition result.
The prior schemes have the following defects:
support for multi-microphone array data is poor, and the multi-channel input data is not fully exploited;
the modeling space is the real number domain, so the imaginary part of the data is discarded entirely;
robustness to noise is poor, which falls short of the requirements of practical industrial applications.
Disclosure of Invention
The invention provides a speech recognition method, system and device that support multi-microphone array data more comprehensively, model and fit the data characteristics in more detail, and resist noise better.
To solve the above technical problem, an embodiment of the present invention provides the following solutions:
a method of speech recognition, the method comprising:
acquiring a multi-channel speech signal to be recognized;
processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames;
performing a Fourier transform on the audio frames to obtain a real part vector and an imaginary part vector of the multi-channel speech signal to be recognized;
inputting the real part vector and the imaginary part vector into a complex deep neural network obtained through training;
and processing through the complex deep neural network to output the recognition result of the multi-channel speech signal to be recognized.
Optionally, processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames includes:
segmenting the multi-channel speech signal to be recognized in the time domain to obtain a plurality of fixed-length audio frames.
Optionally, the speech recognition method further includes: adding a Hamming window to the audio frames.
Optionally, performing the Fourier transform on the audio frames to obtain the real part vector and the imaginary part vector of the multi-channel speech signal to be recognized includes:
for an audio frame X(k, i), where k denotes the time index in the range [0, K) and i denotes the channel index in the range [1, c], with c the number of channels, using the following formula to convert the time-domain signal into a frequency-domain signal:
X(n, i) = Σ_{k=0}^{K-1} x(k, i) · e^{-j·2πnk/K} = real(n, i) + j · imag(n, i)
where
real(n, i) = Σ_{k=0}^{K-1} x(k, i) · cos(2πnk/K)
is the real part and
imag(n, i) = -Σ_{k=0}^{K-1} x(k, i) · sin(2πnk/K)
is the imaginary part, so that the multi-channel input signal can be represented as a real part vector real(n, i) and an imaginary part vector imag(n, i).
Optionally, the complex deep neural network includes: a complex CNN, a 1 × 1 convolution and a Transformer structure;
the complex deep neural network is obtained by training through the following processes:
modeling with a complex neural network according to the real part vector and the imaginary part vector of a training speech signal;
and using the text corresponding to the training speech signal as the label of the complex deep neural network and the real part vector and the imaginary part vector of the training speech signal as its features, training to determine the complex deep neural network.
Optionally, training with the real part vector and the imaginary part vector of the training speech signal as the features of the complex deep neural network to determine the complex deep neural network includes:
splicing the real part vector and the imaginary part vector of the training speech signal and inputting the result into the complex CNN (convolutional neural network) for calculation;
feeding the output of the complex CNN into the 1 × 1 convolution;
inputting the output of the 1 × 1 convolution into the Transformer structure for processing, the Transformer structure including an encoder and a decoder, the output of the encoder being input to the decoder;
and training according to the output of the decoder structure to determine the complex deep neural network.
Optionally, in the complex deep neural network, with the real part vector denoted A and the imaginary part vector denoted B, W = A + Bi, and h = x + yi for one filter;
the expression of the 1 × 1 convolution is:
W*h = (A*x - B*y) + i(B*x + A*y)
expressed as a matrix:
[ real(W*h) ]   [ A  -B ] [ x ]
[ imag(W*h) ] = [ B   A ] [ y ]
and the complex ReLU (CReLU) is defined as:
CReLU(W*h) = ReLU(real(W*h)) + i · ReLU(imag(W*h))
optionally, the speech recognition method further includes: and scoring the output result of the decoder by using a beam search algorithm.
Optionally, training according to the output of the decoder structure includes:
training the output of the decoder structure using the logarithmic cross entropy (log-CE) as the discriminant function, with the L1 norm of all network parameters added as an additional loss term.
An embodiment of the present invention further provides a speech recognition system, including:
the acquisition module is used for acquiring a multi-channel speech signal to be recognized;
the processing module is used for processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames; performing a Fourier transform on the audio frames to obtain a real part vector and an imaginary part vector of the multi-channel speech signal to be recognized; inputting the real part vector and the imaginary part vector into a complex deep neural network obtained through training; and processing through the complex deep neural network to output the recognition result of the multi-channel speech signal to be recognized.
An embodiment of the present invention further provides a speech recognition apparatus, including: a processor, a memory storing a computer program which, when executed by the processor, performs the method as described above.
Embodiments of the present invention also provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method as described above.
The scheme of the invention at least comprises the following beneficial effects:
according to the scheme, a multi-channel speech signal to be recognized is acquired; the multi-channel speech signal is processed to obtain fixed-length audio frames; a Fourier transform is performed on the audio frames to obtain the real part vector and the imaginary part vector of the multi-channel speech signal; the real part vector and the imaginary part vector are input into a trained complex deep neural network; and the recognition result of the multi-channel speech signal is output after processing by the complex deep neural network. The method fully supports multi-microphone data as the input of a speech recognition system and uses complex neural networks to model the geometric information and the correlations between different channels; it models the features in complex space, taking the phase information fully into account, which together with the power spectrum improves the recognition rate; and the 1 × 1 convolution connects the input information of different channels, increasing the flow of information between channels.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the effect of re-estimating both the power spectrum and the phase in an embodiment of the present invention;
FIG. 3 is a diagram illustrating the effect of re-estimating the power spectrum only in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a complex deep neural network according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the structure of complex blocks according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating specific steps of a process of a speech recognition method according to an embodiment of the present invention;
fig. 7 is a block diagram of a speech recognition system according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
As shown in fig. 1, an embodiment of the present invention provides a speech recognition method, including:
step 11, acquiring a multi-channel speech signal to be recognized;
step 12, processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames;
step 13, performing a Fourier transform on the audio frames to obtain a real part vector and an imaginary part vector of the multi-channel speech signal to be recognized;
step 14, inputting the real part vector and the imaginary part vector into a complex deep neural network obtained through training;
and step 15, processing through the complex deep neural network to output the recognition result of the multi-channel speech signal to be recognized.
In this embodiment, the multi-channel speech signal may come from a microphone array; multi-microphone data serves as the input of the speech recognition system, and complex neural networks model the geometric information and the correlations between different channels; the features are modeled in complex space, taking the phase information fully into account, which together with the power spectrum improves the recognition rate; and the 1 × 1 convolution connects the input information of different channels, increasing the flow of information between channels.
In an alternative embodiment of the present invention, step 12 may include:
and step 121, segmenting the multi-channel speech signal to be recognized in the time domain to obtain a plurality of fixed-length audio frames.
In an optional embodiment of the present invention, the speech recognition method may further include:
step 122, adding a Hamming window to the audio frames. This prevents spectral leakage and facilitates network training.
In an alternative embodiment of the present invention, step 13 may include:
for an audio frame X(k, i), where k denotes the time index in the range [0, K) and i denotes the channel index in the range [1, c], with c the number of channels, the following formula converts the time-domain signal into a frequency-domain signal:
X(n, i) = Σ_{k=0}^{K-1} x(k, i) · e^{-j·2πnk/K} = real(n, i) + j · imag(n, i)
where
real(n, i) = Σ_{k=0}^{K-1} x(k, i) · cos(2πnk/K)
is the real part and
imag(n, i) = -Σ_{k=0}^{K-1} x(k, i) · sin(2πnk/K)
is the imaginary part, so the multi-channel input signal can be represented as a real part vector real(n, i) and an imaginary part vector imag(n, i).
For the training of the complex deep neural network, the same Fourier transform is applied to the audio signal: the original time-domain signal is converted into the frequency domain, where it can be represented as a power spectrum and a phase spectrum corresponding to the real part and imaginary part of the Fourier transform result.
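As an illustration of this front end, the following is a minimal NumPy sketch; the frame length of 400 samples, the hop of 160 samples and the 6-channel, 16 kHz input are illustrative assumptions, not values fixed by the invention:

    import numpy as np

    def frame_and_transform(signal, frame_len=400, hop=160):
        # Split a multi-channel signal (samples x channels) into fixed-length
        # frames, apply a Hamming window to suppress spectral leakage, and
        # return the real and imaginary part vectors of each frame's DFT.
        n_samples, n_channels = signal.shape
        window = np.hamming(frame_len)
        n_frames = 1 + (n_samples - frame_len) // hop
        real_parts = np.empty((n_frames, frame_len // 2 + 1, n_channels))
        imag_parts = np.empty_like(real_parts)
        for t in range(n_frames):
            frame = signal[t * hop : t * hop + frame_len, :]
            spec = np.fft.rfft(frame * window[:, None], axis=0)
            real_parts[t] = spec.real   # real(n, i)
            imag_parts[t] = spec.imag   # imag(n, i)
        return real_parts, imag_parts

    # Example: 1 second of 6-channel audio at 16 kHz
    x = np.random.randn(16000, 6)
    re, im = frame_and_transform(x)
    print(re.shape, im.shape)   # (98, 201, 6) each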
In an optional embodiment of the present invention, the complex deep neural network includes: a complex CNN, a 1 × 1 convolution and a Transformer structure;
wherein the complex deep neural network is obtained by training through the following processes:
modeling with a complex neural network according to the real part vector and the imaginary part vector of a training speech signal;
and using the text corresponding to the training speech signal as the label of the complex deep neural network and the real part vector and the imaginary part vector of the training speech signal as its features, training to determine the complex deep neural network.
Here, too, the training speech signal may be divided into fixed-length audio frames, to which a Hamming window is then added to prevent spectral leakage and facilitate training of the network.
For an input training speech signal X(k, i), where k denotes the time index in the range [0, K) and i denotes the channel index in the range [1, c], with c the number of channels (for example, c is 6 for a 6-microphone array), the following formula converts the time-domain signal into a frequency-domain signal:
X(n, i) = Σ_{k=0}^{K-1} x(k, i) · e^{-j·2πnk/K} = real(n, i) + j · imag(n, i)
where
real(n, i) = Σ_{k=0}^{K-1} x(k, i) · cos(2πnk/K)
is the real part and
imag(n, i) = -Σ_{k=0}^{K-1} x(k, i) · sin(2πnk/K)
is the imaginary part. The original multi-channel input signal can thus be represented as a real part vector real(n, i) and an imaginary part vector imag(n, i).
In an optional embodiment of the present invention, training with the real part vector and the imaginary part vector of the training speech signal as the features of the complex deep neural network to determine the complex deep neural network includes:
splicing the real part vector and the imaginary part vector of the training speech signal and inputting the result into the complex CNN (convolutional neural network) for calculation;
feeding the output of the complex CNN into the 1 × 1 convolution;
inputting the output of the 1 × 1 convolution into the Transformer structure for processing, the Transformer structure including an encoder and a decoder, the output of the encoder being input to the decoder;
and training according to the output of the decoder structure to determine the complex deep neural network.
Specifically, in the complex deep neural network, with the real part vector denoted A and the imaginary part vector denoted B, W = A + Bi, and h = x + yi for one filter;
the expression of the 1 × 1 convolution is:
W*h = (A*x - B*y) + i(B*x + A*y)
expressed as a matrix:
[ real(W*h) ]   [ A  -B ] [ x ]
[ imag(W*h) ] = [ B   A ] [ y ]
and the complex ReLU (CReLU) is defined as:
CReLU(W*h) = ReLU(real(W*h)) + i · ReLU(imag(W*h))
in this embodiment, the complex deep neural network provides an end-to-end multi-microphone-array speech recognition approach: no hand-designed features are required, and the length of the processed audio is no longer a constraint on the network architecture design. Unlike traditional array processing, the complex deep neural network implicitly completes the conversion of multi-channel data to single-channel data to meet the input requirement of the speech recognition system, and the speech signal processed in the previous step contains real part and imaginary part features. Existing schemes directly discard the imaginary part features of the speech, that is, its phase spectrum information.
Embodiments of the present invention overcome this problem by modeling with a complex neural network, which models the real and imaginary parts simultaneously and thus avoids the phase distortion inherent to real-spectrum modeling in conventional frequency-domain models. As shown in fig. 2, line 21 is the original waveform and line 22 is the result of re-estimating both the power spectrum and the phase; as shown in fig. 3, line 31 is the original waveform and line 32 is the result of re-estimating the power spectrum only.
The complex neural network structure described in the embodiments of the present invention is shown in fig. 4 and 5. Fig. 5 shows the structure of a complex block, where each complex block contains a complex CNN and a 1 × 1 convolution;
in the complex deep neural network, for convenience of notation, the real part vector is named A and the imaginary part vector B, W = A + Bi, and h = x + yi for one filter. Their convolution can be expressed as:
W*h = (A*x - B*y) + i(B*x + A*y)
expressed as a matrix:
[ real(W*h) ]   [ A  -B ] [ x ]
[ imag(W*h) ] = [ B   A ] [ y ]
The convolution calculations in the complex deep neural network can be carried out using the above equation.
In addition, the original ReLU activation function is not applicable, so the embodiment of the present invention uses a complex ReLU (CReLU), defined as follows:
CReLU(W*h) = ReLU(real(W*h)) + i · ReLU(imag(W*h))
in addition, the present embodiment uses 1 × 1 convolution, which brings nonlinear computation and dimensionality reduction to the deep network with almost no increase in the number of parameters. Since embodiments of the present invention are applied to a multi-microphone array whose modeling space is complex, reducing the parameter count is important. Moreover, although complex and real numbers are two data spaces, there are links between them, and the 1 × 1 convolution performs this cross-channel communication well, bringing gain to the complex CNN of the next layer.
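As an illustration, the following is a minimal PyTorch sketch of the complex 1 × 1 convolution and CReLU defined above; the module layout, tensor shapes and channel counts are assumptions for illustration, not the patented implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ComplexConv1x1(nn.Module):
        # Complex 1x1 convolution: W*h = (A*x - B*y) + i(B*x + A*y), where
        # A/B hold the real/imaginary filter weights and x/y are the
        # real/imaginary input feature maps.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.A = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.B = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

        def forward(self, x, y):
            real = self.A(x) - self.B(y)   # real(W*h)
            imag = self.B(x) + self.A(y)   # imag(W*h)
            return real, imag

    def crelu(real, imag):
        # CReLU: apply ReLU to the real and imaginary parts separately.
        return F.relu(real), F.relu(imag)

    # Toy usage: batch of 2, 6 input channels, 201 frequency bins, 98 frames
    x = torch.randn(2, 6, 201, 98)   # real part
    y = torch.randn(2, 6, 201, 98)   # imaginary part
    r, i = crelu(*ComplexConv1x1(6, 32)(x, y))
    print(r.shape, i.shape)          # torch.Size([2, 32, 201, 98]) each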
In the Transformer section, the Q, K and V matrices of the input Transformer all come from the computation of the last complex block. Embodiments of the present invention use an 8-head Transformer because, given the orthogonal nature of complex spatial data, the Transformer has stronger feature extraction capability than the CNN model. The Transformer also applies positional encoding to the input data during calculation, so the prediction sequence output by the decoder is more discriminative and better results are obtained after beam search.
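The sketch below illustrates this stage by feeding the (assumed) output of the last complex block into a standard 8-head Transformer encoder; d_model, the layer count and the projection dimensions are illustrative assumptions, and positional encoding is omitted for brevity:

    import torch
    import torch.nn as nn

    d_model, n_heads, n_layers = 256, 8, 6
    # Project each frame's concatenated real/imaginary features to d_model
    proj = nn.Linear(2 * 32 * 201, d_model)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=n_heads, batch_first=True),
        num_layers=n_layers)

    r = torch.randn(2, 32, 201, 98)               # real part from the complex block
    i = torch.randn(2, 32, 201, 98)               # imaginary part
    feats = torch.cat([r, i], dim=1)              # (batch, 64, freq, time)
    feats = feats.permute(0, 3, 1, 2).flatten(2)  # (batch, time, 64 * 201)
    memory = encoder(proj(feats))                 # (2, 98, 256), consumed by the decoder
    print(memory.shape)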
In an optional embodiment of the present invention, the speech recognition method may further include: scoring the output of the decoder using a beam search algorithm. The network output at decoding time comes from the Transformer, and since there is a set of candidate words as alternatives at each instant, the final word sequence is selected using beam search following the maximum a posteriori probability criterion. RTF = 0.29, and the decoding error rate is reduced from 7.75% to 6.65%.
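For illustration, the following is a minimal sketch of beam search under the maximum a posteriori probability criterion; in a real decoder the per-step distribution depends on the prefix chosen so far, which is omitted here, and the beam width and candidate lists are illustrative:

    import math

    def beam_search(step_log_probs, beam_width=4):
        # step_log_probs: one dict per decoding step mapping each candidate
        # token to its log-probability. Returns the highest-scoring sequence.
        beams = [([], 0.0)]   # (token sequence, cumulative log-probability)
        for candidates in step_log_probs:
            expanded = [(seq + [tok], score + lp)
                        for seq, score in beams
                        for tok, lp in candidates.items()]
            # keep only the beam_width best partial sequences
            beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
        return max(beams, key=lambda b: b[1])

    # Toy example: three decoding steps, two alternative characters each
    steps = [{"今": math.log(0.6), "金": math.log(0.4)},
             {"天": math.log(0.7), "填": math.log(0.3)},
             {"好": math.log(0.8), "号": math.log(0.2)}]
    seq, score = beam_search(steps)
    print("".join(seq), score)   # the maximum a posteriori sequence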
In an optional embodiment of the present invention, training according to the output of the decoder structure includes:
the output of the decoder structure is trained using the logarithmic cross entropy (log-CE) as the discriminant function, with the L1 norm of all network parameters added as an additional loss term. Adam may be used as the optimization method. The L1 norm is the sum of the absolute values of the elements of a vector x; it can measure the difference between two vectors, for example as the sum of absolute differences.
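A minimal sketch of this training objective follows, assuming a PyTorch model; the weight on the L1 term is an illustrative assumption:

    import torch
    import torch.nn as nn

    def loss_with_l1(logits, targets, model, l1_weight=1e-5):
        # Log cross entropy on the decoder output plus the L1 norm of all
        # network parameters as an additional loss term (l1_weight assumed).
        ce = nn.functional.cross_entropy(logits, targets)
        l1 = sum(p.abs().sum() for p in model.parameters())
        return ce + l1_weight * l1

    # Toy usage with Adam as the optimizer
    model = nn.Linear(256, 5000)          # stand-in for the full network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    logits = model(torch.randn(8, 256))   # (batch, vocabulary)
    targets = torch.randint(0, 5000, (8,))
    loss = loss_with_l1(logits, targets, model)
    loss.backward()
    optimizer.step()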
As shown in fig. 6, which is a flow chart of the speech recognition method of the present invention, an input speech signal is converted into a corresponding text sequence by a speech recognition system. The specific process is as follows:
First, the input speech signal is divided into frames and windowed, which facilitates computer processing and suppresses the spectral-leakage interference from surrounding frames; the original time-domain signal is then converted to the frequency domain using the DFT (discrete Fourier transform), at which point the signal can be represented as a power spectrum and a phase spectrum corresponding to the real part and imaginary part of the Fourier transform result. The text labels corresponding to the speech are then used as the network's labels, and the real part and imaginary part results are fed into the complex deep network as features to train the network.
During recognition, the speech signal to be recognized is fed into the network for inference; the network outputs the candidate Chinese character sequence with several alternative answers for each character, and beam search scoring then selects the highest-scoring Chinese sequence as the recognition result for that utterance.
In the above embodiments of the present invention, the complex deep neural network used includes a complex CNN, a 1 × 1 convolution and a Transformer structure. The Transformer structure comprises an encoder part and a decoder part, and a beam search algorithm is additionally used for scoring during decoding.
The method and the system of the embodiments of the present invention can be deployed as a speech recognition scheme in a conference room, and noisy data can be used during training to build a noise-robust speech recognition system. The method fully supports multi-microphone data as the input of a speech recognition system and uses complex neural networks to model the geometric information and the correlations between different channels; it models the features in complex space, taking the phase information fully into account, which together with the power spectrum improves the recognition rate; and the 1 × 1 convolution connects the input information of different channels, increasing the flow of information between channels.
As shown in fig. 7, an embodiment of the present invention further provides a speech recognition system 70, including:
an obtaining module 71, configured to obtain a multi-channel speech signal to be recognized;
a processing module 72, configured to process the multi-channel speech signal to be recognized to obtain fixed-length audio frames; perform a Fourier transform on the audio frames to obtain a real part vector and an imaginary part vector of the multi-channel speech signal to be recognized; input the real part vector and the imaginary part vector into a complex deep neural network obtained through training; and process through the complex deep neural network to output the recognition result of the multi-channel speech signal to be recognized.
Processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames includes:
segmenting the multi-channel speech signal to be recognized in the time domain to obtain a plurality of fixed-length audio frames.
Optionally, the speech recognition method further includes:
adding a Hamming window to the audio frames.
Optionally, performing the Fourier transform on the audio frames to obtain the real part vector and the imaginary part vector of the multi-channel speech signal to be recognized includes:
for an audio frame X(k, i), where k denotes the time index in the range [0, K) and i denotes the channel index in the range [1, c], with c the number of channels, using the following formula to convert the time-domain signal into a frequency-domain signal:
X(n, i) = Σ_{k=0}^{K-1} x(k, i) · e^{-j·2πnk/K} = real(n, i) + j · imag(n, i)
where
real(n, i) = Σ_{k=0}^{K-1} x(k, i) · cos(2πnk/K)
is the real part and
imag(n, i) = -Σ_{k=0}^{K-1} x(k, i) · sin(2πnk/K)
is the imaginary part, so that the multi-channel input signal can be represented as a real part vector real(n, i) and an imaginary part vector imag(n, i).
Optionally, the complex deep neural network includes: a complex CNN, a 1 × 1 convolution and a Transformer structure;
the complex deep neural network is obtained by training through the following processes:
modeling with a complex neural network according to the real part vector and the imaginary part vector of a training speech signal;
and using the text corresponding to the training speech signal as the label of the complex deep neural network and the real part vector and the imaginary part vector of the training speech signal as its features, training to determine the complex deep neural network.
Optionally, training with the real part vector and the imaginary part vector of the training speech signal as the features of the complex deep neural network to determine the complex deep neural network includes:
splicing the real part vector and the imaginary part vector of the training speech signal and inputting the result into the complex CNN (convolutional neural network) for calculation;
feeding the output of the complex CNN into the 1 × 1 convolution;
inputting the output of the 1 × 1 convolution into the Transformer structure for processing, the Transformer structure including an encoder and a decoder, the output of the encoder being input to the decoder;
and training according to the output of the decoder structure to determine the complex deep neural network.
Optionally, in the complex deep neural network, with the real part vector denoted A and the imaginary part vector denoted B, W = A + Bi, and h = x + yi for one filter;
the expression of the 1 × 1 convolution is:
W*h = (A*x - B*y) + i(B*x + A*y)
expressed as a matrix:
[ real(W*h) ]   [ A  -B ] [ x ]
[ imag(W*h) ] = [ B   A ] [ y ]
and the complex ReLU (CReLU) is defined as:
CReLU(W*h) = ReLU(real(W*h)) + i · ReLU(imag(W*h))
optionally, the speech recognition method further includes: and scoring the output result of the decoder by using a beam search algorithm.
Optionally, training according to the output of the decoder structure includes:
training the output of the decoder structure using the logarithmic cross entropy (log-CE) as the discriminant function, with the L1 norm of all network parameters added as an additional loss term.
It should be noted that the system is a system corresponding to the above method, and all implementation manners of the above method are applicable to the embodiment of the system, and the same technical effect can be achieved.
An embodiment of the present invention further provides a speech recognition apparatus, including: a processor, a memory storing a computer program which, when executed by the processor, performs the method as described above. All the implementation modes of the method are suitable for the embodiment of the system, and the same technical effect can be achieved.
Embodiments of the present invention also provide a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method as described above. All the implementation modes of the method are suitable for the embodiment of the system, and the same technical effect can be achieved.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
Furthermore, it is to be noted that in the device and method of the invention, it is obvious that the individual components or steps can be decomposed and/or recombined. These decompositions and/or recombinations are to be regarded as equivalents of the present invention. Also, the steps of performing the series of processes described above may naturally be performed chronologically in the order described, but need not necessarily be performed chronologically, and some steps may be performed in parallel or independently of each other. It will be understood by those skilled in the art that all or any of the steps or elements of the method and apparatus of the present invention may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or any combination thereof, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present invention.
Thus, the objects of the invention may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. The object of the invention is thus also achieved solely by providing a program product comprising program code for implementing the method or the apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future. It is further noted that in the apparatus and method of the present invention, it is apparent that each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be regarded as equivalents of the present invention. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A method of speech recognition, the method comprising:
acquiring a multi-channel speech signal to be recognized;
processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames;
performing a Fourier transform on the audio frames to obtain a real part vector and an imaginary part vector of the multi-channel speech signal to be recognized;
inputting the real part vector and the imaginary part vector into a complex deep neural network obtained through training;
and processing through the complex deep neural network to output the recognition result of the multi-channel speech signal to be recognized.
2. The speech recognition method of claim 1, wherein processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames comprises:
segmenting the multi-channel speech signal to be recognized in the time domain to obtain a plurality of fixed-length audio frames.
3. The speech recognition method of claim 2, further comprising:
adding a Hamming window to the audio frames.
4. The speech recognition method of claim 1, wherein performing the Fourier transform on the audio frames to obtain the real part vector and the imaginary part vector of the multi-channel speech signal to be recognized comprises:
for an audio frame X(k, i), where k denotes the time index in the range [0, K) and i denotes the channel index in the range [1, c], with c the number of channels, using the following formula to convert the time-domain signal into a frequency-domain signal:
X(n, i) = Σ_{k=0}^{K-1} x(k, i) · e^{-j·2πnk/K} = real(n, i) + j · imag(n, i)
where
real(n, i) = Σ_{k=0}^{K-1} x(k, i) · cos(2πnk/K)
is the real part and
imag(n, i) = -Σ_{k=0}^{K-1} x(k, i) · sin(2πnk/K)
is the imaginary part, so that the multi-channel input signal can be represented as a real part vector real(n, i) and an imaginary part vector imag(n, i).
5. The speech recognition method of claim 1, wherein the complex deep neural network comprises: a complex CNN, a 1 × 1 convolution and a Transformer structure;
the complex deep neural network is obtained by training through the following processes:
modeling with a complex neural network according to the real part vector and the imaginary part vector of a training speech signal;
and using the text corresponding to the training speech signal as the label of the complex deep neural network and the real part vector and the imaginary part vector of the training speech signal as its features, training to determine the complex deep neural network.
6. The speech recognition method of claim 5, wherein training with the real part vector and the imaginary part vector of the training speech signal as features of the complex deep neural network to determine the complex deep neural network comprises:
splicing the real part vector and the imaginary part vector of the training speech signal and inputting the result into the complex CNN (convolutional neural network) for calculation;
feeding the output of the complex CNN into the 1 × 1 convolution;
inputting the output of the 1 × 1 convolution into the Transformer structure for processing, the Transformer structure including an encoder and a decoder, the output of the encoder being input to the decoder;
and training according to the output of the decoder structure to determine the complex deep neural network.
7. The speech recognition method of claim 6,
in the complex deep neural network, a real part vector is denoted by A and an imaginary part vector by B, W = A + Bi, and h = x + yi for one filter;
the expression of the 1 × 1 convolution is:
W*h = (A*x - B*y) + i(B*x + A*y)
expressed as a matrix:
[ real(W*h) ]   [ A  -B ] [ x ]
[ imag(W*h) ] = [ B   A ] [ y ]
and the complex ReLU (CReLU) is defined as:
CReLU(W*h) = ReLU(real(W*h)) + i · ReLU(imag(W*h)).
8. the speech recognition method of claim 6, further comprising:
scoring the output of the decoder using a beam search algorithm.
9. The speech recognition method of claim 6, wherein training according to the output of the decoder structure comprises:
training the output of the decoder structure using the logarithmic cross entropy (log-CE) as the discriminant function, with the L1 norm of all network parameters added as an additional loss term.
10. A speech recognition system, comprising:
the acquisition module is used for acquiring a multi-channel speech signal to be recognized;
the processing module is used for processing the multi-channel speech signal to be recognized to obtain fixed-length audio frames; performing a Fourier transform on the audio frames to obtain a real part vector and an imaginary part vector of the multi-channel speech signal to be recognized; inputting the real part vector and the imaginary part vector into a complex deep neural network obtained through training; and processing through the complex deep neural network to output the recognition result of the multi-channel speech signal to be recognized.
11. A speech recognition device, comprising: a processor, a memory storing a computer program which, when executed by the processor, performs the method of any of claims 1 to 9.
12. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any of claims 1 to 9.
CN202011060928.0A 2020-09-30 2020-09-30 Voice recognition method, system and equipment Pending CN114333811A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060928.0A CN114333811A (en) 2020-09-30 2020-09-30 Voice recognition method, system and equipment


Publications (1)

Publication Number Publication Date
CN114333811A true CN114333811A (en) 2022-04-12

Family

ID=81011359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060928.0A Pending CN114333811A (en) 2020-09-30 2020-09-30 Voice recognition method, system and equipment

Country Status (1)

Country Link
CN (1) CN114333811A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221846A (en) * 2022-06-08 2022-10-21 华为技术有限公司 Data processing method and related equipment
CN117935838A (en) * 2024-03-25 2024-04-26 深圳市声扬科技有限公司 Audio acquisition method and device, electronic equipment and storage medium
CN117935838B (en) * 2024-03-25 2024-06-11 深圳市声扬科技有限公司 Audio acquisition method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination