CN108564940B - Speech recognition method, server and computer-readable storage medium

Speech recognition method, server and computer-readable storage medium

Info

Publication number
CN108564940B
CN108564940B (application CN201810227474.8A)
Authority
CN
China
Prior art keywords
model
neural network
speech
phoneme
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810227474.8A
Other languages
Chinese (zh)
Other versions
CN108564940A (en)
Inventor
梁浩
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810227474.8A
Priority to PCT/CN2018/102204 (published as WO2019179034A1)
Publication of CN108564940A
Application granted
Publication of CN108564940B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
    • G10L 25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of analysis window
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units

Abstract

The invention discloses a speech recognition method, which comprises the following steps: constructing an acoustic model; when an original speech signal is acquired, preprocessing the speech signal to extract a valid speech portion; extracting acoustic features from the valid speech portion; inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through a trained phoneme training model, and outputting the recognition result to a trained hybrid neural network model based on memory unit connection; and outputting, through the trained hybrid neural network model based on memory unit connection, the text information corresponding to the speech information according to the received recognition result. The invention also provides a server and a computer-readable storage medium. The speech recognition method, server and computer-readable storage medium provided by the invention can improve the accuracy of speech recognition.

Description

Speech recognition method, server and computer-readable storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, a server, and a computer-readable storage medium.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to enable a machine to convert speech signals into text through recognition and understanding, and is an important branch of modern artificial intelligence. Speech recognition is a prerequisite for natural language processing and can effectively promote the development of voice-controlled interaction. It greatly facilitates daily life, for example in smart homes and voice input, and allows command operation by users whose hands and eyes are otherwise occupied or less able, such as middle-aged and elderly users, or in scenarios such as driving and travelling. The accuracy of speech recognition directly determines the effectiveness of these applications, but the accuracy of current speech recognition does not yet meet users' requirements.
Disclosure of Invention
In view of the above, the present invention provides a speech recognition method, a server and a computer readable storage medium, which can improve the accuracy of speech recognition.
First, to achieve the above object, the present invention provides a speech recognition method, including:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a hybrid neural network model based on memory unit connection;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the trained phoneme training model, and outputting the recognition result to the trained hybrid neural network model based on memory unit connection;
and outputting, through the trained hybrid neural network model based on memory unit connection, the text information corresponding to the speech information according to the received recognition result.
Optionally, the step of preprocessing the speech signal to extract an effective speech part when the speech signal is acquired specifically includes:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise from the short-time stationary signal, and extracting the effective speech part, wherein the effective speech part is the short-time stationary signal within a preset frequency range.
Optionally, the step of extracting acoustic features from the valid speech part specifically includes:
fourier transforming the effective speech portion to convert the speech portion in the time domain to an energy spectrum in the frequency domain;
according to the energy spectrum, highlighting formant features of the voice part through a set of Mel-scale triangular filter banks;
and obtaining acoustic characteristics by performing discrete cosine transform on the energy spectrum output by the triangular filter bank.
Optionally, the phoneme training model includes a monophone model and a triphone model, and the step of inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the phoneme training model, and outputting the recognition result to the hybrid neural network model based on memory unit connection specifically includes:
comparing the similarity of different phoneme pronunciations according to the acoustic features through the monophone model, and outputting an alignment result to the triphone model;
taking into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and outputting a forced phoneme alignment result;
and outputting the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Optionally, the acoustic feature is an MFCC (Mel-frequency cepstral coefficient) feature.
In addition, to achieve the above object, the present invention further provides a server, which includes a memory and a processor, wherein the memory stores a speech recognition system operable on the processor, and the speech recognition system implements the following steps when executed by the processor:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a hybrid neural network model based on memory unit connection;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the trained phoneme training model, and outputting the recognition result to the trained hybrid neural network model based on memory unit connection;
and outputting, through the trained hybrid neural network model based on memory unit connection, the text information corresponding to the speech information according to the received recognition result.
Optionally, the step of preprocessing the speech signal to extract an effective speech part when the speech signal is acquired specifically includes:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise from the short-time stationary signal, and extracting the effective speech part, wherein the effective speech part is the short-time stationary signal within a preset frequency range.
Optionally, the step of extracting acoustic features from the valid speech part specifically includes:
fourier transforming the effective speech portion to convert the speech portion in the time domain to an energy spectrum in the frequency domain;
according to the energy spectrum, highlighting formant features of the voice part through a set of Mel-scale triangular filter banks;
and obtaining an acoustic feature by performing a discrete cosine transform on the energy spectrum output by the triangular filter bank, wherein the acoustic feature is an MFCC (Mel-frequency cepstral coefficient) feature.
Optionally, the phoneme training model includes a monophone model and a triphone model, and the step of inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through the phoneme training model, and outputting the recognition result to the hybrid neural network model based on memory unit connection specifically includes:
comparing the similarity of different phoneme pronunciations according to the acoustic features through the monophone model, and outputting an alignment result to the triphone model;
taking into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and outputting a forced phoneme alignment result;
and outputting the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a speech recognition system, which can be executed by at least one processor to cause the at least one processor to execute the steps of the speech recognition method as described above.
Compared with the prior art, in the speech recognition method, server and computer-readable storage medium provided by the present invention, the constructed acoustic model comprises a phoneme training model and a hybrid neural network model. The hybrid neural network model comprises a highway long short-term memory recurrent neural network HLSTM-RNN based on memory unit connection, a convolutional neural network CNN, a feed-forward neural network DNN and a hidden Markov model HMM. Speaker differences are reduced through the CNN-HMM, the time sequence information of the speech is captured through the RNN-LSTM-HMM, which uses the historical information within a sentence for context modeling, and different phonemes are distinguished through the DNN-HMM, which outputs by classification the recognized phonemes corresponding to the input speech information, so that the accuracy of phoneme recognition can be effectively improved. When an original speech signal is obtained, it is preprocessed to extract a valid speech portion, and acoustic features are extracted from the valid speech portion; the acoustic features are then input into the acoustic model, phoneme recognition is performed on them through the trained phoneme training model, and the recognition result is output to the trained hybrid neural network model based on memory unit connection. Finally, the text information corresponding to the speech information is output according to the received recognition result through the trained hybrid neural network model based on memory unit connection. Because the original speech signal is preprocessed, acoustic features are extracted and speech recognition is performed through the acoustic model, the accuracy of speech recognition is improved.
Drawings
FIG. 1 is a schematic diagram of an alternative hardware architecture for a server according to the present invention;
FIG. 2 is a schematic diagram of program modules of a first embodiment of the speech recognition system of the present invention;
FIG. 3 is a schematic diagram of program modules of a second embodiment of the speech recognition system of the present invention;
FIG. 4 is a flowchart illustrating a first embodiment of a speech recognition method according to the present invention;
FIG. 5 is a flowchart illustrating a speech recognition method according to a second embodiment of the present invention.
Reference numerals: see the component tables in the original document (images GDA0002150984740000051 and GDA0002150984740000061).
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that such a combination can be realized by a person skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and does not fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an alternative hardware architecture of the server 2. In this embodiment, the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which may be communicatively connected to each other through a system bus. It is noted that fig. 1 only shows the server 2 with components 11-13, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 11 may be an internal storage unit of the server 2, such as a hard disk or a memory of the server 2. In other embodiments, the memory 11 may also be an external storage device of the server 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the server 2. Of course, the memory 11 may also comprise both an internal storage unit of the server 2 and an external storage device thereof. In this embodiment, the memory 11 is generally used for storing an operating system installed in the server 2 and various types of application software, such as program codes of the speech recognition system 200. Furthermore, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally configured to control the overall operation of the server 2, such as performing control and processing related to data interaction or communication with the terminal device 1. In this embodiment, the processor 12 is configured to operate the program codes stored in the memory 11 or process data, such as operating the speech recognition system 200.
The network interface 13 may comprise a wireless network interface or a wired network interface, and the network interface 13 is generally used for establishing communication connection between the server 2 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the server 2 with one or more other electronic devices through a network, and establish a data transmission channel and a communication connection between the server 2 and the electronic devices.
The application environment and the hardware structure and function of the related devices of the various embodiments of the present invention have been described in detail so far. Hereinafter, various embodiments of the present invention will be proposed based on the above-described application environment and related devices.
First, the present invention provides a speech recognition system 200.
Referring to FIG. 2, a program module diagram of a first embodiment of a speech recognition system 200 according to the present invention is shown.
In this embodiment, the speech recognition system 200 includes a series of computer program instructions stored on the memory 11 that, when executed by the processor 12, may perform speech recognition operations according to embodiments of the present invention. In some embodiments, the speech recognition system 200 may be divided into one or more modules based on the particular operations implemented by the portions of the computer program instructions. For example, in fig. 2, the speech recognition system 200 may be partitioned into a construction module 201, a processing module 202, an extraction module 203, a recognition module 204, and an output module 205. Wherein:
the building module 201 is configured to build an acoustic model, where the acoustic model includes a phoneme training model and a mixed neural network model based on memory unit connection.
Specifically, the server 2 constructs an acoustic model through the construction module 201. In this embodiment, the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection (a CLDNN-style automatic speech recognition architecture). The hybrid model integrates a highway long short-term memory recurrent neural network (HLSTM-RNN) based on memory unit connection, a convolutional neural network (CNN), a feed-forward neural network (DNN) and a hidden Markov model (HMM) into one deep hybrid neural network model. The CNN-HMM is used to reduce speaker differences (unlike voiceprint recognition, which focuses on the differences between speakers, speech recognition focuses on the spoken content itself, so speaker differences should be suppressed), the RNN-LSTM-HMM captures the time sequence information of the speech (the historical information within a sentence is used for context modeling), and the DNN-HMM then distinguishes different phonemes and outputs, by classification, the recognized phonemes corresponding to the input speech information. For example, if the user utters only a word pronounced "gong-shi", it is difficult to determine which word is meant, and each user's pronunciation differs to some extent; Mandarin has many homophones of "gong-shi" (formula, work, notice, attack). The RNN-LSTM-HMM captures the time sequence information of the speech; if the captured context is "calculate the maximum value by using a formula", the "gong-shi" in the sentence can be determined to be the word "formula" from the context information around it.
The processing module 202 is configured to, when an original voice signal is obtained, pre-process the voice signal to extract an effective voice portion.
Specifically, when acquiring an original speech signal, the server 2 preprocesses it through the processing module 202. In this embodiment, the processing module 202 first pre-emphasizes the original speech signal to boost its high-frequency part and flatten the spectrum. The processing module 202 then frames and windows the pre-emphasized speech signal to convert the non-stationary speech signal into short-time stationary signals. Further, the processing module 202 distinguishes speech from noise through endpoint detection, removes the noise in the short-time stationary signal, and extracts the valid speech portion. The frequency of the human voice is roughly 65-1100 Hz; in this embodiment, the processing module 202 may set a preset frequency range within 65-1100 Hz, remove sounds outside the preset frequency range (i.e. noise), and extract the short-time stationary signal within the preset frequency range.
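For clarity, the following is a minimal, illustrative sketch of this preprocessing chain in Python with NumPy. It is not the patented implementation: the 0.97 pre-emphasis coefficient, the 25 ms frame length, the 10 ms hop and the simple energy threshold standing in for full endpoint detection are all assumptions.

```python
# Illustrative preprocessing sketch (not the patented implementation).
# Assumes a float mono waveform at 16 kHz that is at least one frame long.
import numpy as np

def preprocess(signal, sample_rate=16000, pre_emph=0.97,
               frame_ms=25, hop_ms=10, energy_ratio=0.1):
    # Pre-emphasis: boost the high-frequency part to flatten the spectrum.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Framing + Hamming window: treat each short frame as quasi-stationary.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # Crude energy-based endpoint detection: keep frames whose energy exceeds
    # a fraction of the maximum (a stand-in for the 65-1100 Hz band-limiting
    # described above, which a real system would do in the frequency domain).
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > energy_ratio * energy.max()]
```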
The extracting module 203 is configured to extract an acoustic feature from the valid speech portion as an input of the acoustic model.
Specifically, the server 2 extracts acoustic features from the valid speech portion through the extraction module 203. In this embodiment, the extraction module 203 first performs a Fourier transform on the valid speech portion to convert the speech signal in the time domain into an energy spectrum in the frequency domain. The extraction module 203 then passes the energy spectrum through a set of Mel-scale triangular filter banks to highlight the formant features of the speech. Further, the logarithmic energy output by each filter bank is calculated. After the logarithmic energy calculation, a discrete cosine transform is applied to the energy spectrum output by the triangular filter bank to obtain the MFCCs (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features.
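The same feature chain (Fourier transform, Mel-scale triangular filter bank, logarithmic energy, discrete cosine transform) can be sketched with the librosa library, which bundles these steps, including its own framing and windowing, into one call. The choice of librosa, the 13 coefficients and the 25 ms / 10 ms analysis window are assumptions, not details taken from the patent.

```python
# Minimal MFCC extraction sketch using librosa (an assumption; the patent
# does not name a library). The pipeline inside librosa.feature.mfcc mirrors
# the text: FFT -> Mel triangular filter bank -> log energy -> DCT.
import numpy as np
import librosa

def extract_mfcc(signal, sample_rate=16000, n_mfcc=13):
    # 13 cepstral coefficients matches the 13-dimensional MFCC mentioned
    # later in this description; n_fft/hop_length give 25 ms / 10 ms at 16 kHz.
    mfcc = librosa.feature.mfcc(y=signal.astype(np.float32), sr=sample_rate,
                                n_mfcc=n_mfcc, n_fft=400, hop_length=160)
    return mfcc.T  # shape: (n_frames, n_mfcc)
```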
The recognition module 204 performs phoneme recognition on the acoustic features through the trained phoneme training model and outputs the recognition result to the trained hybrid neural network model based on memory unit connection.
Specifically, the recognition module 204 performs phoneme recognition on the acoustic features through the trained phoneme training model; phoneme recognition mainly covers the recognition of the words and phrases in a sentence. During speech recognition, according to the occurrence probabilities of the states in the HMM, i.e. the similarity between different pronunciations, the path with the maximum occurrence probability is selected in the decoding network as the final output result.
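The selection of the maximum-probability path can be illustrated with a toy Viterbi routine over log-probabilities. A real decoder searches a weighted FST that also carries lexicon and language-model scores, so this sketch only shows the dynamic-programming core with made-up inputs.

```python
# Toy Viterbi sketch: pick the state path with the highest joint probability,
# as described above for the HMM decoding network. Inputs are assumed to be
# log-probabilities; nothing here is taken from the patented decoder.
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """log_init: (S,), log_trans: (S, S), log_obs: (T, S) frame log-likelihoods."""
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (prev state, current state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrace the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())
```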
The output module 205 is configured to output text information corresponding to the speech information according to the received recognition result through the trained memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the text information corresponding to the speech information according to the received recognition result through the trained hybrid neural network model based on memory unit connection. In this embodiment, all nodes of the hybrid neural network model based on memory unit connection are initialized with uniform random weights in the range [-0.05, 0.05], and the biases are initialized to 0. The neural network is trained with the cross-entropy criterion (CE, a training criterion that measures the divergence between the network output and the target) and optimized with truncated back-propagation through time (BPTT). Each segment of the model contains 20 frames of information, and each minibatch contains 40 spoken sentences. Furthermore, for the momentum factor (momentum, a variable that controls the acceleration of neural network training), the first epoch uses 0 and subsequent epochs use 0.9.
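A minimal sketch of these training settings, written in PyTorch as an assumption (the patent does not name a framework), is given below. The weight range, the momentum schedule, the 20-frame truncated-BPTT segments and the 40-utterance minibatch follow the text; the learning rate is a placeholder.

```python
# Sketch of the training settings described above. `model` is any nn.Module;
# only the values explicitly stated in the text are taken from it, the rest
# (learning rate, framework) are assumptions.
import torch
import torch.nn as nn

def init_weights(module):
    # Uniform random weights in [-0.05, 0.05], biases initialized to 0.
    for name, param in module.named_parameters():
        if "bias" in name:
            nn.init.zeros_(param)
        else:
            nn.init.uniform_(param, -0.05, 0.05)

def make_optimizer(model, epoch):
    # Momentum 0 for the first epoch, 0.9 afterwards, as stated above.
    momentum = 0.0 if epoch == 0 else 0.9
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=momentum)

criterion = nn.CrossEntropyLoss()  # cross-entropy (CE) training criterion

def bptt_chunks(features, labels, chunk=20):
    # Truncated BPTT: split each utterance into 20-frame segments; a loader
    # would then group 40 such utterances into one minibatch.
    for start in range(0, features.size(0), chunk):
        yield features[start:start + chunk], labels[start:start + chunk]
```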
Through the program modules 201-205, the speech recognition system 200 provided by the present invention first constructs an acoustic model, wherein the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection; then, when an original speech signal is obtained, the speech signal is preprocessed to extract a valid speech portion; further, acoustic features are extracted from the valid speech portion; next, the acoustic features are input into the acoustic model, phoneme recognition is performed on them through the trained phoneme training model, and the recognition result is output to the trained hybrid neural network model based on memory unit connection; finally, the text information corresponding to the speech information is output according to the received recognition result through the trained hybrid neural network model based on memory unit connection. Because the original speech signal is preprocessed, acoustic features are extracted and speech recognition is performed through the acoustic model, the accuracy of speech recognition is improved.
Further, based on the above-described first embodiment of the speech recognition system 200 of the present invention, a second embodiment of the present invention is proposed (as shown in fig. 3). In this embodiment, the speech recognition system 200 further comprises a comparison module 206 and a combination module 207, and the phoneme training model comprises a monophone model and a triphone model, wherein:
the comparing module 206 is configured to compare the similarity of different phone pronunciations according to the acoustic features through the single-phone model, and output a single-factor alignment result to the triple-phone model through the output module 205.
Generally, each person's pronunciation differs, or a local accent results in non-standard pronunciation. Therefore, in the present embodiment, the comparison module 206 compares, through the monophone model and according to the acoustic features, the similarity of different phoneme pronunciations with the dictionary phonemes (standard pronunciations), and outputs a monophone alignment result to the triphone model through the output module 205.
In this embodiment, the monophone model is trained as follows. First, the input acoustic features are normalized (by default, the variance is normalized). Next, an initialized HMM-GMM model and a decision tree are obtained from the acoustic feature data. A training network is then constructed: a phoneme-level FST network is built for decoding each sentence, and during training the feature sequences are repeatedly aligned to accumulate intermediate statistics. The HMM statistics are the occurrence counts of the arcs connecting two phonemes in the FST network, while the GMM statistics are the accumulated feature values and accumulated squared feature values corresponding to each pdf-id, which are used to update the two sufficient statistics of the GMM, the mean and the variance. The decoding network is trained by continuously updating the model in this way. Finally, forced alignment is performed again, either to decode an output result or for the next stage of model training.
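A greatly simplified analogue of this monophone stage can be sketched with the hmmlearn library: one small GMM-HMM per phone is re-estimated with EM on normalized features, and the Viterbi state sequence stands in for forced alignment. This is an assumption for illustration only; the pipeline described above (phoneme-level FST construction, arc statistics, pdf-id statistics) resembles Kaldi-style training and is far more involved.

```python
# Conceptual monophone HMM-GMM sketch using hmmlearn (an assumption; not the
# FST-based training described above). hmmlearn's GMMHMM is ergodic by
# default, whereas a real acoustic model would use a left-to-right topology.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_monophones(frames_per_phone, n_states=3, n_mix=4):
    """frames_per_phone: dict mapping phone -> (N, 13) array of MFCC frames."""
    models = {}
    for phone, feats in frames_per_phone.items():
        # Mean/variance normalization of the input features (the text
        # normalizes the variance by default).
        feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type="diag", n_iter=10)
        hmm.fit(feats)            # EM re-estimation ("repeated training")
        models[phone] = hmm
    return models

def align(model, feats):
    # Forced-alignment analogue: most likely HMM state for each frame.
    return model.predict(feats)
```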
The combination module 207 is configured to take into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and to output a forced phoneme alignment result.
Specifically, the triphone model aligns the phonemes one by one while taking into account the influence of the phonemes immediately before and after the phoneme currently being aligned, so that a more accurate alignment and a better recognition result are obtained. For example, Mandarin has different characters that share the same syllable (quiet, clean, competitive, all pronounced "jing") as well as homophones such as "gong-shi" (formula, work, notice, attack). Through triphone training, the influence of the preceding and following phonemes of the current phoneme, i.e. its context, can be taken into account, making the recognition of the current phoneme more accurate. For example, suppose the user says "zen-me-li-yong-gong-shi-zheng-ming-deng-shi-cheng-li?" (how to use a formula to prove that the equation holds). When the phoneme currently being recognized is "gong-shi", which has many homophones, "gong-shi" can be determined to mean "formula" from the context provided by the related phonemes "zheng-ming" (prove) and "deng-shi" (equation).
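The idea of context-dependent units can be shown with a few lines that re-label each phone together with its left and right neighbours; real systems additionally cluster these contexts with a decision tree, which is omitted here, and the "sil" boundary symbol is an assumption.

```python
# Tiny illustration of context-dependent (triphone) units: each phone is
# re-labelled with its left and right neighbours, which is how the triphone
# model obtains the contextual information described above.
def to_triphones(phones, boundary="sil"):
    padded = [boundary] + list(phones) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Example: the syllables of "gong-shi" as a short phone sequence.
print(to_triphones(["g", "ong", "sh", "i"]))
# [('sil', 'g', 'ong'), ('g', 'ong', 'sh'), ('ong', 'sh', 'i'), ('sh', 'i', 'sil')]
```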
In this embodiment, the triphone model is trained on delta + delta-delta feature transforms, in which first- and second-order delta features are appended to the original MFCC features (delta is the time derivative of the MFCC features, and delta-delta is the second derivative obtained in the same way). The original MFCC feature is 13-dimensional; after the delta + delta-delta features are added, the input feature becomes 39-dimensional.
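The delta + delta-delta expansion can be sketched as follows; simple frame-wise differences via numpy.gradient are used as an assumption, whereas production front ends usually compute the derivatives over a small regression window.

```python
# Sketch of the delta + delta-delta expansion: first- and second-order time
# derivatives are appended to the 13-dimensional MFCCs, giving the
# 39-dimensional input mentioned above.
import numpy as np

def add_deltas(mfcc):
    """mfcc: (T, 13) -> (T, 39)"""
    delta = np.gradient(mfcc, axis=0)      # first derivative over time
    delta2 = np.gradient(delta, axis=0)    # second derivative
    return np.concatenate([mfcc, delta, delta2], axis=1)
```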
The output module 205 is further configured to output the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the forced phoneme alignment result to the memory unit connection-based hybrid neural network model through the output module 205.
In this embodiment, the hybrid neural network model based on memory unit connection processes the forced phoneme alignment result as follows. The forced phoneme alignment result is first output to a CNN model. The CNN model has a convolutional layer containing 256 convolution kernels (each a 1 × 8 matrix), and each kernel generates one feature map for extracting a different feature. The CNN model also has a non-overlapping max-pooling layer whose window size and stride (the step by which the window is moved over the input) are both 3; pooling reduces the dimensionality of the convolutional layer's output. For example, with an 83-dimensional feature vector as input, the parameters become 83 × 256 after the convolutional layer and (83/3) × 256 after the max-pooling layer. Although this is a reduction to one third, the model parameters are still too large for speech recognition, so a projection layer is connected after the max-pooling layer to continue the dimensionality reduction, reducing (83/3) × 256 to 256.
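A minimal PyTorch sketch of this CNN front end is shown below. The 256 kernels of size 1 × 8, the non-overlapping pooling of size 3 and the projection to 256 dimensions follow the text; the absence of padding, the ReLU non-linearity and the exact layout of the 83-dimensional input are assumptions, so the intermediate widths differ slightly from the 83 and 83/3 quoted above.

```python
# Illustrative CNN front end: conv (256 kernels, 1x8) -> non-overlapping
# max-pooling (size 3) -> linear projection down to 256, per frame.
import torch
import torch.nn as nn

class CNNFrontEnd(nn.Module):
    def __init__(self, feat_dim=83, n_kernels=256, proj_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=(1, 8))      # 256 feature maps
        self.pool = nn.MaxPool2d(kernel_size=(1, 3), stride=(1, 3))  # non-overlapping pooling
        pooled = (feat_dim - 8 + 1) // 3                             # width after conv + pool
        self.proj = nn.Linear(n_kernels * pooled, proj_dim)          # projection layer

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        x = x.unsqueeze(1)                      # (batch, 1, frames, feat_dim)
        x = torch.relu(self.conv(x))
        x = self.pool(x)                        # (batch, 256, frames, pooled)
        x = x.permute(0, 2, 1, 3).flatten(2)    # (batch, frames, 256 * pooled)
        return self.proj(x)                     # (batch, frames, 256)

frames = torch.randn(4, 20, 83)                 # 4 chunks of 20 frames each
print(CNNFrontEnd()(frames).shape)              # torch.Size([4, 20, 256])
```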
Before the CNN output is sent to the RNN-LSTM based on memory unit connection, there is a connection layer (to increase the number of outputs). This is because the feature vectors input to a recurrent neural network are generally spliced from neighbouring frames; for example, 5 frames before and after the current frame are typically spliced in temporal order to train the sequence model RNN, whereas the CNN takes its input frame by frame, so the output of the CNN needs to be adapted before being input to the LSTM-RNN. The RNN-LSTM model has 3 LSTM layers, each with 1024 neuron nodes followed by a projection layer with 512 nodes (also for dimensionality reduction). Finally, the output of the last LSTM layer is input into a fully-connected feed-forward neural network (DNN) model with two layers of 1024 hidden nodes each; the activation function is the rectified linear function f(x) = max(0, x), and the DNN result is then passed through a softmax layer for classification and decision.
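The recurrent and feed-forward back end can be sketched in the same way. The three 1024-cell LSTM layers with 512-node projections, the two 1024-node ReLU layers and the softmax output follow the text; the number of output targets, the log-softmax formulation and the use of PyTorch's proj_size option to realise the per-layer projection are assumptions.

```python
# Illustrative LSTM + DNN back end consuming the 256-dim output of the CNN
# sketch above: 3 LSTM layers of 1024 cells projected to 512, two 1024-node
# ReLU layers, then a softmax over the targets. `n_targets` is a placeholder.
import torch
import torch.nn as nn

class LSTMDNNBackEnd(nn.Module):
    def __init__(self, in_dim=256, n_targets=3000):
        super().__init__()
        # 3 LSTM layers, 1024 cells each, with a 512-dim projection layer.
        self.lstm = nn.LSTM(in_dim, 1024, num_layers=3,
                            proj_size=512, batch_first=True)
        self.dnn = nn.Sequential(               # two fully-connected ReLU layers
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, n_targets),
        )

    def forward(self, x):                       # x: (batch, frames, 256)
        x, _ = self.lstm(x)                     # (batch, frames, 512)
        return torch.log_softmax(self.dnn(x), dim=-1)  # per-frame class scores
```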
Through the program modules 206-207, the speech recognition system 200 of the present invention can align the phoneme pronunciations through the monophone model and then force-align the phonemes through the triphone model in combination with the context, thereby improving the accuracy of speech recognition.
In addition, the invention also provides a voice recognition method.
Fig. 4 is a schematic flow chart of the speech recognition method according to the first embodiment of the present invention. In this embodiment, the execution order of the steps in the flowchart shown in fig. 4 may be changed and some steps may be omitted according to different requirements.
Step S301, an acoustic model is constructed, wherein the acoustic model comprises a phoneme training model and a mixed neural network model based on memory unit connection.
Specifically, the server 2 constructs an acoustic model. In this embodiment, the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection (a CLDNN-style automatic speech recognition architecture). The hybrid model integrates a highway long short-term memory recurrent neural network (HLSTM-RNN) based on memory unit connection, a convolutional neural network (CNN), a feed-forward neural network (DNN) and a hidden Markov model (HMM) into one deep hybrid neural network model. The CNN-HMM is used to reduce speaker differences (unlike voiceprint recognition, which focuses on the differences between speakers, speech recognition focuses on the spoken content itself, so speaker differences should be suppressed), the RNN-LSTM-HMM captures the time sequence information of the speech (the historical information within a sentence is used for context modeling), and the DNN-HMM then distinguishes different phonemes and outputs, by classification, the recognized phonemes corresponding to the input speech information. For example, if the user utters only a word pronounced "gong-shi", it is difficult to determine which word is meant, and each user's pronunciation differs to some extent; Mandarin has many homophones of "gong-shi" (formula, work, notice, attack). The RNN-LSTM-HMM captures the time sequence information of the speech; if the captured context is "calculate the maximum value by using a formula", the "gong-shi" in the sentence can be determined to be the word "formula" from the context information around it.
Step S302, when an original voice signal is obtained, preprocessing is carried out on the voice signal to extract an effective voice part.
Specifically, when acquiring an original speech signal, the server 2 preprocesses it. In this embodiment, the server 2 first pre-emphasizes the original speech signal to boost its high-frequency part and flatten the spectrum. The pre-emphasized speech signal is then framed and windowed to convert the non-stationary speech signal into short-time stationary signals. Further, speech and noise are distinguished through endpoint detection, the noise in the short-time stationary signal is removed, and the valid speech portion is extracted. The frequency of the human voice is roughly 65-1100 Hz; in this embodiment, the server 2 may set a preset frequency range within 65-1100 Hz, remove sounds outside the preset frequency range (i.e. noise), and extract the short-time stationary signal within the preset frequency range.
Step S303, extracting acoustic features from the valid speech part as input of the acoustic model.
Specifically, the server 2 extracts acoustic features from the valid speech portion. In this embodiment, the server 2 first performs a Fourier transform on the valid speech portion to convert the speech signal in the time domain into an energy spectrum in the frequency domain. The server 2 then passes the energy spectrum through a set of Mel-scale triangular filter banks to highlight the formant features of the speech. Further, the logarithmic energy output by each filter bank is calculated. After the logarithmic energy calculation, a discrete cosine transform is applied to the energy spectrum output by the triangular filter bank to obtain the MFCCs (Mel-frequency cepstral coefficients), i.e. the MFCC acoustic features.
Step S304, performing phoneme recognition on the acoustic features through the trained phoneme training model, and outputting the recognition result to the trained hybrid neural network model based on memory unit connection.
Specifically, the server 2 performs phoneme recognition on the acoustic features through the trained phoneme training model; phoneme recognition mainly covers the recognition of the words and phrases in a sentence. During speech recognition, according to the occurrence probabilities of the states in the HMM, i.e. the similarity between different pronunciations, the path with the maximum occurrence probability is selected in the decoding network as the final output result.
Step S305, outputting text information corresponding to the voice information according to the received recognition result through the trained memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the text information corresponding to the speech information according to the received recognition result through the trained hybrid neural network model based on memory unit connection. In this embodiment, all nodes of the hybrid neural network model based on memory unit connection are initialized with uniform random weights in the range [-0.05, 0.05], and the biases are initialized to 0. The neural network is trained with the cross-entropy criterion (CE, a training criterion that measures the divergence between the network output and the target) and optimized with truncated back-propagation through time (BPTT). Each segment of the model contains 20 frames of information, and each minibatch contains 40 spoken sentences. Furthermore, for the momentum factor (momentum, a variable that controls the acceleration of neural network training), the first epoch uses 0 and subsequent epochs use 0.9.
Through steps S301-S305, the speech recognition method provided by the present invention first constructs an acoustic model, wherein the acoustic model includes a phoneme training model and a hybrid neural network model based on memory unit connection; then, when an original speech signal is obtained, the speech signal is preprocessed to extract a valid speech portion; further, acoustic features are extracted from the valid speech portion; next, the acoustic features are input into the acoustic model, phoneme recognition is performed on them through the trained phoneme training model, and the recognition result is output to the trained hybrid neural network model based on memory unit connection; finally, the text information corresponding to the speech information is output according to the received recognition result through the trained hybrid neural network model based on memory unit connection. Because the original speech signal is preprocessed, acoustic features are extracted and speech recognition is performed through the acoustic model, the accuracy of speech recognition is improved.
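To make the flow of steps S301-S305 concrete, the sketch below strings together the illustrative functions introduced earlier in this description. It is hypothetical glue code: the models would first have to be trained and loaded, the preprocessing of step S302 would trim the signal before feature extraction, and decode_states_to_text is a placeholder for the HMM decoding network plus lexicon lookup.

```python
# Hypothetical end-to-end driver reusing the earlier sketches (extract_mfcc,
# add_deltas, CNNFrontEnd, LSTMDNNBackEnd). Not the patented implementation.
import numpy as np
import torch

def recognize(signal, sample_rate, cnn, backend, decode_states_to_text):
    feats = add_deltas(extract_mfcc(signal, sample_rate))         # step S303: 39-dim features
    x = torch.from_numpy(feats.astype(np.float32)).unsqueeze(0)   # (1, frames, 39)
    with torch.no_grad():
        scores = backend(cnn(x))                                  # step S304: hybrid network scores
    states = scores.argmax(dim=-1).squeeze(0).tolist()            # most likely target per frame
    return decode_states_to_text(states)                          # step S305: text output

# Usage (with a CNN built for 39-dim features and trained weights loaded):
#   text = recognize(waveform, 16000, CNNFrontEnd(feat_dim=39),
#                    LSTMDNNBackEnd(in_dim=256), my_decoder)
```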
Further, based on the above-described first embodiment of the speech recognition method of the present invention, a second embodiment of the speech recognition method of the present invention is proposed.
Fig. 5 is a flow chart of a speech recognition method according to a second embodiment of the present invention. In this embodiment, the phoneme training model includes a monophone model and a triphone model, and the step of performing phoneme recognition on the acoustic features through the trained phoneme training model and outputting the recognition result to the trained hybrid neural network model based on memory unit connection specifically includes the following steps:
step S401, comparing the similarity of different phone pronunciations according to the acoustic features through the single phone model, and outputting a single-factor alignment result to the three-phone model through the output module 205.
Generally, each person's pronunciation differs, or a local accent results in non-standard pronunciation. Therefore, in the present embodiment, the server 2 compares, through the monophone model and according to the acoustic features, the similarity of different phoneme pronunciations with the dictionary phonemes (standard pronunciations), and outputs a monophone alignment result to the triphone model.
In this embodiment, the monophone model is trained as follows. First, the input acoustic features are normalized (by default, the variance is normalized). Next, an initialized HMM-GMM model and a decision tree are obtained from the acoustic feature data. A training network is then constructed: a phoneme-level FST network is built for decoding each sentence, and during training the feature sequences are repeatedly aligned to accumulate intermediate statistics. The HMM statistics are the occurrence counts of the arcs connecting two phonemes in the FST network, while the GMM statistics are the accumulated feature values and accumulated squared feature values corresponding to each pdf-id, which are used to update the two sufficient statistics of the GMM, the mean and the variance. The decoding network is trained by continuously updating the model in this way. Finally, forced alignment is performed again, either to decode an output result or for the next stage of model training.
Step S402, taking into account, through the triphone model, the influence of the phonemes before and after the current phoneme, and outputting a forced phoneme alignment result.
Specifically, the triphone model aligns the phonemes one by one while taking into account the influence of the phonemes immediately before and after the phoneme currently being aligned, so that a more accurate alignment and a better recognition result are obtained. For example, Mandarin has different characters that share the same syllable (quiet, clean, competitive, all pronounced "jing") as well as homophones such as "gong-shi" (formula, work, notice, attack). Through triphone training, the influence of the preceding and following phonemes of the current phoneme, i.e. its context, can be taken into account, making the recognition of the current phoneme more accurate. For example, suppose the user says "zen-me-li-yong-gong-shi-zheng-ming-deng-shi-cheng-li?" (how to use a formula to prove that the equation holds). When the phoneme currently being recognized is "gong-shi", which has many homophones, "gong-shi" can be determined to mean "formula" from the context provided by the related phonemes "zheng-ming" (prove) and "deng-shi" (equation).
In this embodiment, the triphone model is trained on delta + delta-delta feature transforms, in which first- and second-order delta features are appended to the original MFCC features (delta is the time derivative of the MFCC features, and delta-delta is the second derivative obtained in the same way). The original MFCC feature is 13-dimensional; after the delta + delta-delta features are added, the input feature becomes 39-dimensional.
Step S403, outputting the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
Specifically, the server 2 outputs the forced phoneme alignment result to the memory unit connection-based hybrid neural network model.
In this embodiment, the hybrid neural network model based on memory unit connection processes the forced phoneme alignment result as follows. The forced phoneme alignment result is first output to a CNN model. The CNN model has a convolutional layer containing 256 convolution kernels (each a 1 × 8 matrix), and each kernel generates one feature map for extracting a different feature. The CNN model also has a non-overlapping max-pooling layer whose window size and stride (the step by which the window is moved over the input) are both 3; pooling reduces the dimensionality of the convolutional layer's output. For example, with an 83-dimensional feature vector as input, the parameters become 83 × 256 after the convolutional layer and (83/3) × 256 after the max-pooling layer. Although this is a reduction to one third, the model parameters are still too large for speech recognition, so a projection layer is connected after the max-pooling layer to continue the dimensionality reduction, reducing (83/3) × 256 to 256.
Before the CNN output is sent to the RNN-LSTM based on memory unit connection, there is a connection layer (to increase the number of outputs). This is because the feature vectors input to a recurrent neural network are generally spliced from neighbouring frames; for example, 5 frames before and after the current frame are typically spliced in temporal order to train the sequence model RNN, whereas the CNN takes its input frame by frame, so the output of the CNN needs to be adapted before being input to the LSTM-RNN. The RNN-LSTM model has 3 LSTM layers, each with 1024 neuron nodes followed by a projection layer with 512 nodes (also for dimensionality reduction). Finally, the output of the last LSTM layer is input into a fully-connected feed-forward neural network (DNN) model with two layers of 1024 hidden nodes each; the activation function is the rectified linear function f(x) = max(0, x), and the DNN result is then passed through a softmax layer for classification and decision.
Through steps S401 to S403, the speech recognition method provided by the present invention can align the phoneme pronunciations through the monophone model and then force-align the phonemes through the triphone model in combination with the context, thereby improving the accuracy of speech recognition.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech recognition method applied to a server is characterized by comprising the following steps:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a mixed neural network model, the mixed neural network model comprises a highway long short-term memory recurrent neural network HLSTM-RNN, a convolutional neural network CNN, a feedforward neural network DNN and a hidden Markov model HMM which are connected based on memory units, speaker differences are reduced through the CNN-HMM, time sequence information of the speech is captured through the RNN-LSTM-HMM, context modeling is carried out by utilizing historical information in a sentence, different phonemes are distinguished through the DNN-HMM, and the recognized phonemes corresponding to input speech information are output by classification;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through a trained phoneme training model, selecting a path with the maximum occurrence probability in a decoding network as a recognition result according to the occurrence probability of a state in a Hidden Markov Model (HMM) in the mixed neural network model, outputting the recognition result to an RNN-LSTM model based on memory unit connection in the mixed neural network model, and inputting the output of the last layer of LSTM to the feedforward neural network (DNN);
and outputting text information corresponding to the voice information.
2. The speech recognition method according to claim 1, wherein the step of preprocessing the speech signal to extract an effective speech part when the original speech signal is acquired specifically comprises:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise of the short-time stationary signal, and extracting an effective voice part, wherein the effective voice part is the short-time stationary signal in a preset frequency.
3. The speech recognition method of claim 2, wherein the step of extracting acoustic features from the valid speech portion comprises:
fourier transforming the effective speech portion to convert the speech portion in the time domain to an energy spectrum in the frequency domain;
according to the energy spectrum, highlighting formant features of the voice part through a set of Mel-scale triangular filter banks;
and obtaining acoustic characteristics by performing discrete cosine transform on the energy spectrum output by the triangular filter bank.
4. The speech recognition method of any one of claims 1-3, wherein the phoneme training models comprise a monophonic model and a triphone model, and the selecting the path with the highest probability of occurrence in the decoding network as the recognition result is output to the RNN-LSTM model based on the memory unit connections in the hybrid neural network model further comprises:
comparing the similarity of different phoneme pronunciations according to the acoustic characteristics through the single-phoneme model, and outputting an alignment result to the triphone model;
combining the influence of front and rear related phonemes of the current phoneme through the triphone model, and outputting a forced phoneme alignment result to a CNN model in the mixed neural network model;
and outputting the output result of the CNN model to the RNN-LSTM model.
5. The speech recognition method of claim 4, wherein the acoustic feature is MFCC (Mel frequency cepstrum coefficient).
6. A server, comprising a memory, a processor, the memory having stored thereon a speech recognition system operable on the processor, the speech recognition system when executed by the processor performing the steps of:
constructing an acoustic model, wherein the acoustic model comprises a phoneme training model and a mixed neural network model based on memory unit connection, the mixed neural network model comprises a highway long short-term memory recurrent neural network HLSTM-RNN based on memory unit connection, a convolutional neural network CNN, a feedforward neural network DNN and a hidden Markov model HMM, speaker differences are reduced through the CNN-HMM, time sequence information of the speech is captured through the RNN-LSTM-HMM, context modeling is carried out by utilizing historical information in a sentence, different phonemes are distinguished through the DNN-HMM, and the recognized phonemes corresponding to input speech information are output by classification;
when an original voice signal is acquired, preprocessing the voice signal to extract an effective voice part;
extracting acoustic features from the valid speech portion;
inputting the acoustic features into the acoustic model, performing phoneme recognition on the acoustic features through a trained phoneme training model, selecting a path with the maximum occurrence probability in a decoding network as a recognition result according to the occurrence probability of a state in a Hidden Markov Model (HMM) in the mixed neural network model, outputting the path with the maximum occurrence probability to an RNN-LSTM model based on memory unit connection in the mixed neural network model, and inputting the output of the last layer of LSTM to the feedforward neural network DNN;
and outputting text information corresponding to the voice information.
7. The server according to claim 6, wherein the step of preprocessing the speech signal to extract an effective speech part when the original speech signal is acquired specifically includes:
pre-emphasizing the speech signal to boost high frequency portions in the speech signal;
framing and windowing the speech signal to convert a non-stationary signal to a short-time stationary signal;
and removing the noise of the short-time stationary signal, and extracting an effective voice part, wherein the effective voice part is the short-time stationary signal in a preset frequency.
8. The server according to claim 6, wherein the step of extracting acoustic features from the valid speech portion comprises:
performing a Fourier transform on the valid speech portion to convert the speech portion in the time domain into an energy spectrum in the frequency domain;
highlighting formant features of the speech portion by passing the energy spectrum through a set of Mel-scale triangular filter banks;
and obtaining an acoustic feature by performing a discrete cosine transform on the energy spectrum output by the triangular filter bank, wherein the acoustic feature is a Mel-frequency cepstral coefficient (MFCC) feature.
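A minimal NumPy/SciPy sketch of this feature-extraction chain, continuing from the windowed frames of the preprocessing sketch above: a power spectrum per frame, a bank of Mel-scale triangular filters, and a DCT of the log filter-bank energies that yields the MFCC features. The 512-point FFT, the 26 filters and the 13 retained coefficients are conventional assumptions:

```python
import numpy as np
from scipy.fftpack import dct


def mel_filterbank(num_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale.
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank


def mfcc(frames, sample_rate=16000, n_fft=512, num_filters=26, num_ceps=13):
    # Power spectrum of each windowed frame (time domain -> frequency domain).
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # Mel-scale triangular filter-bank energies, floored to avoid log(0).
    energies = np.maximum(power @ mel_filterbank(num_filters, n_fft, sample_rate).T, 1e-10)
    # DCT of the log energies gives the cepstral coefficients.
    return dct(np.log(energies), type=2, axis=1, norm="ortho")[:, :num_ceps]


if __name__ == "__main__":
    frames = np.random.randn(98, 400)        # windowed frames from the sketch above
    print(mfcc(frames).shape)                 # (98, 13)
```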
9. The server according to any one of claims 7-8, wherein the phoneme training model comprises a monophone model and a triphone model, and the step of selecting the path with the maximum occurrence probability in the decoding network as the recognition result and outputting it to the RNN-LSTM model based on memory unit connections in the hybrid neural network model further comprises:
comparing the similarity of pronunciations of different phonemes according to the acoustic features through the monophone model, and outputting an alignment result to the triphone model;
combining, through the triphone model, the influence of the preceding and following context phonemes of the current phoneme, and outputting a forced phoneme alignment result to a CNN model in the hybrid neural network model;
and outputting the result of the CNN model to the RNN-LSTM model based on memory unit connections.
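A minimal sketch of the path-selection step referenced in the preamble of this claim: a Viterbi search over per-frame HMM state log probabilities that returns the state path with the maximum occurrence probability. The transition matrix, priors and random frame scores are illustrative stand-ins for a real decoding network:

```python
import numpy as np


def viterbi(frame_log_probs, log_trans, log_prior):
    """Return the state path with the maximum total log probability.

    frame_log_probs: (num_frames, num_states) per-frame state log probabilities
                     (e.g. from the hybrid neural network model).
    log_trans:       (num_states, num_states) HMM transition log probabilities.
    log_prior:       (num_states,) initial state log probabilities.
    """
    num_frames, num_states = frame_log_probs.shape
    score = log_prior + frame_log_probs[0]
    backptr = np.zeros((num_frames, num_states), dtype=int)

    for t in range(1, num_frames):
        # Best previous state for every current state.
        candidates = score[:, None] + log_trans            # (prev, cur)
        backptr[t] = np.argmax(candidates, axis=0)
        score = candidates[backptr[t], np.arange(num_states)] + frame_log_probs[t]

    # Trace back the highest-probability path.
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = np.log(rng.dirichlet(np.ones(4), size=10))   # 10 frames, 4 states
    trans = np.log(rng.dirichlet(np.ones(4), size=4))      # row-stochastic transitions
    prior = np.log(np.full(4, 0.25))
    print(viterbi(frames, trans, prior))
```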
10. A computer-readable storage medium storing a speech recognition system executable by at least one processor to cause the at least one processor to perform the steps of the speech recognition method according to any one of claims 1-5.
CN201810227474.8A 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium Active CN108564940B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810227474.8A CN108564940B (en) 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium
PCT/CN2018/102204 WO2019179034A1 (en) 2018-03-20 2018-08-24 Speech recognition method, server and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810227474.8A CN108564940B (en) 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN108564940A CN108564940A (en) 2018-09-21
CN108564940B true CN108564940B (en) 2020-04-28

Family

ID=63531769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810227474.8A Active CN108564940B (en) 2018-03-20 2018-03-20 Speech recognition method, server and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN108564940B (en)
WO (1) WO2019179034A1 (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147775A (en) * 2018-10-18 2019-01-04 深圳供电局有限公司 A kind of audio recognition method neural network based and device
CN111210805A (en) * 2018-11-05 2020-05-29 北京嘀嘀无限科技发展有限公司 Language identification model training method and device and language identification method and device
CN109376264A (en) * 2018-11-09 2019-02-22 广州势必可赢网络科技有限公司 A kind of audio-frequency detection, device, equipment and computer readable storage medium
CN111191668B (en) * 2018-11-15 2023-04-28 零氪科技(北京)有限公司 Method for identifying disease content in medical record text
CN109525787B (en) * 2018-12-13 2021-03-16 南京邮电大学 Live scene oriented real-time subtitle translation and system implementation method
CN109616111B (en) * 2018-12-24 2023-03-14 北京恒泰实达科技股份有限公司 Scene interaction control method based on voice recognition
CN111402870B (en) * 2019-01-02 2023-08-15 中国移动通信有限公司研究院 Voice recognition method, device and equipment
CN109448726A (en) * 2019-01-14 2019-03-08 李庆湧 A kind of method of adjustment and system of voice control accuracy rate
CN109767765A (en) * 2019-01-17 2019-05-17 平安科技(深圳)有限公司 Talk about art matching process and device, storage medium, computer equipment
CN111489745A (en) * 2019-01-28 2020-08-04 上海菲碧文化传媒有限公司 Chinese speech recognition system applied to artificial intelligence
CN109767759B (en) * 2019-02-14 2020-12-22 重庆邮电大学 Method for establishing CLDNN structure applied to end-to-end speech recognition
CN110111774A (en) * 2019-05-13 2019-08-09 广西电网有限责任公司南宁供电局 Robot voice recognition methods and device
CN110189749B (en) * 2019-06-06 2021-03-19 四川大学 Automatic voice keyword recognition method
CN110211591B (en) * 2019-06-24 2021-12-21 卓尔智联(武汉)研究院有限公司 Interview data analysis method based on emotion classification, computer device and medium
CN111127699A (en) * 2019-11-25 2020-05-08 爱驰汽车有限公司 Method, system, equipment and medium for automatically recording automobile defect data
CN112990208A (en) * 2019-12-12 2021-06-18 搜狗(杭州)智能科技有限公司 Text recognition method and device
CN110970036B (en) * 2019-12-24 2022-07-12 网易(杭州)网络有限公司 Voiceprint recognition method and device, computer storage medium and electronic equipment
CN113270091B (en) * 2020-02-14 2024-04-16 声音猎手公司 Audio processing system and method
CN113360869A (en) * 2020-03-04 2021-09-07 北京嘉诚至盛科技有限公司 Method for starting application, electronic equipment and computer readable medium
CN111354344B (en) * 2020-03-09 2023-08-22 第四范式(北京)技术有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN111402891B (en) * 2020-03-23 2023-08-11 抖音视界有限公司 Speech recognition method, device, equipment and storage medium
CN113571054B (en) * 2020-04-28 2023-08-15 中国移动通信集团浙江有限公司 Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN111798841B (en) * 2020-05-13 2023-01-03 厦门快商通科技股份有限公司 Acoustic model training method and system, mobile terminal and storage medium
CN111951796B (en) * 2020-08-19 2024-03-12 北京达佳互联信息技术有限公司 Speech recognition method and device, electronic equipment and storage medium
CN112216270B (en) * 2020-10-09 2024-02-06 携程计算机技术(上海)有限公司 Speech phoneme recognition method and system, electronic equipment and storage medium
CN112651429B (en) * 2020-12-09 2022-07-12 歌尔股份有限公司 Audio signal time sequence alignment method and device
CN112614485A (en) * 2020-12-30 2021-04-06 竹间智能科技(上海)有限公司 Recognition model construction method, voice recognition method, electronic device, and storage medium
CN112885370A (en) * 2021-01-11 2021-06-01 广州欢城文化传媒有限公司 Method and device for detecting validity of sound card
CN113299270A (en) * 2021-05-20 2021-08-24 平安科技(深圳)有限公司 Method, device and equipment for generating voice synthesis system and storage medium
CN113327616A (en) * 2021-06-02 2021-08-31 广东电网有限责任公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN113658599A (en) * 2021-08-18 2021-11-16 平安普惠企业管理有限公司 Conference record generation method, device, equipment and medium based on voice recognition
CN113870848B (en) * 2021-12-02 2022-04-26 深圳市友杰智新科技有限公司 Method and device for constructing voice modeling unit and computer equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10235991B2 (en) * 2016-08-09 2019-03-19 Apptek, Inc. Hybrid phoneme, diphone, morpheme, and word-level deep neural networks
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN107633842B (en) * 2017-06-12 2018-08-31 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN107680582B (en) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, voice recognition method, device, equipment and medium
CN107680602A (en) * 2017-08-24 2018-02-09 平安科技(深圳)有限公司 Voice fraud recognition methods, device, terminal device and storage medium

Also Published As

Publication number Publication date
CN108564940A (en) 2018-09-21
WO2019179034A1 (en) 2019-09-26

Similar Documents

Publication Publication Date Title
CN108564940B (en) Speech recognition method, server and computer-readable storage medium
US8930196B2 (en) System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands
Ghai et al. Literature review on automatic speech recognition
US8762142B2 (en) Multi-stage speech recognition apparatus and method
US9165555B2 (en) Low latency real-time vocal tract length normalization
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
CN109036381A (en) Method of speech processing and device, computer installation and readable storage medium storing program for executing
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
WO2003010753A1 (en) Pattern recognition using an observable operator model
CN106548775B (en) Voice recognition method and system
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
EP1675102A2 (en) Method for extracting feature vectors for speech recognition
US20100057462A1 (en) Speech Recognition
Ranjan et al. Isolated word recognition using HMM for Maithili dialect
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
Boite et al. A new approach towards keyword spotting.
Sinha et al. On the use of pitch normalization for improving children's speech recognition
JP3535292B2 (en) Speech recognition system
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
CN114171009A (en) Voice recognition method, device, equipment and storage medium for target equipment
CN112216270A (en) Method and system for recognizing speech phonemes, electronic equipment and storage medium
Sai et al. Enhancing pitch robustness of speech recognition system through spectral smoothing
Khalifa et al. Statistical modeling for speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant